Inferential Statistics

Central Limit Theorem

The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean will approach a normal distribution regardless of the initial distribution of the random variable.

The normal distribution takes on the form:

with the mean and standard deviation given by μ and σ, respectively.

The mathematical definition of the CLT is as follow: for any given random variable X, as approaches infinity,

The CLT provides the basis for much of hypothesis testing, which we will discuss shortly. At a fundamental level, you can consider the implication of this theorem on coin flipping: the probability of getting some number of heads flipped over a large n should be approximately that of a normal distribution. Whenever you are asked to reason about any particular distribution over a large sample size, you should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other distribution.

Uber Interview Question:

Q. Explain the Central Limit Theorem. Why is it useful?

Ans: The CLT states that if any random variable, regardless of distribution, is sampled a large enough number of times, the sample mean will approximately be normally distributed. This allows for studying the properties for any statistical distribution as long as there is a large enough sample size.

Like Uber, any company with a lot of data, this concept is core to the various experimentation platforms used in the product. For a real-world example, consider testing whether adding a new feature increases rides booked in the Uber platform, where each X is an individual ride and is a Bernoulli random variable (i.e., the rider books or does not book a ride). Then, if the sample size is sufficiently large, we can assess the statistical properties of the total number of bookings and the booking rate (rides booked/ rides opened on the app). These statistical properties play a crucial role in hypothesis testing, allowing companies like Uber to decide whether or not to add new features in a data-driven manner.

Hypothesis testing

Problem	Score	Time
What p-value represents?	30	4:25
Choose for the statement	30	5:13
Hypothesis Testing in Salary of Data Scientists	50	31:26

Central limit theorem

Problem	Score	Time
CLT	30	1:48
Mean of sampling distribution	30	3:57
Sampling Distribution Mean	30	2:46
Standard Error of Sampling Distribution	30	3:37
Sampling error and Sampling Size	30	1:21

Distribution analysis: multivariate

Problem	Score	Time
Correlation-analysis	30	2:10
Normal random variable	30	1:30
When multivariate analysis	30	3:18
Multivariate	30	2:14
Dependent variables	30	2:51

Estimation and sampling

Problem	Score	Companies	Time	Status
Number of random samples	30		2:57
Team Selection	30		1:02