Inferential Statistics

Central Limit Theorem

The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean will approach a normal distribution regardless of the initial distribution of the random variable.

The normal distribution takes on the form:

with the mean and standard deviation given by μ and σ, respectively.

The mathematical definition of the CLT is as follow: for any given random variable X, as approaches infinity,

The CLT provides the basis for much of hypothesis testing, which we will discuss shortly. At a fundamental level, you can consider the implication of this theorem on coin flipping: the probability of getting some number of heads flipped over a large n should be approximately that of a normal distribution. Whenever you are asked to reason about any particular distribution over a large sample size, you should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other distribution.

Uber Interview Question:

Q. Explain the Central Limit Theorem. Why is it useful?

Ans: The CLT states that if any random variable, regardless of distribution, is sampled a large enough number of times, the sample mean will approximately be normally distributed. This allows for studying the properties for any statistical distribution as long as there is a large enough sample size.

Like Uber, any company with a lot of data, this concept is core to the various experimentation platforms used in the product. For a real-world example, consider testing whether adding a new feature increases rides booked in the Uber platform, where each X is an individual ride and is a Bernoulli random variable (i.e., the rider books or does not book a ride). Then, if the sample size is sufficiently large, we can assess the statistical properties of the total number of bookings and the booking rate (rides booked/ rides opened on the app). These statistical properties play a crucial role in hypothesis testing, allowing companies like Uber to decide whether or not to add new features in a data-driven manner.

Hypothesis testing

Problem	Score	Time
What p-value represents?	30	4:58
Choose for the statement	30	6:31
Hypothesis Testing in Salary of Data Scientists	50	23:53

Central limit theorem

Problem	Score	Time
CLT	30	1:57
Mean of sampling distribution	30	4:08
Sampling Distribution Mean	30	3:18
Standard Error of Sampling Distribution	30	3:42
Sampling error and Sampling Size	30	1:24

Distribution analysis: multivariate

Problem	Score	Time
Correlation-analysis	30	2:31
Normal random variable	30	1:50
When multivariate analysis	30	3:36
Multivariate	30	2:30
Dependent variables	30	3:00

Estimation and sampling

Problem	Score	Companies	Time	Status
Number of random samples	30		3:24
Team Selection	30		1:42