Question 1: What is the Central Limit Theorem and why is it important?
The Central Limit Theorem (CLT) states that the distribution of the mean of random samples approaches a normal distribution as the sample size grows, regardless of the shape of the population's original distribution, provided the observations are independent and the population has finite variance. This matters because it lets you estimate population parameters, such as average height, from a relatively small sample and attach a margin of error to that estimate when surveying the entire population is impossible.
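To make the theorem concrete, here is a minimal simulation sketch using only the Python standard library. The choice of an exponential population, the sample sizes, and the helper names `sample_means` and `skewness` are assumptions made for illustration. It draws many samples from a strongly skewed population and looks at how the distribution of their means behaves as the sample size grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def sample_means(sample_size, n_samples=5000):
    """Draw n_samples samples from Exponential(rate=1); return each sample's mean."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Sample skewness: roughly 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


# Exponential(1) has mean 1 and standard deviation 1, and is heavily right-skewed.
for n in (1, 5, 50):
    means = sample_means(n)
    print(
        f"n={n:>2}  mean of means={statistics.fmean(means):.3f}  "
        f"spread={statistics.stdev(means):.3f}  skewness={skewness(means):.3f}"
    )
```

As n increases, the mean of the sample means should stay near the population mean of 1, the spread should shrink roughly like 1/sqrt(n), and the skewness should move toward 0, which is the behaviour the CLT describes.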
Question 2: What is the difference between a type I and a type II error?
“A type I error occurs when the null hypothesis is true, but is rejected. A type II error occurs when the null hypothesis is false, but erroneously fails to be rejected.”
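Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test of the null hypothesis "mean = 0" with a known standard deviation; the sample size, the alternative mean of 0.5, and the function names are illustrative choices, not part of the definition above.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the sketch is reproducible

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96 for a two-sided test


def rejects_null(true_mean, n=25, sigma=1.0):
    """Run one z-test of H0: mean == 0 on a fresh sample; True if H0 is rejected."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = fmean(sample) / (sigma / n ** 0.5)
    return abs(z) > Z_CRIT


def rejection_rate(true_mean, trials=4000):
    """Fraction of simulated experiments in which H0 is rejected."""
    return sum(rejects_null(true_mean) for _ in range(trials)) / trials


# Type I error: H0 is true (mean really is 0) but the test rejects it.
type_i_rate = rejection_rate(true_mean=0.0)

# Type II error: H0 is false (mean is 0.5) but the test fails to reject it.
type_ii_rate = 1 - rejection_rate(true_mean=0.5)

print(f"type I rate:  {type_i_rate:.3f}")
print(f"type II rate: {type_ii_rate:.3f}")
```

The type I rate should land close to the chosen significance level ALPHA, while the type II rate depends on the true effect size and the sample size.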
Question 3: What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is the significance of each of these components?
Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors). It produces a regression equation that describes the mathematical relationship between these variables, allowing you to understand how changes in the predictors affect the response.
1. P-values
- Meaning: In regression, the p-value for each independent variable tests the null hypothesis that the variable has no effect (the coefficient is equal to zero).
- Significance: It helps you determine which relationships in your model are statistically significant.
◦ A low p-value (typically < 0.05) indicates that you can reject the null hypothesis: the data provide evidence that the predictor is associated with the response variable. Statistical significance alone does not tell you that the effect is large or causal.
◦ A larger p-value means the data do not provide enough evidence of an association between the predictor and the response, so the variable may not be a necessary part of the model. It does not prove that the predictor has no effect.
2. Coefficients
- Meaning: Regression coefficients represent the mean change in the dependent variable for every one-unit change in the independent variable, while holding other predictors in the model constant.
- Significance: They provide a mathematical magnitude and direction for the relationship.
◦ A positive coefficient indicates that as the predictor increases, the response variable also tends to increase.
◦ A negative coefficient indicates that as the predictor increases, the response variable tends to decrease.
◦ They allow you to estimate how much the response changes with each specific factor, expressed in the units of the variables involved.
3. R-squared
- Meaning: R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It is usually expressed as a percentage (between 0% and 100%).
- Significance: It serves as a measure of goodness of fit.
◦ A higher R-squared indicates that the model fits the data points well and explains a large portion of the variability.
◦ However, a high R-squared doesn't necessarily mean the model is "good" (it could be overfitted), just as a low R-squared doesn't always mean the model is "bad" (especially in fields like psychology where human behavior is inherently hard to predict).
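The three quantities above can be computed by hand for a single predictor. The sketch below uses synthetic data (y = 3 + 2x + noise, an assumption for illustration) and hypothetical helper names. Note one substitution: statistical packages normally report a p-value from a t-test on the coefficient, whereas this sketch uses a permutation test so that it needs only the standard library.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Synthetic data, assumed for illustration: true intercept 3, true slope 2.
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [3 + 2 * x + random.gauss(0, 2) for x in xs]


def fit_line(x, y):
    """Ordinary least squares with one predictor: (intercept, slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1 - ss_res / ss_tot


def permutation_p_value(x, y, n_perm=2000):
    """Share of shuffled datasets whose |slope| is at least the observed |slope|.

    Shuffling y breaks any link with x, which mimics the null hypothesis
    that the coefficient is zero.
    """
    observed = abs(fit_line(x, y)[1])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(shuffled)
        if abs(fit_line(x, shuffled)[1]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


intercept, slope, r_squared = fit_line(xs, ys)
p_value = permutation_p_value(xs, ys)

print(f"coefficient (slope): {slope:.3f}")
print(f"intercept:           {intercept:.3f}")
print(f"R-squared:           {r_squared:.3f}")
print(f"p-value (slope):     {p_value:.4f}")
```

Here the slope estimates the mean change in y per one-unit change in x, R-squared reports the share of the variance in y that the line explains, and the p-value indicates how surprising a slope this large would be if x and y were unrelated.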
Question 4: What are the assumptions required for linear regression?
There are four major assumptions:
1. There is a linear relationship between the dependent variable and the regressors, meaning a linear model is an appropriate form for the data.
2. The errors (estimated in practice by the residuals) are normally distributed and independent of each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity. This means the variance of the errors around the regression line is the same for all values of the predictor variables.
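Some of these assumptions can be checked numerically from the residuals. The sketch below is one possible set of checks on synthetic data, not a standard recipe: it uses the Durbin-Watson statistic for independence, a simple ratio of residual variances for homoscedasticity, and the Pearson correlation between two predictors for multicollinearity. The data and the helper names are assumptions for illustration; normality is usually inspected separately, for example with a Q-Q plot.

```python
import random
from statistics import fmean, pvariance

random.seed(3)  # fixed seed so the sketch is reproducible


def pearson(a, b):
    """Pearson correlation; values near +/-1 between predictors signal multicollinearity."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(x, y):
    """Residuals from a one-predictor least-squares fit."""
    mx, my = fmean(x), fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def durbin_watson(res):
    """Durbin-Watson statistic; near 2 when consecutive residuals are uncorrelated."""
    return sum((b - a) ** 2 for a, b in zip(res, res[1:])) / sum(r ** 2 for r in res)


def variance_ratio(x, res):
    """Residual variance for the upper half of x over the lower half; near 1 if constant."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    return pvariance(ordered[half:]) / pvariance(ordered[:half])


x = [random.uniform(0, 10) for _ in range(400)]
y_ok = [1 + 2 * a + random.gauss(0, 1) for a in x]               # constant noise
y_fan = [1 + 2 * a + random.gauss(0, 0.3 * a + 0.1) for a in x]  # noise grows with x
x2 = [a + random.gauss(0, 0.1) for a in x]                       # near-copy of x

res_ok = residuals(x, y_ok)
res_fan = residuals(x, y_fan)

print(f"Durbin-Watson (constant noise):  {durbin_watson(res_ok):.2f}")
print(f"variance ratio (constant noise): {variance_ratio(x, res_ok):.2f}")
print(f"variance ratio (growing noise):  {variance_ratio(x, res_fan):.2f}")
print(f"correlation between x and x2:    {pearson(x, x2):.4f}")
```

A variance ratio far from 1 points to heteroscedasticity, and a predictor correlation close to 1 suggests that one of the two predictors is redundant.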

