# Assignment 4: Statistics
![Statistics meme](https://i.imgur.com/rRpLck9.jpg)
**DUE DATE: 03/25/2024**
## Overview
This assignment consists of two parts, one on statistical tests and another on regression models. Accept [the handout](https://classroom.github.com/a/YhVje6R6), complete the TODOs in the files `regression.py` and `stats_tests.py` and write answers to the written questions in `writeup.md`.
When submitting your files on Gradescope, the autograder will run for `regression.py` and `stats_tests.py`. To ensure compatibility with the autograder, you should not modify the stencil unless instructed otherwise. For this assignment, please write your solutions in the respective .py files and `writeup.md`. Failing to do so may hinder with the autograder and result in a low grade.
## Part 1: Regression
Play around with [this OLS visualization](https://seeing-theory.brown.edu/regression-analysis/index.html#section1) and answer the following questions in `writeup.md`. Move the data points around and see how it affects the best-fit regression line.
1. In general, for what distribution of data points is the SSE minimized and why is it minimized in this scenario?
2. What can you say about the relationship between the $\hat{\beta_0},\hat{\beta_1}, \bar{x}, \bar{y}$?
### Linear Regression.
Open up the file `regression.py` and complete the function `regression()`. This function takes in two-dimensional data and creates a best-fit regression line of the form $Y=mX+b$, where $m$ is the gradient and $b$ is the intercept value by splitting into training and testing data. In this function, you are only permitted to use Statsmodels' OLS package and the functions provided in `util.py`. Make sure to return the training MSE (mean-squared error) by using `eval_measures` and R-squared using `r2_score`.
To check that you've implemented manual regression correctly, create an array $X1$ of 200 equally-spaced values from $0$ to $200$ using [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html), Note: don't worry about floating point values making the output off past the hundreths place. Then create another array $Y1$ with 200 values that increases from $0$ to $398$ with a gradient of $2$ using [`np.arange`](https://numpy.org/doc/stable/reference/generated/numpy.arange.html). Finally run `regression()` on this set of points: you should get a $b$ value of about $0$ and an $m$ value of about $2$ (within a few decimals).
Then complete the `linear_fit_func()` function, which takes in an array $x$, the gradient $m$ and intercept $b$ of the predicted model and outputs the form of the model you want to check: here for a line we have $Y=mX+b$.
Another way of conducting a regression is using SciPy's `optimize.curve_fit()` with `linear_fit_func()`in the first argument. Learn more about [`curve_fit()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) here. Run a regression with `curve_fit()` with the dummy datapoints earlier and check that you get the same linear line.
### Nonlinear Regression.
Not all data fits a linear trend: what happens if we run `curve_fit()` from above with the same linear fit function but on a dataset that is clearly not linear in trend?
Create two arrays $X_2, Y_2$ this time with 200 equally-spaced $x$-values from 0 to 1 and $y$-values that follow the exponential function $Y=2e^{4x}$ and run `curve_fit()` on these arrays using the linear fit function — you should get an error or results that don't make sense.
Using what you've learned about `curve_fit()`, fill out the function `exp_fit_func()` so you can create best-fit lines for functions of the form above. Your function can take in $A,B$ for a curve of the form $Y=Ae^{Bx}+c$.
### Polynomial Fit Regression Models.
In reality, we might not have a good idea of what the best-fit regression model of a given dataset would look like, so it might be difficult to use `curve_fit()`. A tool we can use is NumPy's `polyfit()` function that can estimate a best-fit degree $d$ polynomial (a polynomial of degree $d$ is a function of the form $a_dx^d+a_{d-1}x^{d-1}+...+a_1x_1+a_0$).
Use `polyfit()` to estimate the best-fit polynomial for both $X_1, Y_1$ and $X_2, Y_2$ as datapoints. In `writeup.md`, explain whether using [`curve_fit()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) or `polyfit()` works better for each set of data and why this is the case for that dataset.
### Evaluating Regression Models.
Now that we've seen how to use SciPy's `curve_fit()` to fit data by specifying different fitting functions, we want a way of comparing how different regression models perform for a given dataset.
Complete the `mse()` function, which takes in the predicted values given by the best fit line and original datapoints, then outputs the mean-squared error (MSE) of the data to its best fit line. Here, optimize your code by taking advantage of NumPy operations.
The mean-squared error is the most common 'error-function' used for regressions, but in some situations we use other measures of evaluation for regression models such as the MAE (mean absolute error) when we think wrong predictions should be penalized linearly (instead of quadratically). Others use RMSE (root mean-squared error) to reduce how fast the error grows as well. To read more about error functions and why we use different ones, [click here](https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3).
## Part 2: Hypothesis Testing
![](https://i.imgur.com/c1rR4Nu.jpg)
Play around with [this Hypothesis Testing visualization](https://bcdudek.net/betaprob/) by selecting different parameters. Answer the following questions in `writeup.md`.
- Keeping all other parameters equal, what trends do you observe as you increase the standard error for the mean? Why do you think this happens?
- What about decreasing the Type 1 error rate?
Fill in the TODOs in `stats_tests.py` to complete the functions `one_sample_ttest()` `two_sample_ttest()` `paired_ttest()` and `chisquared_independence_test()`.
In `writeup.md` indicate what type of test matches best with each scenario below, and then generate its p-value with the appropriate test in `run_tests.py`. To run a specific test, run `python run_tests.py -s=<insert_tests_scenario_number_here>`.
1. A factory calibrates their milk bottle-filling machine to 150ml each and wants to ensure there is no significant deviation in the volume of milk in the bottles they produce. Out of 105 bottles, 5 of them did not have available data, 35 had 150ml, 25 had 135ml, and 40 had 160ml. How likely is their machine calibrated wrongly, given these results?
2. We measure the grams of protein in two different brands of energy bars by sampling from 10 bars in each brand. Brand A had 12, 14, 13, 11, 12, 12, 11, 14, 17, 10 grams in each of their bars, whereas Brand B had 10, 10, 10, 12, 18, 18, 19, 12, 11, 10 grams each. How likely are the average grams of protein in the two brands different from one another?
3. A study wants to find out if there is a correlation between someone being exposed to Disease X and getting the disease or vice versa. They record 15 exposed participants having the disease, 5 unexposed participants having the disease, 20 unexposed not having the disease, and 7 exposed to the disease not getting it.
4. A hospital wants to check if their treatment was successful in changing cholesterol levels for their patients. They tested the patients cholesterol levels in 2010 and then retested the same patients in 2011. In 2010 the 10 patients cholesterol levels were 110mg, 115mg, 150mg, 120mg, 99mg, 132mg, 182mg, 183mg, 89mg, and 91mg. In 2011 the same 10 patients cholesterol levels were 84mg, 88mg, 88mg, 89mg, 90mg, 90mg, 91mg, 90mg, 90mg, 90mg.
## Part 3: Additional Topics
### Linear Regression with Robust Standard Errors
To run linear regression, we need to make sure that the assumptions of the model (linearity, homoscedasticity, independence, and normality) hold true in our dataset. In this section, we are going to talk about the "homoscedasticity" model assumption.
A homoskedastic linear regression model is characterized by having a contant error regardless of the value of the independent variable. In other words, the variance of the error of the model should be the same accross values of the independent variable. In a scatterplot, homoscedasticity can be observed when data points with the same independent value are as spread out from their mean as any other group of data points that share a different independent variable value.
Heteroskedasticity is when the model's error has a variable variance. In a scatterplot, we can visualize heteroskedasticity when we observe a fan-out or fan-in shape formed by the data points. Feel free to read more on heteroskedasticity vs. homoscedasticity in [this](https://medium.com/@sandhyaselvamani248/heteroskedasticity-vs-homoskedasticity-assumption-of-linear-reg-3fc0f3cfbbe7) article.
Running standard linear regression on a dataset with heteroskedasticity can affect the validity of our model. To correct for this, we should account for heteroskedasticity when training the linear regression model. This procedure is called running linear regression with robust standard errors. Feel free to read more about how this is done in [this](https://spureconomics.com/robust-standard-errors-and-ols-standard-errors/#:~:text=With%20heteroscedasticity%20or%20non%2Dconstant,White's%20Heteroscedasticity%20Consistent%20Standard%20Errors.) article.
We will now see linear regression with robust standard errors in action! We will be analyzing a dataset about income versus expenditure to observe how heteroskedaticity impacts OLS.
**In `writeup.md`, answer the following questions.**
Run `python robust_linear_regression.py` to display a scatterplot highlighting Total Household Income against Miscellaneous Goods and Services Expenditure. The scatterplot will be saved in a file called `heteroskedasticity_scatter.png`.
1. What do you notice about the relationship between Total Household Income and Miscellaneous Goods and Services Expenditure? What issues might this pose when using OLS on this dataset?
2. Using what you have learned above about robust standard errors, how can we address the concerns identified in (1)? Why?
Next, run
```python
python robust_linear_regression.py ols
python robust_linear_regression.py ols_robust
```
This will display the OLS regression results without and with robust standard errors respectively. We will be using [HC1 Robust Standard Errors](https://spureconomics.com/robust-standard-errors-and-ols-standard-errors/#:~:text=HC1%20Robust%20Standard,non%2Dconstant%20variance.).
3. How does the confidence interval of the coefficients differ, if at all, between the two summaries? Which option is more likely to be accurate, and why?
<!-- ### Multiple Hypothesis Testing
Consider flipping a fair coin 10 times. What are the chances that we land on heads all 10 times? Though this number may be low, as we repeat the trial thousands of times, we inevitably will run into cases when we in fact do land on heads 10 times.
A similar issue arises with multiple hypothesis testing, i.e. running multiple statistical tests on data. This causes a problem because the more tests we run the more likely it is that we get a statistically significant difference due to chance (i.e. making a Type I error). Look at the case where we have 20 hypotheses to test and a significance level of 0.05. We find that the probability of at least one significant result rises to roughly 64%.
$P(\text{at least one significant result}) = 1 - P(\text{no significant results})$
$= 1 - (1 - 0.05)^{20}$
$\approx 0.64$
In order to mitigate the probability of getting a random statistical difference, we can adjust the alpha level of our multiple hypothesis tests. There are several methods to do this, including the Bonferroni correction.
#### In `writeup.md`, answer the following questions:
Question 1: Run the file `multi_testing.py` and analyze the outputted bar graphs. What do you notice about the p-values before and after multi-hypothesis test correction? What differences are there in the statistically significant test results? Provide a brief explanation for why these differences occur.
Question 2: In the provided stencil code, we used Bonferroni correction. This is computed as yielding a new alpha of $\frac{\alpha}{n}$, for our current rejection threshold of $\alpha$ and the number of tests $n$ (or an inverse operation to increase the value of the p-values). Consider a different correction method and compare its results to Bonferroni correction (consider the [statsmodels multitests documentation](https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html)).
> **Hint**: Based on the relatedness of features, are there some correction methods that are better suited to reduce errors caused by multiple hypothesis testing? -->
## Part 4: Socially Responsible Computing
Please write your answers to the following questions in `writeup.md`.(20 points)
### Same Data, Different Story
While performing regression analysis, data scientists make several choices that require more than the data alone. Understanding the context of data is necessary to avoid statistical misinterpretation (along with many other issues!).
The goal of the following questions is to:
- Recognize how technical choices in an analysis can be used to generate different conclusions
- Demonstrate the importance of context to decisions during the data analysis process
Your responses should be thoughtful, provide justification for your claims, and be concise but complete. See the response guide for more guidance.
### Questions
> A note on grading: we’re not trying to see how much of these readings or resources you’ve been able to recall. We are grading based on personal reflection of the work you have done / will be doing and consideration of new points of view.
> Pro tip: we love examples and actionable steps!
1. Watch [this video](https://www.youtube.com/watch?v=ebEkn-BiW5k) explaining a statistical phenomenon called Simpson's Paradox. Explain what you learnt in your own words. How do you see this being an issue with any projects you will be working on in the future. (It is okay to imagine a fictional situation - we’re grading based on reflection and not on accuracy).
2. Data scientists make many decisions throughout the analysis process that can lead to a variety of results. Read the Intro and Part 1 of [this article](https://fivethirtyeight.com/features/science-isnt-broken/#part1) and experiment with the ["Hack Your Way To Scientific Glory" interactive](https://projects.fivethirtyeight.com/p-hacking/). By manipulating the variables to get "Publishable" and “Unpublishable...”
- State the statistical results (in terms of p-values) supporting the claim that “Democrats are bad for the economy”, describing the terms you used to get this result.
- State the statistical results (in terms of p-values) supporting the claim that “Republicans are bad for the economy”, describing the terms used.
- Write a short reflection about your above experiments. Whether specific to the data or broadly applicable to statistical analysis of all kinds, what should you avoid or emphasize if you present or publish the results of such an analysis?
<!-- (2 points) Fill out this anonymous mid-semester feedback form for the Socially Responsible Computing (SRC) component of assignments. Follow instructions at the end of the form to get credit for filling it out. -->
---
Made by Joanna, Joshua, and Kazen in Spring 2023