---
tags: homework
---

# Statistics Final Assignment -- Personal

## 1.

> The Crime Rate data set gives a variety of variables by US state at two time points 10 years apart.
> Rows: Each row represents an individual state in the United States of America.
> Columns: Each column measures an individual variable thought to affect crime rate, these are measured at the start of the study and then 10 years later. The later observations are recorded in columns suffixed with “10”.

### a.

> Is the Crime Rate (CrimeRate) in the southern states (Southern=1) higher than the other states (Southern=0)?

$\mu_1:$ The mean of Crime Rate in southern states
$\mu_2:$ The mean of Crime Rate in other states

Using a two-sample t-test for unknown, unequal variances.

$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$

```python=
import pandas as pd
from scipy import stats

data = pd.read_csv("Crime_R.csv")

# obtain data: split the states into southern and other states
arr1 = data.loc[data['Southern'] == 1]
arr2 = data.loc[data['Southern'] == 0]

# perform test: unequal variances, one-sided (mean of arr1 > mean of arr2)
ans = stats.ttest_ind(
    arr1['CrimeRate'],
    arr2['CrimeRate'],
    equal_var=False,
    alternative='greater'
)
print(ans)
```

Running the code above gives the following result.

`Ttest_indResult(statistic=-0.4005489620244038, pvalue=0.6545694106774924)`

The p-value (about 0.65) is far above any conventional significance level such as 0.05. Therefore, we cannot reject the null hypothesis, i.e. there is no evidence that the crime rate in southern states is higher than in other states.

### b.

> Have crime rates increased in 10 years (CrimeRate vs. CrimeRate10)?

$\mu_1:$ The mean of Crime Rate
$\mu_2:$ The mean of Crime Rate 10 years later

Using a paired t-test, since both measurements are taken on the same states.

$H_0: \mu_1=\mu_2$
$H_1: \mu_1<\mu_2$

```python=
import pandas as pd
from scipy import stats

data = pd.read_csv("Crime_R.csv")

# perform test: paired, one-sided (mean of CrimeRate < mean of CrimeRate10)
ans = stats.ttest_rel(
    data['CrimeRate'],
    data['CrimeRate10'],
    alternative='less'
)
print(ans)
```

Running the code above gives the following result.
`TtestResult(statistic=0.47082462711212, pvalue=0.6800044630564288, df=46)`

The p-value (about 0.68) is far above any conventional significance level such as 0.05. Therefore, we cannot reject the null hypothesis, i.e. there is no evidence that the crime rate has increased in 10 years.

### c.

> Divide the education time (Education) into high education time (>13), median education time (>11 and <=13), and low education time (<=11). Are the Crime Rate (CrimeRate) different among these education groups? Assume the population variances of Crime Rate are the same.

$S:$ The set of education groups
$\tau_i:$ The parameter associated with the $i^{\text{th}}$ treatment (education group), called the $i^{\text{th}}$ treatment effect.

$H_0:\forall i\in S,\ \tau_i=0$
$H_1:\exists i\in S,\ \tau_i\neq 0$

Using a one-way ANOVA test.

```python=
import pandas as pd
from scipy import stats

data = pd.read_csv("Crime_R.csv")

# separate data into the three education groups (data itself is left unchanged)
edu = data['Education']
arr1 = data.loc[edu > 13]                  # high: > 13
arr2 = data.loc[(edu > 11) & (edu <= 13)]  # median: > 11 and <= 13
arr3 = data.loc[edu <= 11]                 # low: <= 11

# perform test: one-way ANOVA on CrimeRate across the three groups
ans = stats.f_oneway(
    arr1['CrimeRate'],
    arr2['CrimeRate'],
    arr3['CrimeRate']
)
print(ans)
```

`F_onewayResult(statistic=0.41265295943293256, pvalue=0.6644267591971134)`

The p-value (about 0.66) is far above any conventional significance level such as 0.05, so we cannot reject the null hypothesis, i.e. there is no evidence that the crime rates differ among these education groups.

### d.

> Is there a relationship between high youth unemployment (HighYouthUnemploy) and southern states (Southern)?

$p_1:$ The proportion of southern states with high youth unemployment
$p_2:$ The proportion of other states with high youth unemployment

**Test 1**
$H_0:p_1=p_2$
$H_1:p_1 < p_2$

**Test 2**
$H_0:p_1=p_2$
$H_1:p_1 > p_2$

Using a two-sample binomial z-test.
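For reference, a sketch of the pooled two-proportion z statistic, which is what `proportions_ztest` computes with its default settings. The symbols $x_i$ (number of states with HighYouthUnemploy=1 in group $i$) and $n_i$ (number of states in group $i$) are notation introduced here for illustration, with group 1 being the southern states.

$$
\hat p_i=\frac{x_i}{n_i},\qquad
\hat p=\frac{x_1+x_2}{n_1+n_2},\qquad
z=\frac{\hat p_1-\hat p_2}{\sqrt{\hat p\,(1-\hat p)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}
$$

Under $H_0$, $z$ is approximately standard normal. A negative $z$ means the sample proportion is smaller in the southern states than in the other states.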
```python=
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

data = pd.read_csv("Crime_R.csv")

# separate data into southern and other states
arr1 = data.loc[data['Southern'] == 1]
arr2 = data.loc[data['Southern'] == 0]

# calculate occurrence of 1s (states with high youth unemployment)
cnt1 = arr1['HighYouthUnemploy'].sum()
cnt2 = arr2['HighYouthUnemploy'].sum()

# perform test 1: proportion in southern states < proportion in other states
ans1 = proportions_ztest(
    [cnt1, cnt2],
    [len(arr1), len(arr2)],
    alternative='smaller'
)

# perform test 2: proportion in southern states > proportion in other states
ans2 = proportions_ztest(
    [cnt1, cnt2],
    [len(arr1), len(arr2)],
    alternative='larger'
)
print(ans1)
print(ans2)
```

Running the code above gives the following result (z statistic, p-value).

```
(-2.7117196055429518, 0.003346759331043149)
(-2.7117196055429518, 0.9966532406689569)
```

Test 1 has a p-value of about 0.0033, so its null hypothesis is rejected at conventional significance levels such as 0.05, while Test 2 (p-value about 0.9967) is not rejected. The negative z statistic shows that the sample proportion is smaller in the southern states. Therefore, there is likely a relationship: high youth unemployment is likely less common in southern states than in other states.

## 2.

> In the JHU CSSE COVID-19 Dataset
> In the global dataset, is the Fatality Ratio (Case_Fatality_Ratio) on 2022/07/07 (file name 07-07-2022) lower than the Fatality Ratio on 2021/07/07? (file name 07-07-2021)

$\mu_1:$ The mean of fatality ratio on 2021/07/07
$\mu_2:$ The mean of fatality ratio on 2022/07/07

$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$

Using a two-sample t-test with unequal variances.

```python=
import pandas as pd
from scipy import stats

data1 = pd.read_csv("20210707.csv")
data2 = pd.read_csv("20220707.csv")

# perform test: drop missing ratios, unequal variances, one-sided (2021 mean > 2022 mean)
ans = stats.ttest_ind(
    data1['Case_Fatality_Ratio'].dropna(),
    data2['Case_Fatality_Ratio'].dropna(),
    equal_var=False,
    alternative='greater'
)
print(ans)
```

`Ttest_indResult(statistic=0.2546296007662016, pvalue=0.3995078941229881)`

The p-value (about 0.40) is above any conventional significance level such as 0.05. Therefore, we cannot reject the null hypothesis, i.e. there is no evidence that the fatality ratio on 2022/07/07 is lower than on 2021/07/07.
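The `equal_var=False` argument used in the two-sample t-tests above selects the unequal-variance (Welch) form of the test. As a rough sketch of what that statistic computes, the block below works out the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The helper name `welch_t` and the sample numbers are made up for illustration; they are not taken from the assignment data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean: sample variance (n - 1 denominator) / n.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    # Difference of sample means divided by its estimated standard error.
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up numbers for illustration only; not values from the data files.
ratios_a = [2.0, 4.0, 6.0, 8.0]
ratios_b = [1.0, 2.0, 3.0]
t_stat, df = welch_t(ratios_a, ratios_b)
print(f"t = {t_stat:.4f}, df = {df:.2f}")
```

The one-sided p-value would then come from the t distribution with `df` degrees of freedom, which is the part `stats.ttest_ind` handles in the code above.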