---
tags: homework
---
# Statistics Final Assignment -- Personal
## 1.
> The Crime Rate data set gives a variety of variables by US state at two time points 10 years apart. Rows: Each row represents an individual state in the United States of America. Columns: Each column measures an individual variable thought to affect crime rate, these are measured at the start of the study and then 10 years later. The later observations are recorded in columns suffixed with “10”.
### a.
> Is the Crime Rate (CrimeRate) in the southern states (Southern=1) higher than the other states (Southern=0)?
$\mu_1:$ The mean of Crime Rate in southern states
$\mu_2:$ The mean of Crime Rate in other states
Using Welch's two-sample t-test (unknown, unequal variances).
$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$
```python=
import statistics as st
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
data = pd.read_csv("Crime_R.csv")
# obtain data
arr1 = data.loc[data['Southern'] == 1]
arr2 = data.loc[data['Southern'] == 0]
# perform test
ans = stats.ttest_ind(
arr1['CrimeRate'],
arr2['CrimeRate'],
equal_var=False,
alternative='greater'
)
print(ans)
```
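Running the code above gives this result:
`Ttest_indResult(statistic=-0.4005489620244038, pvalue=0.6545694106774924)`
The test statistic is negative, so the sample mean in the southern states is actually below that of the other states, and the p-value (0.65) is far above any conventional significance level. Therefore, we cannot reject the null hypothesis: the data give no evidence that the crime rate in southern states is higher than in other states.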
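To make explicit what `equal_var=False` does, here is a minimal sketch that computes the Welch statistic, the Welch-Satterthwaite degrees of freedom, and the one-sided p-value by hand, then compares them with `stats.ttest_ind`. The helper name `welch_t_greater` and the toy numbers are illustrative assumptions; they are not taken from `Crime_R.csv`.

```python
import numpy as np
from scipy import stats


def welch_t_greater(x, y):
    """One-sided Welch t-test of H1: mean(x) > mean(y) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # squared standard errors of the two sample means
    vx = x.var(ddof=1) / len(x)
    vy = y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    # upper-tail probability, because H1 is "greater"
    p = stats.t.sf(t, df)
    return t, df, p


# toy numbers for illustration only (not the real data set)
southern = [90.0, 105.0, 98.0, 110.0, 87.0]
other = [102.0, 95.0, 120.0, 99.0, 108.0, 93.0]

t, df, p = welch_t_greater(southern, other)
ref = stats.ttest_ind(southern, other, equal_var=False, alternative='greater')
print(t, df, p)
print(ref)
```

A negative statistic with `alternative='greater'` always yields a p-value above 0.5, which is the pattern seen in the result above.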
### b.
> Have crime rates increased in 10 years (CrimeRate vs. CrimeRate10)?
$\mu_1:$ The mean of Crime Rate
$\mu_2:$ The mean of Crime Rate 10 years later
Using paired t-test.
$H_0: \mu_1=\mu_2$
$H_1: \mu_1<\mu_2$
```python=
import statistics as st
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
data = pd.read_csv("Crime_R.csv")
# perform test
ans = stats.ttest_rel(
data['CrimeRate'],
data['CrimeRate10'],
alternative='less'
)
print(ans)
```
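Running the code above gives this result:
`TtestResult(statistic=0.47082462711212, pvalue=0.6800044630564288, df=46)`
The statistic is positive, so the sample mean of CrimeRate is slightly above that of CrimeRate10, and the p-value (0.68) is far above any conventional significance level. Therefore, we cannot reject the null hypothesis: the data give no evidence that the crime rate increased over the 10 years.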
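The paired t-test is equivalent to a one-sample t-test on the per-state differences, which can help when checking the direction of the alternative. The following is a small sketch on made-up numbers (the arrays `before` and `after` are assumptions for illustration, not values from `Crime_R.csv`).

```python
import numpy as np
from scipy import stats

# toy paired measurements for illustration only
before = np.array([100.0, 95.0, 110.0, 87.0, 102.0, 98.0])
after = np.array([98.0, 97.0, 105.0, 90.0, 99.0, 101.0])

# H1: mean(before) < mean(after), i.e. mean(before - after) < 0
paired = stats.ttest_rel(before, after, alternative='less')

# same test expressed on the differences
diff = before - after
one_sample = stats.ttest_1samp(diff, 0.0, alternative='less')

print(paired)
print(one_sample)
```

Both calls report the same statistic and p-value, so `alternative='less'` refers to the sign of `before - after`.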
### c.
> Divide the education time (Education) into high education time (>13), median education time (>11 and <=13), and low education time (<=11). Are the Crime Rate (CrimeRate) different among these education groups? Assume the population variances of Crime Rate are the same.
$S:$ The set of education groups
$\tau_i:$ The treatment effect of the $i^{\text{th}}$ education group, $i\in S$
$H_0:\forall i,\ \tau_i=0$
$H_1:\exists i,\ \tau_i\neq 0$
Using one-way ANOVA.
```python=
import statistics as st
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
data = pd.read_csv("Crime_R.csv")
# separate data into high, medium, and low education groups
arr1 = data.loc[data['Education'] > 13]
arr2 = data.loc[(data['Education'] > 11) & (data['Education'] <= 13)]
arr3 = data.loc[data['Education'] <= 11]
# perform test
ans = stats.f_oneway(
arr1['CrimeRate'],
arr2['CrimeRate'],
arr3['CrimeRate']
)
print(ans)
```
`F_onewayResult(statistic=0.41265295943293256, pvalue=0.6644267591971134)`
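The p-value (0.66) is well above any conventional significance level, so we cannot reject the null hypothesis: there is no evidence that the mean crime rate differs among these education groups.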
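The three-way split can also be written with `pd.cut`, which keeps the bin edges in one place. This is a sketch on a toy frame; the column `EduGroup` and the values are assumptions for illustration. With the default `right=True`, the bins are $(-\infty, 11]$, $(11, 13]$, and $(13, \infty)$, matching the group definitions in the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy data for illustration only (not the real data set)
toy = pd.DataFrame({
    "Education": [9.5, 10.8, 11.0, 11.2, 12.4, 13.0, 13.1, 14.2, 15.0],
    "CrimeRate": [80.0, 95.0, 102.0, 88.0, 110.0, 97.0, 105.0, 92.0, 99.0],
})

# right-closed bins: <=11 is low, >11 and <=13 is medium, >13 is high
toy["EduGroup"] = pd.cut(
    toy["Education"],
    bins=[-np.inf, 11, 13, np.inf],
    labels=["low", "medium", "high"],
)

# one array of crime rates per education group
groups = [
    g["CrimeRate"].to_numpy()
    for _, g in toy.groupby("EduGroup", observed=True)
]

# one-way ANOVA across the groups
ans = stats.f_oneway(*groups)
print(toy["EduGroup"].value_counts())
print(ans)
```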
### d.
> Is there a relationship between high youth unemployment (HighYouthUnemploy) and southern states (Southern)?
$p_1:$ The proportion of southern states with high youth unemployment
$p_2:$ The proportion of other states with high youth unemployment
**Test 1**
$H_0:p_1=p_2$
$H_1:p_1 < p_2$
**Test 2**
$H_0:p_1=p_2$
$H_1:p_1 > p_2$
Using 2 sample binomial z-test.
```python=
import statistics as st
import seaborn as sns
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
data = pd.read_csv("Crime_R.csv")
# separate data
arr1 = data.loc[data['Southern'] == 1]
arr2 = data.loc[data['Southern'] == 0]
# calculate occurrence of 1s
cnt1 = arr1['HighYouthUnemploy'].sum()
cnt2 = arr2['HighYouthUnemploy'].sum()
# perform test 1
ans1 = proportions_ztest(
[cnt1, cnt2],
[len(arr1), len(arr2)],
alternative='smaller'
)
# perform test 2
ans2 = proportions_ztest(
[cnt1, cnt2],
[len(arr1), len(arr2)],
alternative='larger'
)
print(ans1)
print(ans2)
```
Running the code above will get this result.
```
(-2.7117196055429518, 0.003346759331043149)
(-2.7117196055429518, 0.9966532406689569)
```
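Test 1 gives $z=-2.71$ with a p-value of 0.0033, so we reject $H_0$ in favor of $p_1<p_2$: the proportion of states with high youth unemployment is significantly lower in the southern states than in the other states. Test 2 (p-value 0.9967) gives no evidence for the opposite direction. Hence there is a relationship between high youth unemployment and being a southern state.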
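Because the question asks about a relationship rather than a direction, a chi-square test of independence on the 2×2 table is a natural alternative to the two z-tests. The sketch below uses a toy table (counts are made up for illustration) and computes the pooled z statistic by hand to show that, without continuity correction, the chi-square statistic equals $z^2$ and its p-value equals the two-sided z-test p-value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy data for illustration only: 10 southern states (3 high),
# 20 other states (12 high)
toy = pd.DataFrame({
    "Southern": [1] * 10 + [0] * 20,
    "HighYouthUnemploy": [1] * 3 + [0] * 7 + [1] * 12 + [0] * 8,
})

# 2x2 contingency table of counts
table = pd.crosstab(toy["Southern"], toy["HighYouthUnemploy"])

# chi-square test of independence without Yates' continuity correction
chi2, p, dof, expected = stats.chi2_contingency(table.to_numpy(), correction=False)

# pooled two-proportion z statistic, computed by hand
south = toy.loc[toy["Southern"] == 1, "HighYouthUnemploy"]
other = toy.loc[toy["Southern"] == 0, "HighYouthUnemploy"]
n1, n2 = len(south), len(other)
p1, p2 = south.mean(), other.mean()
pooled = (south.sum() + other.sum()) / (n1 + n2)
z = (p1 - p2) / np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

print(table)
print(chi2, p, dof)
print(z, z ** 2)
```

The chi-square test is two-sided by construction, so the direction of the effect still has to be read from the sign of $z$ or from the sample proportions.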
## 2.
> In the JHU CSSE COVID-19 Dataset
> In the global dataset, is the Fatality Ratio (Case_Fatality_Ratio) on 2022/07/07 (file name 07-07-2022) lower than the Fatality Ratio on 2021/07/07? (file name 07-07-2021)
$\mu_1:$ The mean of fatality ratio on 2021/07/07
$\mu_2:$ The mean of fatality ratio on 2022/07/07
$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$
Using Welch's two-sample t-test (unequal variances).
```python=
import statistics as st
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
data1 = pd.read_csv("20210707.csv")
data2 = pd.read_csv("20220707.csv")
# perform test
ans = stats.ttest_ind(
data1['Case_Fatality_Ratio'].dropna(),
data2['Case_Fatality_Ratio'].dropna(),
equal_var=False,
alternative='greater'
)
print(ans)
```
`Ttest_indResult(statistic=0.2546296007662016, pvalue=0.3995078941229881)`
The p-value (0.40) is above any conventional significance level, so we cannot reject the null hypothesis: there is no evidence that the fatality ratio on 2022/07/07 is lower than on 2021/07/07.
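The two files largely describe the same regions on two dates, so a paired comparison is a possible alternative to the independent two-sample test used above. The sketch below assumes that both files carry a region identifier column named `Combined_Key` (an assumption about the file layout) and runs on toy frames with made-up values rather than the real reports.

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy frames for illustration only; "Combined_Key" is an assumed column name
d21 = pd.DataFrame({
    "Combined_Key": ["A", "B", "C", "D", "E"],
    "Case_Fatality_Ratio": [2.1, 1.5, np.nan, 3.0, 2.4],
})
d22 = pd.DataFrame({
    "Combined_Key": ["A", "B", "C", "D", "F"],
    "Case_Fatality_Ratio": [1.2, 1.1, 0.9, 2.5, 2.0],
})

# keep only regions present on both dates, then drop incomplete pairs
merged = d21.merge(d22, on="Combined_Key", suffixes=("_2021", "_2022"))
merged = merged.dropna(
    subset=["Case_Fatality_Ratio_2021", "Case_Fatality_Ratio_2022"]
)

# H1: the 2021 ratio is greater than the 2022 ratio within the same region
ans = stats.ttest_rel(
    merged["Case_Fatality_Ratio_2021"],
    merged["Case_Fatality_Ratio_2022"],
    alternative='greater'
)
print(merged)
print(ans)
```

Pairing removes between-region variation from the comparison, but it only uses regions that appear with a valid ratio in both files.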