---
title: "MLR_lab_CJS"
author: "Simon Sharp, Jon Ekberg, Chia Shen Tsai"
output: html_document
date: "2022-11-13"
---

## Introduction

The central question Spotswood et al. investigated is whether COVID-19 cases are associated with nature accessibility and other sociodemographic characteristics. They used ZIP-Code-scale data to examine whether the negative association between COVID-19 case rates and greenness shown with county-level data in the United States still holds at a finer geospatial level. All variables are measured at the scale of ZIP Code Tabulation Areas (ZCTAs). The primary response variable is the number of COVID-19 cases per 100,000 people between March and September 2020 (retrieved on 1 October 2020). The explanatory variables include measures of greenness, such as the Normalized Difference Vegetation Index (NDVI) and park access, and sociodemographic features, such as the proportion of White people and people of colour (POC), age, income, and population density. Notably, Spotswood et al. excluded rural areas from their analysis, while we kept the rural data in our model. Furthermore, the "Urban" variable is the only binary variable in our dataset.

We hypothesized that more green space and higher income would lead to lower COVID-19 case rates. To test this hypothesis, we used COVID-19 case rates as our response variable and selected median income, proportion of White residents, NDVI, percent parks, median age, and aridity, as well as the dummy variable urban, as our explanatory variables. We selected Arizona, Florida, and Maine as the states on which to conduct our analysis.

## Summary statistics

We chose Florida (n = 912), Arizona (n = 353), and Maine (n = 376) for the analysis. The distributions of the variables differ across states (see Table 1).
Overall, Florida has the highest COVID-19 case rates (mean = 3064.34, median = 2611.34), followed by Arizona (mean = 2645.94, median = 1968.03), and Maine has the lowest (mean = 238.479, median = 149.115). Greenness also differs across states (mean NDVI = 0.32 (AZ), 0.58 (FL), 0.78 (ME)), but the standard deviations (AZ 0.09, FL 0.10, ME 0.04) show that the variation within each state is relatively small. In terms of population structure, Maine has the most White-majority communities, followed by Florida, and Arizona has the fewest. In Arizona, the distribution of COVID-19 cases per 100,000 people is highly positively skewed, with a mean of 2645.94 and a standard deviation of 5618.20. For Maine, population density is the most strongly skewed variable (skewness = 12.558), even though almost all population density values in the dataset are close to zero on a coarse scale.

## Model interpretation and limitations

### Florida Model

For Florida, we built our model on the original data with the dependent variable (Covid19) logged. We used eight explanatory variables: an indicator of whether a ZIP code is urban or rural, median income, proportion of White population, vegetation index, percent park, median age, aridity index, and population density. We logged two of the variables: Covid19 (our dependent variable) and popDens (one of our explanatory variables). The results of the model show that the White population proportion is strongly associated with COVID-19 cases in Florida (see Figure 2). The results also show that the vegetation index (NDVI) and aridity are moderately associated with COVID-19 cases in Florida (see Figure 2). For every 1% decrease in the proportion of White population, there is a 1.35 percent increase in the rate of COVID-19 cases in Florida, holding all other variables constant.
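
As a hedged aside on how percentage statements like these can be derived from a model with a logged response: the sketch below converts a slope coefficient into a percent change in the response, for an unlogged predictor (log-level) and for a logged predictor (log-log). The function names and the coefficient values are hypothetical and chosen only for illustration; they are not the estimates from our models.

```python
import math


def pct_change_log_level(beta, delta=1.0):
    """Percent change in y when x changes by `delta` units,
    for a model of the form log(y) = a + beta * x."""
    return (math.exp(beta * delta) - 1.0) * 100.0


def pct_change_log_log(beta, pct_x=1.0):
    """Percent change in y when x changes by `pct_x` percent,
    for a model of the form log(y) = a + beta * log(x)."""
    return ((1.0 + pct_x / 100.0) ** beta - 1.0) * 100.0


# Hypothetical coefficients (assumptions, not values from this report).
beta_unlogged_predictor = -0.5   # e.g. a predictor entered on its original scale
beta_logged_predictor = 0.3      # e.g. a predictor entered as log(x)

# Effect of a 0.01-unit decrease in the unlogged predictor.
print(pct_change_log_level(beta_unlogged_predictor, delta=-0.01))

# Effect of a 1% increase in the logged predictor.
print(pct_change_log_log(beta_logged_predictor, pct_x=1.0))
```

For small coefficients the log-log result is close to `beta` itself, which is why a log-log slope is often read directly as an elasticity.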
For every 1 unit decrease in the vegetation index, there is a 1.25 unit increase in the rate of COVID-19 cases in Florida, holding all other variables constant. For every 1% decrease in aridity, there is a 1.02 percent increase in the rate of COVID-19 cases in Florida, holding all other variables constant.

The adjusted R^2^ value for our Florida model is 0.2124, which is lower than that of the other two models we created for Florida, but we chose the logged model as our final model because it has the lowest value in the AIC table (2,220.13 versus 16,430.77 for the original and scaled data) (see Table 5). We therefore concluded that the model with two logged variables fits our data best. This model has an F-statistic of 30.79 on 8 and 876 degrees of freedom with a very low p-value of 2.2e-16 (see Figure 2). Since this p-value is less than 0.05, at least one explanatory variable in our model is significantly associated with the response variable.

### Arizona Model

For Arizona, we used the original data and logged the aridity and population density variables to fit the linear regression model. One observation was removed because of an unusual value of 0 for the White population proportion, which is very likely an error in the original data. We selected this model because its adjusted R^2^ (0.458) is the highest compared with the models fitted to scaled data (see Table 7) and its AIC (544.39) is the lowest (see Table 8). The results show that the White population proportion, aridity, and population density are strongly associated with the number of COVID-19 cases in Arizona (see Table 7). Every 1% decrease in the White population proportion is associated with a 1.34% change in the rate of confirmed cases, holding all other variables constant.
Furthermore, every 1% decrease in aridity is associated with a 35% increase in the case rate, and every 1% increase in population density is moderately associated with a 7.26% increase in COVID-19 cases, holding all other variables constant. The percentage of park area is mildly positively associated with COVID-19 case rates in ZIP code areas: every 1% increase in park percentage is associated with a 2.09% increase in COVID-19 case rates, holding all other variables constant. The p-value of the F-statistic is very small (F-statistic = 35.47, df = 8 and 319, p-value < 0.001), indicating that at least one explanatory variable is associated with the response variable.

### Maine Model

For Maine, we used the original data with the response variable (COVID-19 case rate) logged and two explanatory variables (NDVI and population density) logged. The results of the model show that urban versus rural status, median income, NDVI, median age, and population density all have statistically significant associations with COVID-19 case rates (see Table 14). An area being urban as opposed to rural is associated with an 81% increase in COVID-19 case rates, holding all other variables constant. An increase of $1 in median income is associated with a 0.002% increase in COVID-19 case rates, holding all other variables constant. An increase of 1% in NDVI is associated with a decrease of 2.7% in COVID-19 case rates, holding all other variables constant. An increase of one year in median age is associated with a 2.09% increase in COVID-19 case rates, holding all other variables constant. Finally, an increase of 1% in population density is associated with a decrease of 0.13% in COVID-19 case rates.

The adjusted R^2^ value for the model is 0.197, the highest among the three models we constructed (original data, scaled data, logged data). Additionally, when comparing the models using AIC, the logged model has the lowest value (see Table 13).
Therefore, we determined that this model best fits the data. Based on the F-statistic (F-statistic = 8.826, df = 8 and 248, p-value < 0.001), we are able to reject the null hypothesis that all slope coefficients are equal to zero.

### Modelling Limitations

Both the OLS model we used and the negative binomial mixed-effects model Spotswood et al. used are potentially subject to spatial autocorrelation and endogeneity. Spotswood et al. addressed spatial autocorrelation with simultaneous autoregressive (SAR) models. For potential endogeneity, which can arise when important variables are omitted, they used instrumental variable (IV) regression estimated by two-stage least squares.

## Assumptions

For the Florida model, the results of the Breusch-Pagan test (BP = 23.114, df = 8, p-value = 0.003221) indicate that we reject the null hypothesis of homoscedasticity at the 0.05 level (see Table 3), even though the residuals vs. fitted plot for Florida shows an expected mean close to zero (see Figure 3). The results of the VIF test indicate that there is no multicollinearity between the explanatory variables (all VIF scores are lower than 5) (see Table 4).

For the Arizona model, no VIF value is larger than 5, so the VIF test indicates no multicollinearity problem (see Table 9). In terms of heteroskedasticity, although the residuals vs. fitted plot (see Figure 7) shows that the average residual at each fitted value is close to zero, the Breusch-Pagan test indicates a strong possibility that the errors are heteroskedastic (BP = 35.24, df = 8, p-value < 0.001; Table 10). To address this violation, we recomputed the standard errors with heteroskedasticity-consistent standard error estimators (Hayes & Cai 2007); the results are shown in Table 6.
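
The Breusch-Pagan p-values reported in this section can be checked by hand, since the test statistic is compared against a chi-square distribution whose degrees of freedom equal the number of explanatory variables. The sketch below is a minimal illustration using only the Python standard library. It relies on the closed-form chi-square tail probability that holds for even degrees of freedom (df = 8 here); the helper name `chi2_sf_even_df` and the 0.05 threshold are our own assumptions, and the code is not part of the original analysis.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    total = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * total


# (BP statistic, df) pairs as reported in the text of this section.
reported = {
    "Florida": (23.114, 8),
    "Arizona": (35.24, 8),
    "Maine": (10.594, 8),
}
ALPHA = 0.05  # assumed significance level

for state, (bp, df) in reported.items():
    p_value = chi2_sf_even_df(bp, df)
    decision = "reject" if p_value < ALPHA else "fail to reject"
    print(f"{state}: p = {p_value:.4g} -> {decision} homoscedasticity")
```

A p-value below the threshold means the null hypothesis of homoscedasticity is rejected; a larger p-value means it cannot be rejected.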
For the Maine model, the results of the Breusch-Pagan test (BP = 10.594, df = 8, p-value = 0.2258) indicate that we cannot reject the null hypothesis that homoscedasticity is present in the model (see Table 11). This is consistent with the residuals vs. fitted graph for Maine, which shows an expected mean relatively close to zero (see Figure 10). The results of the VIF test indicate that there is no multicollinearity between the explanatory variables (all VIF scores are lower than 5) (see Table 12).

## Conclusions

The results of our models indicate that the explanatory variables have different associations with COVID-19 case rates across the states we examined. For example, while NDVI had a statistically significant negative association with COVID-19 in both Florida and Maine, it did not have a statistically significant association in Arizona. This could be because Arizona is a drier state with less natural greenery. The association between aridity and COVID-19 case rates also differed by state. In Florida, aridity had a statistically significant positive correlation with COVID-19 case rates, while it had a statistically significant negative association with case rates in Arizona. Median age also showed different results across states, with a positive association with COVID-19 case rates in Maine but a negative association in Florida.

Overall, neither variable we were interested in (access to greenery and higher income) showed consistent results across all states examined. This suggests that the pattern we were hoping to see does not exist as clearly at such a fine spatial level as ZIP codes. Our results do not show the same negative correlation between income, green space, and percent POC that Spotswood et al. found in their analysis.

## Appendix

**Table 1. Summary Table for Explanatory Variables**

![](https://i.imgur.com/dYne9St.png)

### Florida Model Figures

**Figure 1.**

![](https://i.imgur.com/RLcedlR.png)

**Florida Log-transformed Variables Scatterplot Matrix**

**Figure 2.**

![](https://i.imgur.com/U6l32ws.png)

**Florida Log-transformed Model**

**Figure 3.**

![](https://i.imgur.com/QaoJacX.png)

**Florida Log-transformed Residual vs Fitted**

**Figure 4.**

![](https://i.imgur.com/bTbUlfO.png)

**Florida Log-transformed Cook's Distance Plot**

**Figure 5.**

![](https://i.imgur.com/FWlC02d.png)

**Florida Log-transformed Breusch-Pagan Test**

**Table 2. Florida Log-transformed VIF Test**

![](https://i.imgur.com/BrOEffH.png)

**Table 3. Breusch-Pagan Table**

![](https://i.imgur.com/CXf65kS.png)

**Table 4. VIF Table of Florida Model**

![](https://i.imgur.com/kPNLYhg.png)

**Table 5. Florida AIC Comparison**

![](https://i.imgur.com/ZG03f3v.png)

**Table 6. Florida Model Comparison**

![](https://i.imgur.com/kgyoLBP.png)

### Arizona Model Figures

**Figure 6.**

![](https://i.imgur.com/iLZ88uP.jpg)

**Arizona's Scatterplot Matrix**

**Figure 7.**

![](https://i.imgur.com/OIU05Oh.png)

**Residual vs Fitted Plot of Arizona Model without Outlier**

**Figure 8.**

![](https://i.imgur.com/lFORAM0.png)

**Cook's Distance Plot of Arizona Model without Outlier**

**Table 7. Arizona Model Results Comparison**

![](https://i.imgur.com/P3RQaoh.png)

**Table 8. Arizona AIC Comparison**

![](https://i.imgur.com/BdxPmj7.png)

**Table 9. Arizona VIF Test**

![](https://i.imgur.com/JpxO18Y.png)

**Table 10. Breusch-Pagan Test of Arizona Model**

![](https://i.imgur.com/yuOpvcj.png)

### Maine Model Figures

**Figure 9.**

![](https://i.imgur.com/oDgxGqR.png)

**Maine Scatter Matrix**

**Figure 10.**

![](https://i.imgur.com/OL3hjTU.png)

**Maine Residual vs. Fitted**

**Table 11. Maine BP Test**

![](https://i.imgur.com/95378WK.png)

**Table 12. Maine VIF Test**

![](https://i.imgur.com/ThRVhaa.png)

**Table 13. Maine AIC Comparison**

![](https://i.imgur.com/m2Zh3nM.png)

**Table 14. Maine Model Comparison**

![](https://i.imgur.com/SsMVn0y.png)

## Reference

Hayes, Andrew F., and Li Cai. "Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation." Behavior Research Methods 39.4 (2007): 709-722.

### Division of labor

We each ran all tests and created all graphs and tables for one state: (1) Florida: Jon; (2) Maine: Simon; (3) Arizona: Chia Shen. We all worked together on the writing of the report, the stitching together of the code, and the final troubleshooting and knitting.
