owned this note
owned this note
Published
Linked with GitHub
---
title: "MLR_lab_CJS"
author: "Simon Sharp, Jon Ekberg, Chia Shen Tsai"
output: html_document
date: "2022-11-13"
---
## Introduction
The central question Spotwood et al. investigated is the existence of an association between COVID-19 cases, nature accessibility, and other sociodemographic characteristics. They attempted to utilize ZIP-Code-scale data to examine whether the negative association between COVID- 19 case rates and greenness shown with county-level data in the United States still holds at the finer geospatial level.
The variables are all in the scale of ZIP Code Tabulation Areas (ZCTAs). The primary response variable is the number of COVID cases per 100,000 people between March and September 2020 (fetched on 1 October 2020). The explanatory variables include greenness, such as the Normalized Difference Vegetation Index (NDVI) and park access, and sociodemographic features, such as the proportion of White people and people of colour (POC), age, income, and population density. Notably, Spotswood et al. excluded rural areas from their analysis while we kept the rural data in our model. Furthermore, the "Urban" variable is the only binary variable in our dataset.
We hypothesized that more green space and higher income would lead to lower Covid-19 case rates. In order to test this hypothesis, we used Covid-19 case rates as our response variable, and selected median income, proportion of white residents, NDVI, percent parks, median age, and aridity, as well as the dummy variable urban, as our explanatory variables. We selected Arizona, Florida, and Maine as the states on which to conduct our analysis.
<!--
About ZCTAs:
https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html
-->
## Summary statistics
We chose Florida (n=912), Arizona (n=353), and Maine (n=376) as the analysis target. The distributions of variables across different State varied (See Table1). Overall, Florida has the most COVID-19 cases rates (mean = 3064.34, median = 2611.34), Arizona the second (mean = 2645.94, median = 1968.03), and Maine the least (mean = 238.479, median = 149.115). The greenness also differ across states (mean = 0.32(AZ), 0.58(FL), 0.78(ME)), but the standard deviations (AZ 0.09, Fl 0.10, ME 0.04) show that the difference among each state is relatively small.
In terms of population structure, Maine has more White Majority communities, Florida the second, and Arizona the least. In Arizona, the distribution of COVID-19 cases per 100,000 people is highly positively skewed, having a mean of 2645.94 and a standard deviation of 5618.20. For Maine, the distribution of population density is the variable that skewed the most significantly (skewness = 12.558) while all the population density values across the dataset is close to zero in the rough scale.
## Model interpretation and limitations
### Florida Model
For Florida, we created our model using the original data with the dependent variable (Covid19) logged. Our group used 8 explanatory variables to describe our dependent variable; we used the variable to describe whether a zip code is urban or rural, we used median income, proportion of white population, vegetation index, percent park, median age, aridity index, and population density. We logged two of the variables: Covid19 (our dependent variable) and popDens (one of our explanatory variables). The results of the model show that the White population proportion is strongly with covid 19 cases in Florida (See Figure 2). The results of the model also show that the vegetation index (NDVI) and aridity is moderatly associated with covid-19 cases in Florida(See Figure 2). For every 1% decrease in proportion white population there is a 1.35 percent increase in the rate of covid 19 cases in the state of Florida holding all other variables constant. For every 1 unit decrease in the vegetation index there is a 1.25 unit increase in the rate of covid 19 cases in the state of Florida holding all other variables constant. And for every 1% decrease in aridity there is a 1.02 percent increase in the rate of covid 19 cases in the state of Florida holding all other variables constant.
The adjusted R^2^ value for our Flroida model is 0.2124, which is lower than the other two models we created for Florida, but we decided to use the logged data for our final model since this model has the lowest value in the AIC table (a value of 2,220.13 versus 16,430.77 for the original and scaled data) (See Table 5). Therefore we decided that this model with two of the variables logged is the model that fits our data best. This model has a F-statistic of 30.79 on 8 and 876 degrees of freedom with a very low p-value of 2.2e-16 (See Figure 2). Since this p-value is less than 0.05 there is at least one explanatory variable in our model that is statistically significant with our response variable.
### Arizona Model
For Arizona, we used the original data and logged the aridity and the population density variables to establish the linear regression model. One observation was removed due to the unusual occurance of 0 in White population proportion, which is highly possible a mistake from the original data. We selected the model because the adjusted R^2^ (0.458) is the highest compared to models established with scaled data (See Table 7) and the AIC (544.39) is the lowest (see Table 8).
The results show that the White population proportion, the aridity and the population density are strongly associated with the number of COVID-19 cases in Arizona (See Table 7). Every 1% decrease in the White population proportion is associated with a 1.34% change in the rate of confirmed cases, holding all other variables constant. Furthermore, every 1% decrease in aridity also asscoiated with a 35% increase in the amounts, and every 1% increase in the population density is moderately associated with a 7.26% increase in COVID-19 cases, holding all other variables constant. The percentage of park area is mildly positively associated with the change in COVID-19 case rates in ZIP code areas. Every 1% increase in the percentage is associated with 2.09% increase in COVID-19 case rates, holding all other variables constant. The p-value of F-statitics is remarkably small (F-statistic=35.47, df=8 and 319, p-value < 0.001), indicating there is at least one explanatory variable associated with the change of the response variable.
### Maine Model
For Maine, we used the original data with the response variable logged (Covid-19 case rate), and two explanatory variables logged (NDVI and Population Density). The results of the model show that whether a zip code is urban or rural, the median income, and NDVI, median age, and the population density all have statistically significant impacts on Covid-19 case rates (see Table 14). An area being urban as opposed to rural was associated with an 81% increase in Covid-19 case rates, holding all other variables constant. An increase of Median Income of $1 is associated with a 0.002% increase in Covid-19 case rates, holding all other variables equal. An increase in NDVI of 1% was associated with a decrease of 2.7% in Covid-19 case rates, holding all other variables equal. An increase of one year in median age is associated with a 2.09% increase in Covid-19 case rates, holding all other variables constant. Finally, an increase of 1% in population density is associated with a decrease of 0.13% in Covid-19 case rates.
The adjusted R^2^ value for the model was 0.197, which was the highest among the three models I constructed (original data, scaled data, logged data). Additionally, when comparing the models using AIC, the logged model has the lowest value (see Table 13). Therefore, we determined that this model best fit the data. Based on the F-statistic (F-statistic=8.826, df=8 and 248, p-value<0.001) we are able to reject the null hypothesis that all slope coefficients are equal to zero.
### Modelling Limitations
The OLS model we used and the negative binomial mixed effect model Spotwood et al. used are potentially confronted with the spatial autocorrelation and endogeneity issues. Spotwood et al. attempted to fix the spatial autocorrelation issue with simultaneous autoregressive models (SAR). For potential endogeneity problem which may omit important variables, they incorporated the Instrument Variable (IV) regression using the two-stage least squares method.
## Assumptions
For the Florida model, the results of the Breusch-Pagan test (BP=23.114, df=8, p-value=0.003221) indicate that we cannot reject the null hypothesis that homoscedasticity is present in the model (See Table 3). This result can be confirmed with the residual vs fitted plot for Florida which shows an expected mean close to zero(See Figure 3). The results of the VIF test indicate that there is no multicollinearity between the explanatory variables (VIF scores all lower than 5) (See Table 4).
For the Arizona model, given that there is no number larger than 5, the result of VIF test indicates there is no multicollinearity problem (see Table 9). In terms of heteroskedasticity, although the residual v.s. fitted plot (See Figure 7) shows the average of the expected value on each fitted point is close to zero, the Breush-Pagen test manifests a strong possibilty that the variables are heteroskedastic (BP=35.24 ,df=8, p-value<0.001, Table 10). To address the violation of assumption, we transformed the standard error with heteroskedasticity-consistent standard error estimators (Hayes & Cai 2007), the results are shown in Table 6.
For the Maine model, the results of the Breusch-Pagan test (BP=10.594, df=8, p-value=0.2258) indicate that we cannot reject the null hypothesis that homoscedasticity is present in the model (see Table 11). This is consistent with the residuals vs. fitted graph for Maine, which shows an expected mean relatively close to zero (see Figure 10). The results of the VIF test indicate that there is no multicollinearity between the explanatory variables (VIF scores all lower than 5) (see Table 12).
## Conclusions
The results of our models indicated different impacts of explanatory variables across the states we examined. For example, while NDVI had a statistically significant negative association with COVID-19 in both Florida and Arizona, it did not have a statistically significant impact in Arizona. This could be due to the fact that Arizona is a dryer state with less natural greenery. Aridity also had a different impact of COVID-19 case rates based on the state. In Florida, aridity had a statistically significant positive correlation with COVID-19 case rates in Florida, while it had a statistically significant negative association with case rates in Arizona. Median age also showed different results across different states, with a positive association with COVID-19 case rates in Maine, but a negative association in Florida.
Overall, neither variable we were interested in (access to greenery and higher income) showed to have consistent results across all states examined. This suggests that the pattern we were hoping to see does not exist as clearly at such a fine spatial level as zip codes. This does not show the same negative correlation between income, green space, and percent POC that Spotswood et. al. found in their analysis.
## Appendix
**Table 1. Summary Table for Explanatory Variables**

### Florida Model Figures
**Figure 1.**

**Florida Log-transformed Variables Scatterplot Matrix**
**Figure 2.**

Florida Log-transformed Model
**Figure 3.**

**Florida Log-transformed Residual vs Fitted**
**Figure 4.**

**Florida Log-transformed Cook's Distance Plot**
**Figure 5.**

**Florida Log-transformed Breusch Pagan Test**
**Table 2. Florida Log-transformed VIF Test**

**Table 3. Breusch Pagan Table**

**Table 4. VIF Table of Florida Model**

**Table 5. Florida AIC Comparison**

**Table 6. Florida Model Comparison**

### Arizona Model Figures
**Figure 6.**

**Arizona's Scatterplot Matrix**
**Figure 7.**

**Residual v Fitted Plot of Arizona Model without Outlier**
**Figure 8.**

**Cook's Distance Plot of Arizona Model without Outlier**
**Table 7. Arizona Model Results Comparison**

**Table 8. Arizona AIC Comparison**

**Table 9. Arizona VIF test**

**Table 10. Breusch-Pagen Test of Arizona Model**

### Maine Model Figures Figures
**Figure 9.**

Maine Scatter Matrix
**Figure 10.**

Maine Residual vs. Fitted
**Table 11. Maine BP Test**

**Table 12. Maine VIF Test**

**Table 13. Maine AIC Comparison**

**Table 14. Maine Model Comparison**

## Reference
Hayes, Andrew F., and Li Cai. "Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation." Behavior research methods 39.4 (2007): 709-722.
<!--
Aricle Link: https://link.springer.com/content/pdf/10.3758/BF03192961.pdf
-->
### Division of labor
We each ran all tests and created all graphs and tables for one state.
(1) Florida: Jon
(2) Maine: Simon
(3) Arizona: Chia Shen
We all worked together on the writing of the report, the stitching together of the code, and the final troubleshooting and knitting.
<!--
# Code
## Communal Code
### Common Environment
```{r libraries, warning= FALSE, message = FALSE}
# Require Packages
library(GGally)
library(devtools)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(gt)
library(knitr)
library(moments)
library(tidyverse)
library(raster)
library(janitor)
library(sp)
library(rgdal)
library(spdep)
library(car)
library(showtext)
library(olsrr)
library(IDPmisc)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lmtest)
library(sandwich)
rm(list = ls())
```
### data preparation: extract
```{r read and scale, warning= FALSE, message = FALSE}
states<-readOGR("Covid_Nature_Data_FINAL.shp")
states$NDVI_scaled <- as.vector(scale(states$NDVI))
states$PctPark_scaled <- as.vector(scale(states$PctPark))
states$popDens_scaled <- as.vector(scale(states$popDens))
states$medAge_scaled <- as.vector(scale(states$medAge))
states$Days_scaled <- as.vector(scale(states$Days))
states$propWht_scaled <- as.vector(scale(states$propWht))
states$propPOC_scaled <- as.vector(scale(states$propPOC))
states$medIncome_scaled <- as.vector(scale(states$medInc))
states$BUI_scaled <- as.vector(scale(as.numeric(states$BUI)))
states$AridIndex_scaled <- as.vector(scale(states$Arid))
```
```{r echo = FALSE, warning= FALSE, message = FALSE}
states.df<-states@data
head(states.df)
```
### Summary Table
```{r summary table, warning= FALSE, message = FALSE}
summary_states.df<-states.df%>%
dplyr::filter(STATE=="Maine"| STATE=="Florida" | STATE=="Arizona")
summary_states.df <- summary_states.df[,c(1,2,7,9,10,13,16,17,18,19,20)]
```
``` {r tranfsorm to long 2, warning= FALSE, message = FALSE}
summary_states.long <- summary_states.df%>%
tidyr::pivot_longer(cols=-c(STATE, County), names_to="variable",values_to="value")
head(summary_states.long)
```
``` {r summary table 2}
summary.table<-summary_states.long%>%
group_by(STATE, variable)%>%
dplyr::summarize(
obs=length(value[!is.na(value)]),
max=round(max(value, na.rm=TRUE),3),
min=round(min(value, na.rm=TRUE),3),
mean=round(mean(value, na.rm=TRUE),3),
median=round(median(value, na.rm=TRUE),3),
sd=round(sd(value, na.rm=TRUE),3),
skew=round(skewness(value, na.rm = TRUE),3)
)
head(summary.table,20)
```
```{r summary table gt}
summary.table %>%
gt() %>%
tab_header(
title = md("Summary Statistics of COVID-19 Case Rates and All Explanatory Variables in Arizona, Florida, and Maine")) %>%
fmt_passthrough (columns=c(`STATE`)) %>%
fmt_number(columns = c(obs), decimals = 0) %>%
fmt_number(columns = c(max), decimals = 2) %>%
fmt_number(columns = c(min), decimals = 2) %>%
fmt_number(columns = c(mean), decimals = 2) %>%
fmt_number(columns = c(median), decimals = 2) %>%
fmt_number(columns = c(sd), decimals = 2) %>%
fmt_number(columns = c(skew), decimals = 2)%>%
cols_label(
STATE="State",
obs = "Observations",
max = "Maximum",
min = "Minimum",
mean = "Mean",
median = "Median",
sd = "SD",
skew = "Skewness")
```
### data preparation: state filter
``` {r maine, warning= FALSE, message = FALSE}
maine.df <- filter(states.df, states.df$STATE == "Maine")
head(maine.df)
```
``` {r maine log transformations, warning= FALSE, message = FALSE}
maine.df.log<-maine.df%>%
mutate(Covid19.log=log(Covid19),
NDVI.log=log(NDVI),
popDens.log=log(popDens)
)
maine.df.log
```
``` {r maine remove INF, warning= FALSE, message = FALSE}
maine.df.log$Covid19.log[is.infinite(maine.df.log$Covid19.log)] <- NA
```
## Arizona Model (Chia Shen)
```{r arizona, warning= FALSE, message = FALSE}
Arizona.df <- filter(states.df, states.df$STATE == "Arizona")
head(Arizona.df)
AZ.df <- dplyr::select(Arizona.df, County, STATE, Urban, medInc, propWht, NDVI, PctPark, medAge, Arid, popDens, Covid19)
head(AZ.df)
AZ.s.df <- dplyr::select(Arizona.df, County, STATE, Urban, medIncome_scaled, propWht_scaled, NDVI_scaled, PctPark_scaled, medAge_scaled, AridIndex_scaled, popDens_scaled, Covid19)
head(AZ.s.df)
AZ.df[c(201,224),]
AZ.df[AZ.df$log.Covid19 < 0.001,]
```
### matrix
```{r matrix, warning= FALSE, message = FALSE}
AZ.m.df <- dplyr::select(AZ.df, c(3,4,5,6,7,8,9,10,11))
ggpairs(AZ.m.df, columns = 1:9, title = "Original Variables Correlation Matrix", axisLabels = "show")
AZ.m.s.df <- dplyr::select(AZ.s.df, c(3,4,5,6,7,8,9,10,11))
ggpairs(AZ.m.s.df, columns = 1:9, title = "Scaled Variables Correlation Matrix", axisLabels = "show")
```
### model 1.1: original, log-transform some variables, remove NA
```{r log-transformed model, warning= FALSE, message = FALSE}
AZ.df$log.Arid <- log(AZ.df$Arid)
AZ.df$log.popDens <- log(AZ.df$popDens)
AZ.df$log.Covid19 <- log(AZ.df$Covid19)
AZ.m.df <- dplyr::select(AZ.df, c(3,4,5,6,7,8,12,13,14))
ggpairs(AZ.m.df, columns = 1:9, title = "Log-transformed Variables Correlation Matrix", axisLabels = "show")
AZ.df[which(is.infinite(AZ.m.df$log.Covid19)),]
AZ.m.df$log.Covid19[is.infinite(AZ.m.df$log.Covid19)] <- NA #There are actually Covid cases in these ZCTA, but somehow they are 0 in the data
ggpairs(AZ.m.df, columns = 1:9, title = "Log-transformed Variables Correlation Matrix", axisLabels = "show")
```
```{r Logged orginal model, warning= FALSE, message = FALSE}
f1 <- log.Covid19 ~ Urban + medInc + propWht + NDVI + PctPark + medAge + log.Arid + log.popDens
m1 <- lm(f1, data=AZ.m.df, na.action = na.exclude) #Exclude the data that has NA in it
summary(m1)
```
```{r M1.1 Assumption test, warning= FALSE, message = FALSE}
vif(m1)
plot(m1,1)
plot(m1,4) #cook's distance
ols_plot_cooksd_bar(m1)
bptest(m1) #breusch-pagen test result reflects heteroskedasticiy problem
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))
```
### model 1.2: original, log-transform some variables, remove NA, outliers
```{r inspect and exclude outlier in log data, warning= FALSE, message = FALSE}
AZ.df[c(201,224),] #Yuma actually have much more White population, not 0 as the data indicates
AZ.m.df.nol <- AZ.m.df[-c(224),]
m1.nol <- lm(f1, data=AZ.m.df.nol)
summary(m1.nol)
AZ.vif <- vif(m1.nol)
plot(m1.nol,1)
ols_plot_cooksd_bar(m1.nol) #cook's distance
bptest(m1.nol)
AZ.bp <- data.frame(
bp = as.numeric(c(35.24)),
df = as.numeric(c(8)),
"p-value" = as.numeric(c(0.00002))
)
AZ.bp %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(40)) %>%
tab_header(
title = md("BP Test Result of Arizona Model"))
AZ.vifp <- as.data.frame(t(as.data.frame(AZ.vif)))
AZ.vifp %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("VIF of Arizona Model"))%>%
fmt_number(columns = c(Urban), decimals = 2) %>%
fmt_number(columns = c(medInc), decimals = 2) %>%
fmt_number(columns = c(propWht), decimals = 2) %>%
fmt_number(columns = c(NDVI), decimals = 2) %>%
fmt_number(columns = c(PctPark), decimals = 2)%>%
fmt_number(columns = c(medAge), decimals = 2) %>%
fmt_number(columns = c(log.Arid), decimals = 2) %>%
fmt_number(columns = c(log.popDens), decimals = 2)
```
```{r robust standard error for m1.nol, warning= FALSE, message = FALSE}
coeftest(m1.nol, vcov = vcovHC(m1.nol, type = "HC1")) #source: https://www.r-econometrics.com/methods/hcrobusterrors/
```
### model 2.1: scaled
```{r scaled model, warning= FALSE, message = FALSE}
f2 <- Covid19 ~ Urban + medIncome_scaled + propWht_scaled +
NDVI_scaled + PctPark_scaled + medAge_scaled +
AridIndex_scaled + popDens_scaled
m2 <- lm(f2, data=AZ.m.s.df)
summary(m2)
```
```{r Assumption test for scaled model AZ, warning= FALSE, message = FALSE}
bptest(m2)
vif(m2)
plot(m2,1)
plot(m2,4) #cook's distance
AZ.df[224,] #Yuma's propWht is 0, which is not reasonable
coeftest(m2, vcov = vcovHC(m1.nol, type = "HC1"))
```
### model 2.2: scaled, remove outlier
```{r remove outlier, warning= FALSE, message = FALSE}
AZ.m.s.df.nol <- AZ.m.s.df[-c(224),]
m2.nol <- lm(f2, data=AZ.m.s.df.nol)
summary(m2.nol)
bptest(m2.nol)
vif(m2.nol)
plot(m2.nol,1)
ols_plot_cooksd_bar(m2.nol)
```
```{r robust standard error for m2.nol, warning= FALSE, message = FALSE}
coeftest(m2.nol, vcov = vcovHC(m2.nol, type = "HC1"))
```
```{r AIC comparison, warning= FALSE, message = FALSE}
AIC_AZ <- AIC(m1.nol,m2,m2.nol)
mr <- c("Logged Model Without Outlier","Scaled Model","Scaled Model Without Outlier")
AIC_AZ <- cbind(AIC_AZ,mr)
AIC_AZ = AIC_AZ[,c(3,1,2)]
AIC_AZ
```
```{r model comparison table, warning= FALSE, message = FALSE}
AIC_AZ %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("AIC for Arizona Models")) %>%
fmt_number(columns = c(df), decimals = 0) %>%
fmt_number(columns = c(AIC), decimals = 2)%>%
cols_label(
mr = "Model")
tab_model(m1.nol,m2,m2.nol,
dv.labels = c("Logged Data without Outlier","Scaled Data", "Scaled Data without Outlier"),
vcov.fun = "HC1",
show.se = TRUE,
CSS = list(css.table = '+font-family: Arial;')) #source:https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html
#change the font: https://stackoverflow.com/questions/73103366/how-to-change-font-style-to-arial-for-tab-model-output
```
## Jon Ekberg's Florida Model
### Scaled Variables
```{r scaling, warning= FALSE, message = FALSE}
# Scale all variables
states$NDVI_scaled <- as.vector(scale(states$NDVI))
states$PctPark_scaled <- as.vector(scale(states$PctPark))
states$popDens_scaled <- as.vector(scale(states$popDens))
states$medAge_scaled <- as.vector(scale(states$medAge))
states$Days_scaled <- as.vector(scale(states$Days))
states$propWht_scaled <- as.vector(scale(states$propWht))
states$propPOC_scaled <- as.vector(scale(states$propPOC))
states$medIncome_scaled <- as.vector(scale(states$medInc))
states$BUI_scaled <- as.vector(scale(as.numeric(states$BUI)))
states$AridIndex_scaled <- as.vector(scale(states$Arid))
```
### Loading Dataframes
```{r, warning= FALSE, message = FALSE}
table(states.df$STATE)
```
```{r, warning= FALSE, message = FALSE}
states.df<-states@data
head(states.df)
```
```{r, warning= FALSE, message = FALSE}
Florida.df <- dplyr::filter(states.df, states.df$STATE == "Florida")
head(Florida.df)
```
```{r, warning= FALSE, message = FALSE}
FL.df <- dplyr::select(Florida.df, County, STATE, Urban, medInc, propWht, NDVI, PctPark, medAge, Arid, popDens, Covid19)
head(FL.df)
```
```{r, warning= FALSE, message = FALSE}
FL.m.df <- dplyr::select(Florida.df, Urban, medInc, propWht, NDVI, PctPark, medAge, Arid, popDens, Covid19)
```
```{r, warning= FALSE, message = FALSE}
FL.s.df <- dplyr::select(Florida.df, County, STATE, Urban, medIncome_scaled, propWht_scaled, NDVI_scaled, PctPark_scaled, medAge_scaled, AridIndex_scaled, popDens_scaled, Covid19)
head(FL.s.df)
```
```{r, warning= FALSE, message = FALSE}
FL.m.s.df <- dplyr::select(Florida.df, Urban, medIncome_scaled, propWht_scaled, NDVI_scaled, PctPark_scaled, medAge_scaled, AridIndex_scaled, popDens_scaled, Covid19)
head(FL.m.s.df)
```
library(GGally)
### Florida Scatterplot Matrix 1 (Original Variables)
```{r scatter matrix 1, warning= FALSE, message = FALSE}
FL.Matrix.Original<-c("Urban","medInc","propWht","NDVI","PctPark","medAge","Arid","popDens","Covid19")
ggpairs(FL.m.df,columns=FL.Matrix.Original,title="Florida Original Variables Correlation Matrix",axisLabels="show")
```

### Florida Scatterplot Matrix 2 (Scaled Variables)
```{r scatter matrix 2, warning= FALSE, message = FALSE}
FL.Matrix.Scaled <-c("Urban","medIncome_scaled","propWht_scaled","NDVI_scaled","PctPark_scaled","medAge_scaled","AridIndex_scaled","popDens_scaled","Covid19")
ggpairs(FL.m.s.df,columns=FL.Matrix.Scaled,title="Florida Scaled Variables Correlation Matrix",axisLabels="show")
```

### Florida Scatterplot Matrix 3 (Logged Variables)
```{r scatter matrix 3, warning= FALSE, message = FALSE}
FL.df$log.popDens <- log(FL.df$popDens)
FL.df$log.Covid19 <- log(FL.df$Covid19)
FL.Matrix.Logged <-c("Urban","medInc","propWht","NDVI","PctPark","medAge","Arid","log.popDens","log.Covid19")
ggpairs(FL.df,columns=FL.Matrix.Logged,title="Florida Logged Variables Correlation Matrix",axisLabels="show")
FL.df$log.popDens[is.infinite(FL.df$log.popDens)] <-0
FL.df$log.Covid19[is.infinite(FL.df$log.Covid19)] <- NA
ggpairs(FL.df, columns = FL.Matrix.Logged, title = "Florida Log-transformed Variables Correlation Matrix", axisLabels = "show")
```

### Florida Model 1 (Original Variables)
```{r florida model 1, warning= FALSE, message = FALSE}
FL.Model.Original<- lm(Covid19 ~ Urban+medInc+propWht+NDVI+PctPark+medAge+Arid+popDens, data=FL.df)
summary(FL.Model.Original)
```

```{r florida model 1 residual vs fitted, warning= FALSE, message = FALSE}
plot(FL.Model.Original,1)
```

```{r florida model 1 cooks distance, warning= FALSE, message = FALSE}
plot(FL.Model.Original,4)
```

```{r florida model 1 VIF test, warning= FALSE, message = FALSE}
vif(FL.Model.Original)
```

```{r florida model 1 BP test, warning= FALSE, message = FALSE}
bptest(FL.Model.Original)
```

### Florida Model 2 (Scaled Variables)
```{r florida model 2 scaled, warning= FALSE, message = FALSE}
FL.Model.Scaled<- lm(Covid19 ~ Urban+medIncome_scaled+propWht_scaled+NDVI_scaled+PctPark_scaled+medAge_scaled+AridIndex_scaled+popDens_scaled, data=FL.m.s.df)
summary(FL.Model.Scaled)
```

```{r florida model 2 residual vs fitted, warning= FALSE, message = FALSE}
plot(FL.Model.Scaled,1)
```

```{r florida model 2 cooks distance, warning= FALSE, message = FALSE}
plot(FL.Model.Scaled,4)
```

```{r florida model 2 VIF test, warning= FALSE, message = FALSE}
vif(FL.Model.Scaled)
```

```{r florida model 2 BP test, warning= FALSE, message = FALSE}
bptest(FL.Model.Scaled)
```

### Florida Model 3 (Logged Some Original Variables)
```{r florida model logged, warning= FALSE, message = FALSE}
Florida.model.logged<- lm(log.Covid19 ~ Urban+medInc+propWht+NDVI+PctPark+medAge+Arid+log.popDens, data=FL.df, na.action = na.exclude)
summary(Florida.model.logged)
```

```{r florida model 3 residual vs fitted, warning= FALSE, message = FALSE}
plot(Florida.model.logged,1)
```

```{r florida model 3 cooks distance, warning= FALSE, message = FALSE}
plot(Florida.model.logged,4)
```

```{r florida model 3 VIF test, warning= FALSE, message = FALSE}
FL.vif <- vif(Florida.model.logged)
FL.vif
```

```{r florida model 3 BP test, warning= FALSE, message = FALSE}
bptest(Florida.model.logged)
```

```{r florida combined bp vif, warning= FALSE, message = FALSE}
FL.bp <- data.frame(
bp = c(19.307),
df = c(7),
pvalue = c(0.007277)
)
FL.bp %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(40)) %>%
tab_header(
title = md("BP Test Result of Florida Model")
)
FL.vifp <- as.data.frame(t(as.data.frame(FL.vif)))
FL.vifp %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("VIF of Florida Model"))%>%
fmt_number(columns = c(medInc), decimals = 2) %>%
fmt_number(columns = c(propWht), decimals = 2) %>%
fmt_number(columns = c(NDVI), decimals = 2) %>%
fmt_number(columns = c(PctPark), decimals = 2)%>%
fmt_number(columns = c(medAge), decimals = 2) %>%
fmt_number(columns = c(Arid), decimals = 2) %>%
fmt_number(columns = c(log.popDens), decimals = 2)
```


### Florida AIC Comparison
```{r florida AIC, warning= FALSE, message = FALSE}
AIC_FL <- AIC(FL.Model.Original,FL.Model.Scaled,Florida.model.logged)
AIC_FL$model <- c("Original Data", "Scaled Data", "Log Transformed Data")
AIC_FL <- relocate(AIC_FL,"model",)
AIC_FL
AIC_FL %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("AIC for Florida Models")) %>%
fmt_number(columns = c(df), decimals = 0) %>%
fmt_number(columns = c(AIC), decimals = 2)%>%
cols_label(
model = "Model")
tab_model(FL.Model.Original,FL.Model.Scaled,Florida.model.logged, vcov.fun = "HC1", dv.labels=c("Original Data","Scaled Data","Log Transformed Data"),show.se = TRUE, CSS = list(css.table = '+font-family: Arial;')) #source:https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html
#change the font: https://stackoverflow.com/questions/73103366/how-to-change-font-style-to-arial-for-tab-model-output
```


## Maine Model (Simon)
### Scatter Matrixes
### Original Data
``` {r scatter matrix original maine, warning= FALSE, message = FALSE}
columns.select.original<-c("Urban","medInc","propWht","NDVI","PctPark","medAge","Arid","popDens","Covid19")
ggpairs(maine.df,columns=columns.select.original,title="",axisLabels="show")
```

### Scaled Data
``` {r scatter matrix scaled maine, warning= FALSE, message = FALSE}
columns.select.s<-c("Urban","medIncome_scaled","propWht_scaled","NDVI_scaled","PctPark_scaled","medAge_scaled","AridIndex_scaled","popDens_scaled","Covid19")
ggpairs(maine.df,columns=columns.select.s,title="",axisLabels="show")
```

### Logged Data
``` {r remove INF, warning= FALSE, message = FALSE}
maine.df.log$Covid19.log[is.infinite(maine.df.log$Covid19.log)] <- NA
```
``` {r scatter matrix logged maine, warning= FALSE, message = FALSE}
columns.select.logged<-c("Urban","medInc","propWht","NDVI.log","PctPark","medAge","Arid","popDens.log","Covid19.log")
ggpairs(maine.df.log,columns=columns.select.logged,title="",axisLabels="show")
```

### model 1: original data
``` {r linear model original maine, warning= FALSE, message = FALSE}
model.original.ME<- lm(Covid19 ~ Urban+medInc+propWht+NDVI+PctPark+medAge+Arid+popDens, data=maine.df)
summary(model.original.ME)
```

``` {r residual vs fitted original maine, warning= FALSE, message = FALSE}
plot(model.original.ME,1)
plot(model.original.ME,4)
```


### model 2: scaled data
``` {r linear model scaled maine, warning= FALSE, message = FALSE}
model.scaled.ME<- lm(Covid19 ~ Urban+medIncome_scaled+propWht_scaled+NDVI_scaled+PctPark_scaled+medAge_scaled+AridIndex_scaled+popDens_scaled, data=maine.df)
summary(model.scaled.ME)
```

``` {r residual vs fitted scaled maine, warning= FALSE, message = FALSE}
plot(model.scaled.ME,1)
plot(model.scaled.ME,4)
```


### model 3: log-transformed data, excl. NA
``` {r linear model log transformed maine, warning= FALSE, message = FALSE}
model.logged.ME<- lm(Covid19.log ~ Urban+medInc+propWht+NDVI.log+PctPark+medAge+Arid+popDens.log, data=maine.df.log, na.action = na.exclude)
summary(model.logged.ME)
```

``` {r residuals vs fitted logged, warning= FALSE, message = FALSE}
plot(model.logged.ME,1)
plot(model.logged.ME,4)
```


``` {r bp test logged maine, warning= FALSE, message = FALSE}
bptest(model.logged.ME)
```
``` {r bp test table, warning= FALSE, message = FALSE}
bp_ME <- data.frame(
BP = c(10.59),
df = c(8),
"p-value" = c(0.2258)
)
bp_ME %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("Breusch-Pagan Test for Maine Model"))
```
``` {r vif test logged maine, warning= FALSE, message = FALSE}
vif(model.logged.ME)
```
``` {r vif test table, warning= FALSE, message = FALSE}
vif_ME <- data.frame(
Urban = c(1.65),
medInc = c(1.46),
propWht = c(1.33),
NDVI.log = c(1.92),
PctPark = c(1.12),
medAge = c(1.38),
Arid = c(1.60),
popDens.log = c(2.74)
)
vif_ME %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(100)) %>%
tab_header(
title = md("VIF Test for Maine Model"))
```
### Maine AIC Comparison
```{r maine AIC comparison, warning= FALSE, message = FALSE}
AIC_ME <- AIC(model.original.ME,model.scaled.ME,model.logged.ME)
AIC_ME$model <- c("Original Data", "Scaled Data", "Log Transformed Data")
AIC_ME <- relocate(AIC_ME,"model",)
AIC_ME
```
```{r maine AIC comparison table, warning= FALSE, message = FALSE}
AIC_ME %>%
gt() %>%
tab_options(
table.width = pct(80)) %>%
cols_width(
everything() ~ px(80)) %>%
tab_header(
title = md("AIC for Maine Models")) %>%
fmt_number(columns = c(df), decimals = 0) %>%
fmt_number(columns = c(AIC), decimals = 2)%>%
cols_label(
model = "Model")
tab_model(model.original.ME,model.scaled.ME,model.logged.ME, dv.labels = c("Original Data", "Scaled Data","Logged Data"), show.se = TRUE, CSS = list(css.table = '+font-family: Arial;'))
```
-->