--- title: "Causal Effect of Gender on Extra-Marital Affairs" author: "Aditya Agrawal, Ruikang Lan, Anisha Mula, Pahal Patangia, Michael Thompson, Yili Yu" date: "February 28, 2020" output: pdf_document header-includes: - \usepackage{setspace}\doublespacing fontsize: 12pt --- ## Introduction A local Law Firm is a Minneapolis regional accident liability litigator. The Law Firm was founded by Ruikang Lan in 2002 she found early success growing her firm to nearly 50 employees by 2008 when she named Aditya Agrawal as partner to the new small business law wing. The Law Firm is now nearly 120 employees but has remained so for the past five years. Mr. Agrawal and Ms. Lan are looking for ways to expand their firm. With a major firm in the local divorce settlement market having been debarred for gross misconduct in a discrimination case, Mr. Agrawal and Ms. Lan decided to promote Yili Yu to partner to lead their new divorce settlement division. As leader of the new division, Mr. Yu is looking to expand awareness of the firm’s new service to the greater Minneapolis audience. However, being a new wing of the firm there is a limited marketing budget allocated to marketing. Mr. Yu conducted some preliminary research to determine how to narrow the customer group to focus on. He contacted some colleagues in other states currently practicing in divorce settlements and asked where their clients are referred from. His colleagues said over 70% of clients choose their firm from radio commercials and word of mouth. Mr. Yu contacted the five top local radio stations getting details about pricing of advertisements and demographics of their audience. The cost of advertisements was competitive with one another. The demographics of each radio station primarily deferred from one another in gender and economic status. With most clientele being heterosexual couples and divorce occurring across economic status, Mr. Yu decided to continue research on the difference of gender in heterosexual divorce. According to the Institute for Family Studies, infidelity is the only common reason for divorce among the top three reasons from five separate studies (Stanley, 2017). To better define the demographics to focus the marketing budget on, Mr. Yu wants to know do husbands or wives tend to initiate those divorces first. According to the National Institute of Health the spouse who was cheated on is nearly 70-80% less likely to be the one who wanted the divorce more (England et al., 2014). Mr. Yu believes men are more likely to have affairs. Being those who cheated are more likely to initiate the divorce, he believes the Law Firm should disproportionately market to men and women. He presented his question to the the Law Firm data scientist who helped reframe into a causal question: How should the Law Firm focus their marketing budget to each gender? **Causal Question:** How should the Law Firm focus their marketing budget to each gender? ## Dataset / Measures The [dataset](https://pages.stern.nyu.edu/~wgreene/Text/Edition7/tablelist8new.htm) is obtained from NYU Stern Econometric Analysis, 7th and 8th Edition datasets. Originally, the data that is used to estimate the model was collected from two magazine surveys. The first survey was conducted in 1969 by Psychology Today (PT). A questionnaire on sex was published in the July 1969 issue of PT, and readers were asked to mail in their answers. Of the 20,000 surveys received, only 2000 were recorded. Of the 2000 recorded surveys, many more were discarded due to the survey being incomplete or a spouse having more than one divorce. The final dataset consists of 601 unique affair surveys. The questionnaires included questions about extramarital affairs, other aspects of sexual behavior, and demographic characteristics of the individual. The data is availble in R as a part of "AER" package. ```{r echo=FALSE, message=FALSE, warning=FALSE} library(AER) library(MatchIt) library(ggplot2) library(rbounds) library(patchwork) library(dplyr) data(Affairs) ``` ```{r echo=FALSE, message=FALSE, warning=FALSE} str(Affairs) ``` **Data Types Expalanation** * **Affair**: Target variable and categorical field of number of affairs within the past year. * **Age**: Continuous field of years since birth. * **Children**: Binary field of whether the couple has at least one child. * **Education**: Continuous field where numbers represent different levels of education. * **Gender**: Treatment variable and binary field of female or male. * **Occupation**: Continuous field where numbers represent different types of occupation such as blue collar vs professional degree. * **Rating**: Continuous field from 1 to 5 of the surveyee's reported rating of the marriage. * **Religiousness**: Continuous field from 1 to 5 of the surveyee's reported rating of their religousness. * **Years of Marriage**: Continuous field of the number of years of the marriage. The current structure of the data is not fit for our analysis. The following are some of the modifications needed prior to analysis: * **Affairs**: The field has to become binary from its categorical coding. We are interested in the effects of gender on having at least one extra-marital affair. * **Education & Occupation**: These field have to become categorical because it reflects the level of education. Occupation has to become categorical because its numerical coding is according to Hollingshead classification (reverse numbering) of different segments of occupation by economic status. * **Religiousness & Ratings** These fields should become categorical because it reflects the subjective scale of religiousness or marriage quality of a person. * **Children & Gender** These fields should become binary or factored to be used within the matching algorithm. ```{r echo=FALSE, message=FALSE, warning=FALSE} Affairs$X <- NULL Affairs$children <- as.factor(Affairs$children) Affairs$religiousness <- factor(Affairs$religiousness, order=TRUE, levels=c(1,2,3,4,5)) Affairs$education <- factor(Affairs$education, order=TRUE, levels=c(9,12,14,16,17,18,20)) Affairs$occupation <- as.factor(Affairs$occupation) Affairs$gender <- ifelse(Affairs$gender=='male',1,0) Affairs$rating <- factor(Affairs$rating, order=TRUE, levels=c(1,2,3,4,5)) Affairs$affairs <- ifelse(Affairs$affairs>0, 1,0) ``` Lets check the new structure now ```{r echo=FALSE, message=FALSE, warning=FALSE} str(Affairs) ``` We want to investigate if being a male results in higher tendency to have extra-marital affairs? For our analysis: X (treatment) -> Gender - Male or Female Y (outcome) -> Log odds of having an affair being a male to female The other variables in the dataset would be a part of control group. We shall analyze the distributions of different variables to understand their characterstics, and also check for potential outliers ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Gender Affairs %>% group_by(gender) %>% summarise(count=n())%>% ggplot(aes(x=gender,y=count)) + geom_bar(stat = "identity",col="red",fill="green", alpha = .2, width = 0.3)+ ggtitle("Distribution of Gender")+ xlab("Gender")+ylab("Count") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Ratings ggplot(data=Affairs,aes(x=rating)) + geom_histogram(col="red",fill="green", alpha = .2, binwidth = 0.5)+ ggtitle("Distribution of Ratings")+ xlab("Ratings") + ylab("Count") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Ratings and Gender ggplot(data=Affairs,aes(x=rating,fill=gender)) + geom_histogram(binwidth = 0.5,position = 'dodge')+ ggtitle("Distribution of Ratings across gender")+ xlab("Ratings")+ylab("Count") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #years married ggplot(data=Affairs, aes(x=yearsmarried)) + geom_density(alpha=0.4,fill='skyblue') + ggtitle("Distribution of years married")+ xlab("Years married")+ ylab("Density") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Years married and gender ggplot(data=Affairs,aes(x=yearsmarried)) + geom_histogram(binwidth = 2,col="red", fill="green", alpha = .2)+ ggtitle("Distribution of Years married across gender")+ xlab("Years married")+ ylab("Count") + facet_grid(.~gender) ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Age ggplot(data=Affairs,aes(x=age)) + geom_histogram(breaks = seq(0, 70, 5),col="red", fill="green", alpha = .2) + scale_x_continuous(breaks = seq(0, 70, 5),lim = c(10, 70))+ ggtitle("Distribution of Age")+ xlab("Age")+ ylab("Count") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Age and gender ggplot(data=Affairs, aes(x=age, fill=gender)) + geom_density(alpha=0.4) + ggtitle("Distribution of age across gender")+ xlab("Age")+ylab("Density") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Religiousness ggplot(data=Affairs,aes(x=religiousness)) + geom_histogram(col="red", fill="green", alpha = .2, binwidth=0.5)+ ggtitle("Distribution of religiousness score")+ xlab("Religiousness score")+ ylab("Count") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Religiousness and gender ggplot(data=Affairs,aes(x=religiousness)) + geom_histogram(binwidth = 0.5,col="red", fill="green",alpha = .2)+ ggtitle("Distribution of religiousness across gender")+ xlab("Religiousness score")+ ylab("Count") + facet_grid(.~gender) ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Children and gender Affairs%>%group_by(gender,children)%>%summarise(count=n())%>% ggplot(aes(x=children,y=count)) + geom_bar(stat='identity',width = 0.5,col="red", fill="green", alpha = .2)+ ggtitle("Distribution of children across gender")+ xlab("Number of children")+ ylab("Count")+ facet_grid(.~gender) ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Education and gender ggplot(data=Affairs, aes(x=education, fill=gender)) + geom_density(alpha=0.4) + ggtitle("Distribution of education across gender")+ xlab("Education") + ylab("Density") ``` ``` {r echo=FALSE, message=FALSE, warning=FALSE} #Affairs and gender Affairs1<-Affairs Affairs1$affair1<-if_else(Affairs1$affairs>0,1,0) Affairs1%>%group_by(gender,affair1)%>%summarise(count=n())%>% group_by(gender)%>%mutate(prop=count/sum(count))%>% ggplot(aes(x=affair1,y=prop,fill=gender)) + geom_bar(stat='identity',position='dodge')+ scale_x_continuous(breaks=seq(0,1))+ ggtitle("Distribution of affairs across gender")+ xlab("If affair then 1 otherwise 0")+ ylab("Proportion") ``` ## Methods Before running the regression, we also need to verify our assumptions. First, we test the randimization test and see if the independent variables are correlated to our treatment effect, gender. #### Exhibit 1: T-Test Results for Independent variables & Treatment Effect ```{r echo=FALSE, message=FALSE, warning=FALSE} cat('t.test(yearsmarried ~ gender, data = Affairs), pvalue:', t.test(yearsmarried ~ gender, data = Affairs)[3][[1]], '') cat('\n') cat('t.test(as.numeric(children) ~ gender, data = Affairs), pvalue:', t.test(as.numeric(children) ~ gender, data = Affairs)[3][[1]], '') cat('\n') cat('t.test(as.numeric(religiousness) ~ gender, data = Affairs), pvalue:', t.test(as.numeric(religiousness) ~ gender, data = Affairs)[3][[1]],'') cat('\n') cat('t.test(as.numeric(education) ~ gender, data = Affairs), pvalue:', t.test(as.numeric(education) ~ gender, data = Affairs)[3][[1]], '') cat('\n') cat('t.test(as.numeric(occupation) ~ gender, data = Affairs), pvalue:', t.test(as.numeric(occupation) ~ gender, data = Affairs)[3][[1]], '') cat('\n') cat('t.test(as.numeric(rating) ~ gender, data = Affairs), pvalue:', t.test(as.numeric(rating) ~ gender, data = Affairs)[3][[1]], '') cat('\n') cat('t.test(age ~ gender, data = Affairs), pvalue:', t.test(age ~ gender, data = Affairs)[3][[1]], '') ``` We noticed that age, education and occupation are significantly correlated with gender, which means these factors are not fully randomized in our treatment groups. We need to match similar data points for every variables within trteatment groups to get rid of these correlated impact that will block the true influence of gender and the probability of having an affair. Before matching, we can view propensity score of the probabilty being male or female base on other variables to see the non-randomization pattern in our dataset. #### Exhibit 2: Propensity Score Before Matching ```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=8, fig.height=5} PScore = glm(gender ~ age + yearsmarried + children + education + religiousness+ occupation + rating, data = Affairs, family = "binomial")$fitted.values Affairs$PScore = PScore ggplot(Affairs, aes(x = PScore, color = factor(gender))) + geom_density() + labs(title='Ditribution before matching') + scale_color_discrete(name = "Gender", labels=c("Female", "Male")) ``` We tried different matching methods and examined their results. First we used exact matching to match all the data points with ratio equals to 1 based on all the other variables except gender and affairs, and we got a result with a comparably small data size (Exhibit 3). Exhibit 4 also shows the distribution of propensity score after exact matching. #### Exhibit 3: Summary Results From Exact Matching ```{r echo=FALSE, message=FALSE, warning=FALSE} match_output_exact <- matchit(gender ~ age + yearsmarried + children + education + religiousness+ occupation + rating, data = Affairs, method = 'exact', distance = "logit", replace = FALSE, ratio = 1) summary(match_output_exact)[2] data_match_exact = match.data(match_output_exact) ``` #### Exhibit 4: Propensity Score Distribution After Exact Matching ```{r echo=FALSE, message=FALSE, warning=FALSE} ggplot(data_match_exact, aes(x = PScore, color = factor(gender))) + geom_density() + labs(title='Distribution After Exact Matching') + scale_color_discrete(name = "Gender", labels=c("Female", "Male")) ``` From the Exhibit 4, it was pretty easy to conclude that the probabilty distribution of being males or females based on all the other dependent variables within the treatment and control groups do not have similar patterns, which proved that exact matching is not an ideal solution for us. We tried the second method which was propensity score matching(PSM). The algorithim will try to find the nearset pairs of data points in these two groups considering all the other factors. We got a dataset with more matched data points at 210 (Exhibit 5) and the distributions of the propensity score of these two groups (Exhibit 6) are equally matched. #### Exhibit 5: Summary Results From PSM Matching ```{r echo=FALSE, message=FALSE, warning=FALSE} match_output_psm <- matchit(gender ~ age + yearsmarried + children + education + religiousness+ occupation + rating, data = Affairs, method = 'nearest', distance = "logit", caliper = 0.01, replace = FALSE, ratio = 1) summary(match_output_psm)[2] data_match_psm = match.data(match_output_psm) ``` #### Exhibit 6: Propensity Score Distribution After PSM Matching ```{r echo=FALSE, message=FALSE, warning=FALSE,fig.width=8, fig.height=4} ggplot(data_match_psm, aes(x = PScore, color = factor(gender))) + geom_density() + labs(title='Distribution After PSM Matching') + scale_color_discrete(name = "Gender", labels=c("Female", "Male")) ``` We also examined the results from the third matching method, CEM matching. By bucketing the continuous variables into discrete values first, we got a different matching results. The sample size (Exhibit 7) we got was about 26 and the distribution plot (Exhibit 8) is not as perfect as the one from PSM matching (Exhibit 6). #### Exhibit 7: Summary Results From CEM Matching ```{r echo=FALSE, message=FALSE, warning=FALSE} match_output_cem <- matchit(gender ~ age + yearsmarried + children + education + religiousness+ occupation + rating, data = Affairs, method = 'cem', distance = "logit", ratio = 1) summary(match_output_cem)[2] data_match_cem = match.data(match_output_cem) ``` #### Exhibit 8: Propensity Score Distribution After CEM Matching ```{r echo=FALSE, message=FALSE, warning=FALSE,fig.width=8, fig.height=4} ggplot(data_match_cem, aes(x = PScore, color = factor(gender))) + geom_density() + scale_fill_discrete(name='Gender',labels=c("female","male"))+ labs(title='Ditribution After CEM Matching') ``` Hence, we concluded from the distribution plot and the sample size that the PSM matching will be a more appropriate solution in our case. We performed the Rosenbaum sensitivity test to inspect the robustness of our output results. From Exhibit 9, we could state that the matching results seem to be robust, having p-value increased by a small value as Gamma increased. #### Exhibit 9: Summary Results From Rosenbaum Sensitivity Test ```{r echo=FALSE, message=FALSE, warning=FALSE} psens((data_match_psm%>%filter(gender==1))$affairs, (data_match_psm%>%filter(gender==0))$affairs, Gamma = 1.5, GammaInc = .1) ``` Also, we applied a t-test to the variables after PSM matching against treatment effect and there were no sinificant relationships (Exhibit 10). We are confident the treatment variable gender is not correlated with any variable in our regression. We will proceed in using the PSM matching results in the remainder of the analysis. #### Exhibit 10: T-Test Results for Independent variables & Treatment ```{r echo=FALSE, message=FALSE, warning=FALSE} cat('t.test(yearsmarried ~ gender, data = data_match_all), pvalue:', t.test(yearsmarried ~ gender, data = data_match_all)[3][[1]], '') cat('\n') cat('t.test(as.numeric(children) ~ gender, data = data_match_all), pvalue:', t.test(as.numeric(children) ~ gender, data = data_match_all)[3][[1]], '') cat('\n') cat('t.test(as.numeric(religiousness) ~ gender, data = data_match_all), pvalue:', t.test(as.numeric(religiousness) ~ gender, data = data_match_all)[3][[1]],'') cat('\n') cat('t.test(as.numeric(education) ~ gender, data = data_match_alls), pvalue:', t.test(as.numeric(education) ~ gender, data = data_match_all)[3][[1]], '') cat('\n') cat('t.test(as.numeric(occupation) ~ gender, data = data_match_all), pvalue:', t.test(as.numeric(occupation) ~ gender, data = data_match_all)[3][[1]], '') cat('\n') cat('t.test(as.numeric(rating) ~ gender, data = data_match_all), pvalue:', t.test(as.numeric(rating) ~ gender, data = data_match_all)[3][[1]], '') cat('\n') cat('t.test(age ~ gender, data = data_match_all), pvalue:', t.test(age ~ gender, data = data_match_all)[3][[1]], '') ``` We then fitted a logistic regression with independent variable as affairs and treatment effect as gender to discover if there exist a relationship between gender and affairs. Exhibit 11 showed the results. #### Exhibit 12: Summary Results From Logit Regression ```{r echo=FALSE, message=FALSE, warning=FALSE} model <- glm(data=data_match_all, affairs ~ gender) summary(model) ``` ## Results The p-value for gender indicates that there is a 45.1% probability of observing as extreme as this results given gender has no impact on having affairs. Based on the defualt confidence level we choose, there is no evidence for us to reject Null Hypothesis: that gender has no causal effect on having affairs. In addition, p-value for the whole simple linear relationship is also very high, demonstrating that we need to add more complexity to the model. ## Threats to Causal Inference Selection Bias 1. This survey was sent only to the readers of the Psychology Today magazine which in turn may not be representative of the overall population. 2. The readers who have fully completed the survey are included in this data. This may lead to losing out information on other readers who may better help generalize to the true population. 3. This data is also limited to readers who have at max one divorce. Response Bias 1. Having an affair is not socially acceptable. People may be adversive of giving a correct answer as they fear social ramifications of an affair being known publicly. This provocates the participants to hide the truth while answering and thus, impeding the ability of the observations correctness. 2. Readers willingly participated. Different types of divorce and affair situations may not have evenly participated. 3. There will be different perceptions of infidelity between men and women which may lead to a bias how affairs and divorce are reported between genders. Measurement Error 1. Information provided in the data such as religiousness, relationship, rating and occupation are subjective to interpretation for different people possibly discrediting our PSM matching approach to ensure randomization. 2. There is a possibility of error while recording the survey information. 3. Its possible negative cases of affair in the sample are actually positive and were at the time unknown to the submiter of the survey. Presence of Omitted Variable Bias 1. The effect of who initiated the affair is confounded in our analysis. There could be a difference in the proportion of men and women taking the first step to having an extra marital affair. 2. We do not have information on where, what, and how the spouse was lead into infidelity. 3. Genetics could also be a factor in such behavior by the sample. If the parents of the cheating spouse also had affairs or divorce, it could have an effect on the spouse's propensity to have an affair or divorce too. ## Limitations 1. We performed a power t-test to check the delta we could infer from our 210 survey sample size. The results (Exhibit 12) showed that, from our sample we could only detect at minimum 27.4% difference in affair behavior between genders. This may be too large to notice a more subtle different between men and women on affairs given a 95% confidence level. 2. The surveys used were highly likely not random. We need to survey people at random and collect more data regarding their economic situation, demographics, parental history of divorce, and any other possible omitted variables to better represent the causal difference if present. #### Exhibit 12: Summary Results From Power T-Test ```{r} power.t.test(n=210,type=c("two.sample"), power=0.8, sig.level=0.05) ``` ## Conclusions & Recommendation In conclusion, we recommend Law firm focus on a few changes for their marketing plan. First, we recommend to market to male and female clients evenly for now. Second, we recommend looking into the difference of affairs rates with different demographics such as economic status and age. Third, we suggest the data scientist team conduct a randomized survey with a much larger dataset in order to conclude if there are smaller differences in the effects of gender on affair then were capable with this data. ## References England, P., Allison, P. D., & Sayer, L. C. (2014). When one spouse has an affair, who is more likely to leave?. Demographic research, 30, 535–546. https://doi.org/10.4054/DemRes.2014.30.18 Stanley, S. (2017, April 10). Reasons People Give for Divorce. Retrieved from https://ifstudies.org/blog/reasons-people-give-for-divorce