# Group 3 Progress Notes: Our reproducibility plan
The paper we are planning to reproduce is:
- [Behavioral Immune Trade-Offs: Interpersonal Value Relaxes Social Pathogen Avoidance](https://journals.sagepub.com/doi/full/10.1177/0956797620960011)
- [OSF](https://osf.io/4agk8/)
## Plan
There are 3 studies reported in this paper. For each study, the goal is to reproduce the:
1. demographic descriptives (reported in Participants), e.g. % of male participants, M and SD of age, r values
2. figures (violin plot, with a box plot inside it, of contact comfort by target type, and scatter plot of contact comfort by WTR with linear regression lines)
3. all means and standard deviations mentioned in the text
### Study 1

### Study 2

### Study 3

## Figures


# Reproduction Doc
[Reproduction Doc](https://hackmd.io/@caitie11122/rJkfT3CYc)
## 24.06.22
Dataset: 0 = prioritise themselves (self), 1 = prioritise the other person (other)
- We standardised this so that the numbers are consistent
- We are trying to count the number of switch points
Working backwards
- Replicating the descriptive statistics, graphs (which have no code)
- Our current goal is to replicate: '*Mean contact comfort was 0.06 (SD = 1.97).*'
- Oddly, the paper reports this line before it mentions the exclusion criteria
Note: When we looked into the WTR task and dataset, we tried to understand what the numbers actually meant. Since WTR is a ratio, we tried dividing the numbers to see whether they map onto the given dataset; this worked for some of the numbers, but others don't align. This could mean the values weren't recorded accurately, or that some numbers were changed.
## 28.06.22
Currently struggling to work out what the 'Caulsum' variable means
- Calculating the contact comfort mean: we have tested code that should calculate the equivalent mean, but we are getting a different output from what is reported
- Possible reasons: either what they reported is wrong, or we are not using the same input they used
- Noticed there are 'Caulsum' and 'Caulperson' variables, but neither is documented in the codebook or the paper, so we don't know what they mean
Creating a matrix using all 9 points
- 'Matrix': why are there two commas?
- `matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)` is what comes up when we look at the matrix help
- a matrix is a 2-dimensional array that has m rows and n columns (Google)
- the end goal is to have a separate dataset: the original dataset minus the exclusion-criteria dataset (this will then be used to calculate the contact comfort mean as well as the box plots)
**Suggestion**: mutate a number of new columns, calculate the difference between each adjacent pair of columns so that 0 = no change and non-zero (1 or -1, recoded as 1) = change, then count how many non-zero values there are (a rough sketch follows below).
- Reproduce and clearly explain the code even if it is long and inefficient
Count function (total of non-zero values)
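A minimal sketch of that suggestion, assuming the WTR choice columns for one anchor are already pulled into `study_1_WTR` and ordered so that a self-to-target switch gives a positive difference (the `wtr_*` column names and values below are placeholders, not the real variable names):

```r
library(dplyr)

# Placeholder data: each wtr_* column holds the 0/1 choice at one price point
study_1_WTR <- tibble(
  wtr_1 = c(0, 0, 0),
  wtr_2 = c(0, 1, 1),
  wtr_3 = c(1, 1, 0),
  wtr_4 = c(1, 1, 1)
)

switch_counts <- study_1_WTR %>%
  mutate(
    # difference between each adjacent pair of columns:
    # 1 = switched from self (0) to other (1), -1 = switched back, 0 = no change
    diff_1 = wtr_2 - wtr_1,
    diff_2 = wtr_3 - wtr_2,
    diff_3 = wtr_4 - wtr_3
  ) %>%
  # count the non-zero differences per row (i.e. treat both 1 and -1 as a switch)
  mutate(n_switch = (diff_1 != 0) + (diff_2 != 0) + (diff_3 != 0))

switch_counts$n_switch  # number of switch points per participant
```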
## Today's progress
Restarting the code for creating the exclusion-criteria dataset
**Step 1**: Mutating 60 new columns, calculating the difference between each column
- Using Danielle's Week 3 module, Video 7, we are able to make a difference column using the mutate function: `diff_1 <- study_1_WTR %>% mutate(diff = c(1) - c(2)) %>% arrange(diff)`
- originally did it with the label of the column, `diff_1 <- study_1_WTR %>% mutate(diff = c(X37_.13) - c(X37_.6)) %>% arrange(diff)`, which gave the output successfully!
**Step 2**: To avoid mutating this for every column individually (see the sketch after this list),
- use for loops over the columns and datasets
- we want to generate 60 new columns; `transmute()` can help, since it works like mutate but keeps only the newly created columns and drops the old ones
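One way to avoid typing 60 separate `mutate()` calls is to build the difference columns programmatically. A sketch under the assumption that the 0/1 choice columns for one anchor sit in positions 1 to 10 of `study_1_WTR` (the `wtr_*` names below are placeholders):

```r
library(dplyr)

# Placeholder: 10 choice columns (0 = self, 1 = other) for one WTR anchor
study_1_WTR <- as_tibble(matrix(rbinom(30, 1, 0.5), nrow = 3,
                                dimnames = list(NULL, paste0("wtr_", 1:10))))

# Build diff_1 ... diff_9 in a loop instead of writing each mutate() by hand
diffs <- study_1_WTR
for (i in 1:9) {
  diffs[[paste0("diff_", i)]] <- study_1_WTR[[i + 1]] - study_1_WTR[[i]]
}

# transmute() is the mutate() variant that keeps only the columns it creates,
# so it could be used instead to drop the original choice columns in one step
```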
Problem we found: there should be a 0 = self, 1 = other coding. The codebook says '-0.25 = self and -0.45 = other', but for some reason column X37_13 has this the other way around (going by the original code). This is inconsistent with the other columns, which makes the subsequent output confusing.
Installed a new package, matrixStats, because it can sum the differences across all rows and columns
1. turned the WTR data into a matrix and applied a row-difference function
2. turned it back into a data frame (not a matrix)
3. changed all -1 values to 1
4. this gave the number of switch points; we counted them for each group using rowSums, which worked
5. now trying to exclude participants with > 2 switch points
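A sketch of those five steps, reusing the `study_1_WTR` placeholder from the sketch above (`rowDiffs()` is the matrixStats function that takes differences between adjacent columns within each row):

```r
library(matrixStats)

# 1. turn the WTR choice columns into a matrix and take row-wise differences
wtr_mat  <- as.matrix(study_1_WTR)   # 0/1 choice columns for one anchor
wtr_diff <- rowDiffs(wtr_mat)        # adjacent-column differences per row

# 2-3. back to a data frame, recoding -1 as 1 so every change counts as a switch
wtr_diff <- as.data.frame(abs(wtr_diff))

# 4. number of switch points per participant (repeat per anchor)
wtr_diff$n_switch <- rowSums(wtr_diff)

# 5. flag participants with more than two switch points for exclusion
which(wtr_diff$n_switch > 2)
```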
Throughout the week we all worked on the code individually:
The data exclusion criteria include:
1. Participants with more than 2 switch points (>2) within any of the 6 WTR anchors (n = 35)
- TO CLARIFY: a switch point is defined as going from self to target (and NOT target to self), i.e. from 0 to 1, making the difference +1
- this only works if the code selects the columns running forwards:
- `study_1_WTR <- study_1 %>% select(40:49, 50:59, 60:69, 70:79, 80:89, 90:99)`
- the way the original code works (columns reversed), the difference has to be -1, but it still represents a switch from self to target
- Remove repeated cases (e.g. if a participant had >2 switch points in both set 2 and set 4, that counts as 1 exclusion)
- Only need to consider non-zero values
2. 3 participants whose descriptions of their partners were nonsensical or demonstrated poor English
3. 2 participants who selected a gender option indicating they were neither a man nor a woman
Therefore the total number of excluded participants = 35 + 3 + 2 = 40
This leaves 464 remaining participants
However, we suspect there may be overlap between the exclusion criteria because:
- excluding based on WTR alone leaves 472 participants
- excluding based on sex and language (criteria 2 and 3) leaves 499 participants
- excluding on all criteria leaves 470 participants
*Helen's check of Study 1 Exclusion Data*
- Participants 178 and 503 are excluded as their gender option is neither man nor woman (3 = other)
- Participants 55, 421, 473 are excluded due to English ability
- To apply criterion 1, in Excel I took the difference between the columns and only focused on values of 1, so I changed all -1 values to 0
- Note: I took the differences between columns 2-1, 3-2, ..., 10-9
- In this Excel sheet, I have changed the columns that have -0.25 and -0.45 to 1 and 0, even though it is self (0) and other (1), because that is how they put it in the data
- If this doesn't work then I will change it back to 0 and 1 to match what the codebook says
Sophie: If we also count going from 0 to a non-zero value in the first column, there are 6 additional exclusions (because if the very first value is non-zero, the participant technically did not choose themselves first before choosing the target)
## 01.07.22
- Column differences (self to other): 2-1, 3-2, 4-3, etc. (a positive 1)
- Follow their data
1. Either what they put in their codebook is the same as what they did
2. Or they messed up and what they put in their codebook is not right
Currently:
We looked at only the '1' values; we end up with 483 participants even after applying the exclusion criteria, instead of the 464 reported.
- `mutate(switchpoint1 = rowSums(thing[, 1:9]))`
- creates a new column, `switchpoint1` (one per anchor), that adds up how many of the difference columns contain a 1
- `filter()`
- cuts out any row where that count is > 2
- We are adding a column of participant numbers
- and selecting just the columns related to the WTR ratios:
- `study_1_WTR <- study_1 %>% select(49:40, 59:50, 69:60, 79:70, 89:80, 99:90) %>% mutate(id = row_number())`
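Putting those fragments together, here is a hedged sketch of the whole step for one anchor. `study_1` stands for the raw Study 1 data frame, the column positions 40:49 follow the select() calls above, and abs() counts a switch in either direction (restrict to positive differences if only self-to-target switches should count):

```r
library(dplyr)
library(matrixStats)

# Switch points within one anchor: adjacent-column differences, any change counted
wtr_switches <- rowSums(abs(rowDiffs(as.matrix(study_1[, 40:49]))))

study_1_kept <- study_1 %>%
  mutate(id = row_number(),            # keep a participant number for later checks
         switchpoint1 = wtr_switches) %>%
  filter(switchpoint1 <= 2)            # drop participants with more than 2 switches
```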
Sophie's Excel
Excluded: 16, 24, 34, 55 [English exclusion = double-up with switchpoint; they didn't write this in the paper], 76, 117 [marked as unsure as it starts with 'other' (non-zero)], 124, 126, 147, 150, 178 (sex, but not a double-up), 183 (on our code but not on the Excel raw data; code counting 0 to 0.25), 193, 194, 199, 205 (on our code but not on the Excel raw data: -0.25 cases), 241, 281, 295, 305 (on our code but not on the Excel raw data), 320, 358, 361 (on the Excel but not on our code; involves -0.45; haven't excluded), 363, 376, 397, 411, 415, 421 [English exclusion = double-up with switchpoint; they didn't write this in the paper], 424, 429, 448, 473 [English exclusion = double-up with switchpoint; they didn't write this in the paper], 494, 503 (sex, but not a double-up)
Code (n=33, just for switchpoint): 16, 24, 34, 55, 76, 117, 124, 126, 147, 150, 183, 193, 194, 199, 205, 241, 281, 295, 305, 320, 358, **361**, 363, 376, 397, 411, 415, 421, 424, 429, 448, 473, 494
In total, the Excel removed the participants listed above.
The `finaldata` variable from the original code has 467 participants: it excludes 3 more people than our code does (which leaves 470), but 3 fewer than stated in the paper (which reports 464 remaining).
It excluded participants 130, 181 and 364, who were not excluded by our code [we don't know why they were removed; there is no logical reason, and it doesn't match any of their exclusion criteria].
The 3-person discrepancy with the paper's stated 464 is likely because the authors didn't realise there were 3 overlaps between the switchpoint and English exclusions.
The mean contact comfort they reported was 0.06, even though with the same dataset they used we get 0.067, which actually rounds to 0.07.
## Study 2 Exclusion Criteria
> Data exclusion. We excluded 19 participants with
more than two switch points within any of the three WTR
anchors, seven participants whose descriptions of their
partners were nonsensical or demonstrated poor English,
and one participant who described their gender identity
as neither male nor female. The results reported below
are based on the remaining 403 participants. All outcomes of null-hypothesis significance testing (i.e., p <.05) remained when no exclusions were made.
Theoretically, n = 27 excluded, total participants = 430
Participants Excluded
1. Poor English Participants: 7, 55, 63, 93, 173, 209, 283
2. Gender other Participant: 232
3. Switchpoint > 2 Participants (self to other, 0 to 1)
58, 90, 114, 118, 149, 206, 209, 224, 260, 266, 316, 319, 332 (n=13)
total = 21
430-21 = 409
Code exclusions:
- 7 (Eng)
- 55 (Eng)
- 58 (Switch)
- 61 (Not excluded on raw data)
- 62 (Eng)
- 64 (Not excluded on raw data)
- 77 (Not excluded on raw data)
- 90 (Switch)
- 93 (Eng)
- *114* (Not excluded on code)
- 118 (Switch)
- 130 (Not excluded on raw data)
- 140 (Not excluded on raw data)
- 149 (Switch)
- 173 (Eng)
- 181 (Not excluded on raw data)
- 201 (Not excluded on raw data)
- *206* (Not excluded on code)
- 209 (Switch)
- *224* (Not excluded on code)
- 229 (Eng)
- 232 (Eng)
- 260 (Switch)
- 263 (Not excluded on raw data)
- 266 (Switch)
- 283 (Eng)
- 286 (Not excluded on raw data)
- 316 (Switch)
- 319 (Switch)
- 332 (Switch)
**Caitlyn attempt #1 at Study 2 exclusion:**
- Applied study 1 exclusion criteria code to study 2
- Tried to simplify creation of Study_2_WTR select function like this:
```
study_2_WTR <- study_2 %>%
  select(55:84) # the relevant columns for WTR
```
- Aside from the above, the code is the same as for Study 1 (except with 3 fewer WTR anchors, and thus 3 fewer switchpoint columns, etc.)
- This resulted in a dataset of 413 people (10 more than stated in the paper)
- Compared this with the authors' final dataset by adding a mutated participant-number column into the original code (a sketch of one way to do such a comparison follows this list)
- discrepancies between their final dataset and ours are listed below:
- 61, 64, 90, 118, 181, 201, 229, 260, 263, 266, 286
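Once both datasets carry a participant number, the comparison can be scripted rather than done by eye. A sketch under assumed column names (`id` for ours, `part_num` for the authors' final dataset; `anti_join()` keeps the rows of its first argument that have no match in the second):

```r
library(dplyr)

# Placeholder data frames: one row per retained participant
ours   <- tibble(id       = c(1, 2, 3, 5, 8))
theirs <- tibble(part_num = c(1, 2, 3, 6, 8))

# Kept by our code but dropped by the authors, and vice versa
only_in_ours   <- anti_join(ours,   theirs, by = c("id" = "part_num"))   # id 5
only_in_theirs <- anti_join(theirs, ours,   by = c("part_num" = "id"))   # part_num 6
```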
**Caitlyn attempt #2 at Study 2 exclusion:**
- As the only thing I had changed (apart from naming) was that select function, I copied the same format as Study 1, so it then looked like this:
```
study_2_WTR <- study_2 %>%
  select(64:55, 74:65, 84:75) # the three sections of columns per anchor
```
- Once I changed this and ran the code again, I got a dataset with 404 observations, just one more than the original authors'
- So I cross-checked the original code's final dataset with mine and found the one outlier: participant 263. Like those 3 unexplained exclusions in Study 1, I can see no logical reason to exclude this participant, so I manually removed them by participant number
- The dataset is now identical to that used by the original authors: exclusion criteria achieved :)
## Study 3 Exclusion Criteria
- Tried to run the original script for Study 3: error due to uninstalled packages
- installed the missing packages
- The report says 827 in the final dataset; the FinalData dataset from the original code matches this (unlike Study 1)
- Changed Study 2's exclusion criteria code very minimally
- adjusted select statement to have accurate columns
- changed English_exclude to English_Exclude in final filter statement due to inconsistency of variable naming
- put a '3' instead of a '2' in relevant places so the code pulls on study 3's data
- Result of the above code: dataset with 833 participants (vs 827 needed)
- Mutated a new column in the original script's final dataset:
- `part_num = raw$participant`
- now able to compare participant numbers between the original final dataset and the dataset generated by our code, to find discrepancies in excluded participants (there should be 6)
- Manually compared original dataset to our dataset
- 7 were excluded from original but not ours
- 1 was excluded from ours but not original
- overall discrepancy of 6
- Some of these should be excluded where they aren't being by our code
- e.g. participant 104 has 3 switch points in anchor 3, but isn't being excluded for some reason
- Changed the -1 to 1 in the switchpoint mutation and am now more confused, as this produced a different set that was also mostly correct
- set discrepancies for -1:
- 104, 352, 387, 401, 518, 619, 775 (also removes 488, but original code does not)
- set discrepancies for 1:
- 50, 327, 619, 651, 767
- Items that should vs shouldn't be removed:
- shouldn't be removed:
- 619 (similar to Studies 1 and 2, where participants were excluded for no apparent reason)
- 488 (Original code doesn't)
- should be removed:
- 50, 104, 327, 352, 387, 401, 518, 651, 767, 775
- i.e. all participants missed by the -1 and 1 approaches (except the one they share, 619) should be excluded, but are not being excluded by one of these methods. Why?
### sophie's attempt at study 3 exclusions
- manually compared excluded participant numbers from datasets produced by original code and our code (both -1 and 1 versions) --> 3 datasets in total
- overall discrepancy of 2 participants
- 4 were excluded from original but not ours
- 2 were excluded from ours but not original
- discrepancies:
- -1 version of our code:
- 775, 779 (excluded from original)
- 488, 734 (not excluded from original)
- 1 version of our code:
- 50, 327 (excluded from original)
- number of excluded participants corresponds with remaining number of participants in each dataset, so fairly certain that no participants have been missed in manual count
## 05.07.22 Progress
- Been using the discrepancies for -1
We are reproducing the box-plot figures using code
- Studies 1, 2 and 3 are successfully being reproduced
- For Study 3, our two graphs are switched around (the x-axis points are reversed for no apparent reason)
- found scale_x_reverse() to switch them! (this did not work)
- experimenting by putting the function on different lines
- it worked after adding `x1 = factor(data_excluded_3$Value, c(1, 0))` (see the sketch below)
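A minimal sketch of that fix on placeholder data. Setting the factor levels explicitly controls the left-to-right order of a discrete x axis, which is presumably why `scale_x_reverse()` didn't help (it applies to continuous scales):

```r
library(ggplot2)

# Placeholder data: a 0/1 condition and a continuous outcome
df <- data.frame(Value   = rep(c(0, 1), each = 50),
                 comfort = c(rnorm(50, 0), rnorm(50, 1)))

# Listing the levels as c(1, 0) puts level 1 on the left and level 0 on the right
df$x1 <- factor(df$Value, levels = c(1, 0))

ggplot(df, aes(x = x1, y = comfort)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```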
How to combine ggplot graphs together
- Overall violin plot
- currently trying to combine the ggplots together into one figure, sharing a y-axis
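We haven't settled on a method yet; one option (our assumption, not something used in the original code) is the patchwork package, which combines separate ggplot objects into one figure:

```r
library(ggplot2)
library(patchwork)  # cowplot or gridExtra would also work

p1 <- ggplot(mtcars, aes(factor(cyl),  mpg)) + geom_violin()
p2 <- ggplot(mtcars, aes(factor(gear), mpg)) + geom_violin()

# Side-by-side panels; blanking the second panel's y-axis title gives the
# appearance of a shared y axis
p1 + (p2 + labs(y = NULL))
```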
## 12.07.22 Progress
We got the scatterplots! Although we couldn't reproduce them with our own code, we were able to use the original code
- the variable WTRTOTAL could not be found because the authors did not share it, so we are unable to reproduce this part ourselves
- but we were able to get the scatterplots in RStudio successfully and played around with their aesthetics
Presentation Plan: 10 mins long
1. What was the point of your paper? What did they do and what did they find? Sophie
2. What was your reproducibility goal, what did you set out to do? Helen
3. How did you go?
- Demographics, means: everyone (a triumph, not really a challenge): Sophie
- Exclusion criteria: how we came to work on this, which came up while we were doing contact comfort - Caitlyn
- Violin plots: Hana
- Scatterplots: Helen
4. What did you learn from the process?
- About learning R: Sophie
- about computational reproducibility: Helen
- about working in a team: Hana
Time:
Part 1 & 2: 2 mins max
- Sophie: 1 min
- Helen: 1 min
Part 3: 5 mins
- Sophie: <1 min
- Caitlyn: 2-3 min
- Hana: <1 min
- Helen: <1 min
Part 4: 3 mins
- Sophie: 1 min
- Helen: 1 min
- Hana: 1 min
## study 1 WTR exclusions (sophie)
'ambiguous cases' = cases where the first column in the anchor had a non-zero value and this was counted as a switch point
first count: 39 excluded (including 6 ambiguous cases)
- realised that I had counted some 0 -> -0.25 pairs (for both ambiguous and non-ambiguous cases), which was wrong since -0.25 = self, so a second count was attempted with this in mind
second count: 36 excluded (including 9 ambiguous cases)
- realised that there were some overlaps with the English exclusions, so the third count accounted for these overlaps
third count: 33 excluded (including 8 ambiguous cases)