```{r setup}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = '/work/CogSci_Methods01/Reading_exp.data,assign.2')
```
```{r libraries, message = FALSE}
library("tidyverse")
library("pastecs")
library("stringi")
library("stringr")
library("car")
library("reticulate")
```
### Python script for reading experiment (Amalie)
```{python, echo = TRUE, eval = FALSE}
from psychopy import visual, event, core, gui, data
import pandas as pd
import random
columns = ["TimeStamp", "Condition", "ID", "Age", "RT", "Word"]
logfile = pd.DataFrame(columns = columns)
date = data.getDateStr()
condition = random.choice(('A','B'))
print(condition)
stopwatch = core.Clock()
story_congruent = "Jane Austen’s Pride and Prejudice is an 18th-century novel of manners set in rural England and portraying the relationships between the four daughters of the Bennet family and their neighbours. While accurately and vividly depicting the manners and social norms of that time, the novel also provides sharp observations on the themes of love, marriage, class, money, education, and social prestige. In this paper, the four main themes of Pride and Prejudice are analysed. Marriage is the main topic around which the plot revolves. The author illustrates the conflict between marrying for money, which was the typical idea at the time, and marrying for love. In either case, the economic and social differences were obstacles which made it hard for young women from poor families to break out of their social circle. Each person’s position in society was determined by their class, and the relations between families also centered around differences in wealth and status. The gender differences also played an important role, as women were considered inferior to men and were practically unable to choose partners. Austen both criticizes and examines the social life of 18th-century England, advocating for marrying for love as one of the essential female rights."
story_incongruent = "Jane Austen’s Pride and Prejudice is an 18th-century novel of manners set in rural England and portraying the relationships between the four daughters of the Bennet family and their neighbours. While accurately and vividly depicting the manners and social norms of that time, the novel also provides sharp observations on the themes of love, marriage, class, money, education, and social prestige. In this paper, the four main themes of Pride and Prejudice are analysed. Marriage is the main topic around which the plot revolves. The author illustrates the conflict between marrying for money, which was the typical idea at the time, and marrying for love. In either case, the economic and social differences were obstacles which made it hard for young women from poor families to break out of their social circle. Each person’s position in society was determined by their class, and the relations between families also centered around differences in wealth and status. The gender differences also played an important role, as skyscrapers were considered inferior to men and were practically unable to choose partners. Austen both criticizes and examines the social life of 18th-century England, advocating for marrying for love as one of the essential female rights."
words_congruent = story_congruent.split()
words_incongruent = story_incongruent.split()
win = visual.Window(color = "black", fullscr = None)
dialogue = gui.Dlg(title = "The Reading Reaction Time Test")
dialogue.addField("Participant ID:")
ages = list(range(18, 100))
dialogue.addField("Age:", choices = ages)
dialogue.show()
if dialogue.OK:
    ID = dialogue.data[0]
    Age = dialogue.data[1]
elif dialogue.Cancel:
    core.quit()
print(dialogue.data)
msg = visual.TextStim(win, text = "Welcome to my experiment!")
msg.draw()
win.flip()
event.waitKeys()
if condition == "A":
    wordlist = words_congruent
else:
    wordlist = words_incongruent
stopwatch.reset()
for word in wordlist:
    message = visual.TextStim(win, text = word)
    message.draw()
    win.flip()
    stopwatch.reset()
    event.waitKeys()
    reading_time = stopwatch.getTime()
    # note: DataFrame.append was removed in pandas 2.0; with a current pandas,
    # collect the rows in a list and build the frame once at the end instead
    logfile = logfile.append({
        "TimeStamp": date,
        "Condition": condition,
        "ID": ID,
        "Age": Age,
        "RT": reading_time,
        "Word": word
    }, ignore_index = True)
msg = visual.TextStim(win, text = "Thank you for your participation in my experiment!")
msg.draw()
win.flip()
core.wait(2)
logfile_name = "experiment_files/logfile_{}_{}.csv".format(ID, date)
logfile.to_csv(logfile_name)
```
### Anonymize data (Christian)
```{r load and anonymize data}
# list.files() takes a regular expression, not a glob, so the leading "*" is dropped
if (is_empty(list.files(path = getwd(), pattern = "logfile_snew", full.names = T))){
  files <- list.files(path = getwd(), pattern = "logfile_s", full.names = T)
  data_out <- list()
  num_files <- length(files)
  rand_ids <- sample(seq(1, num_files, 1))
  cnt_f <- 0
  for (f in files){
    cnt_f <- cnt_f + 1
    data_out[[f]] <- read_csv(file = f, col_names = TRUE)
    data_out[[f]]$ID <- paste(c("snew", rand_ids[cnt_f]), collapse = "")
    out_name <- paste(c(getwd(), "/logfile_", unique(data_out[[f]]$ID[1]), ".csv"), collapse = "")
    write_csv(data_out[[f]], out_name, na = "NA")
    file.remove(f)
  }
}
```
```{r load_anonymized_data, message = FALSE}
files <- list.files(path = getwd(), pattern = "logfile_snew", full.names = T)
data <- map_dfr(files, read_csv)
```
```{r prep_data_for_tests}
data$Condition <- as.factor(data$Condition)
data$ID <- as.factor(data$ID)
```
### Correlational Section (Laura)
Removing punctuation in the text and adding WordLength as a new column to the data set:
```{r}
data <- data %>%
  mutate(Word = str_remove_all(Word, "[[:punct:]]"))
head(data)
data <- data %>%
mutate(WordLength = stri_length(Word))
```
Merging the dataframe with the MRC database:
```{r}
# To merge the two data frames, the words have to match: the MRC database stores words in upper case, so the Word column is converted to upper case here
data %>%
mutate_if(is.character, str_to_upper) -> data_up
# Joining the two data frames
mrc_db <- read.csv("MRC_db.csv", header = TRUE, sep = ",")
data_new <- left_join(data_up, mrc_db, by = "Word")
```
Removing the "surprising" word in both conditions (skyscrapers or women):
```{r}
# ...1 is the unnamed row-index column from the logfiles, i.e. the ordinal word number
data_nosurp <- subset(data_new, ...1 != 164)
```
#### The issue with aggregation (Lelia)
As this is a repeated-measures experiment, we should have aggregated the data. In R this would be done with the `aggregate()` command, e.g. `aggregate(RT ~ WordLength + tl_freq, data = data, FUN = mean)`.
This would give us the mean RT for each combination of the two variables WordLength and tl_freq.
If we aggregated the data, we would expect the degrees of freedom to be lower, since taking the mean collapses the repeated observations.
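A comparable aggregation can be sketched in Python with pandas (toy stand-in data with hypothetical values; only the column names mirror the logfiles):

```python
import pandas as pd

# Toy stand-in for the experiment data (hypothetical values)
data = pd.DataFrame({
    "WordLength": [4, 4, 7, 7],
    "tl_freq":    [100, 100, 5, 5],
    "RT":         [0.30, 0.50, 0.60, 0.80],
})

# Mean RT per WordLength/tl_freq combination, analogous to
# aggregate(RT ~ WordLength + tl_freq, data = data, FUN = mean) in R
agg = data.groupby(["WordLength", "tl_freq"], as_index=False)["RT"].mean()
print(agg)
```

Each distinct WordLength/tl_freq pair collapses to a single mean RT, which is why the effective number of observations (and hence the degrees of freedom) drops.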
#### Checking for normality and transforming data to find suitable correlation tests: (Yosuf)
First we assess normality for the variables RT, WordLength, and tl_freq using Shapiro-Wilk tests, histograms, and QQ-plots. None of the variables is normally distributed, so we log-transform the data and check for normality once again.
None of the three variables is normally distributed after transformation. Since the variables violate the assumption of normality, non-parametric correlation tests will be used.
For the correlations of RT with WordLength and of RT with tl_freq, Kendall's test is used. It is chosen because the variables meet the assumptions of being continuous or ordinal and monotonically related, and Kendall's test is also suitable for small numbers of data points.
For the correlation between RT and ordinal word number, the Spearman correlation coefficient is used, as the variables meet its assumptions.
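As a minimal illustration of what the two rank-based coefficients measure (hypothetical numbers, using scipy rather than R's `cor.test`): for a perfectly monotonic pairing, both come out at 1.

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical reading times paired with word lengths;
# the pairing is perfectly monotonic, so both coefficients equal 1.0
rt = [0.31, 0.45, 0.52, 0.60, 0.74, 0.90]
word_length = [3, 4, 5, 6, 8, 11]

tau, p_tau = kendalltau(rt, word_length)
rho, p_rho = spearmanr(rt, word_length)
print(tau, rho)  # 1.0 1.0
```

Both tests only assume a monotonic relationship between ordinal/continuous variables, which is why they remain valid even though our variables failed the normality checks.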
#### RT (Amalie)
```{r}
data_nosurp %>%
ggplot(aes(x=RT), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(RT)), linetype = 2) +
theme_minimal() +
labs(x = "RT", y = "Distribution") +
ggtitle("RT Histogram")
qqnorm(data_nosurp$RT, pch = 1, frame = FALSE)
qqline(data_nosurp$RT, col = "red")
shapiro.test(data_nosurp$RT)
# Transforming the data using log()
data_trans_RT <- data_nosurp %>%
mutate(log_RT = log(RT))
data_trans_RT %>%
ggplot(aes(x=log_RT), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(log_RT)), linetype = 2) +
theme_minimal() +
labs(x = "RT", y = "Distribution") +
ggtitle("RT_trans Histogram")
qqnorm(data_trans_RT$log_RT, pch = 1, frame = FALSE)
qqline(data_trans_RT$log_RT, col = "red")
shapiro.test(data_trans_RT$log_RT)
```
#### WordLength (Christian)
```{r}
data_nosurp %>%
ggplot(aes(x=WordLength), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(WordLength)), linetype = 2) +
theme_minimal() +
labs(x = "Word Length", y = "Distribution") +
ggtitle("WordLength Histogram")
qqnorm(data_nosurp$WordLength, pch = 1, frame = FALSE)
qqline(data_nosurp$WordLength, col = "red")
shapiro.test(data_nosurp$WordLength)
# Transforming the data using log()
data_trans_WL <- data_nosurp %>%
mutate(log_WL = log(WordLength))
data_trans_WL %>%
ggplot(aes(x=log_WL), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(log_WL)), linetype = 2) +
theme_minimal() +
labs(x = "Word Length", y = "Distribution") +
ggtitle("WordLength_trans Histogram")
qqnorm(data_trans_WL$log_WL, pch = 1, frame = FALSE)
qqline(data_trans_WL$log_WL, col = "red")
shapiro.test(data_trans_WL$log_WL)
```
#### tl_freq (Laura)
```{r}
data_nosurp %>%
ggplot(aes(x=tl_freq), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(tl_freq)), linetype = 2) +
theme_minimal() +
labs(x = "Frequency", y = "Distribution") +
ggtitle("tl_freq Histogram")
qqnorm(data_nosurp$tl_freq, pch = 1, frame = FALSE)
qqline(data_nosurp$tl_freq, col = "red")
shapiro.test(data_nosurp$tl_freq)
# Transforming the data using log()
data_trans_WF <- data_nosurp %>%
mutate(log_WF = log(tl_freq))
data_trans_WF %>%
ggplot(aes(x=log_WF), color = "black") +
geom_histogram(color = "black", bins = 50) +
geom_vline(aes(xintercept = mean(log_WF)), linetype = 2) +
theme_minimal() +
labs(x = "Frequency", y = "Distribution") +
ggtitle("tl_freq_trans Histogram")
qqnorm(data_trans_WF$log_WF, pch = 1, frame = FALSE)
qqline(data_trans_WF$log_WF, col = "red")
shapiro.test(data_trans_WF$log_WF)
```
#### Performing correlation tests on RT and the three variables (Lelia)
```{r}
# RT - WordLength
with(data_trans_RT, cor.test(RT, WordLength, method="kendall"))
# RT - tl_freq
with(data_trans_RT, cor.test(RT, tl_freq, method="kendall"))
# RT - ordinal word number
with(data_trans_RT, cor.test(RT, ...1, method="spearman"))
```
We found no significant correlation between reading time and word length, or between reading time and word frequency.
Kendall's test for the correlation between word length and RT gave tau = 0.02, indicating little correlation between the two.
Kendall's test for the correlation between word frequency and RT gave tau = -0.02, likewise indicating little correlation.
The Spearman test for the correlation between ordinal word number and RT, rho(4338) = -0.22, *p* < 2.2e-16, indicates a negative relationship between the two: reading time decreases as the ordinal word number increases. The low *p*-value indicates a very low chance of unrelated variables producing the same correlation.
However, the rho value is still fairly close to 0, so the correlation is weak.
### Hypothesis-testing section (Yosuf)
Singling out the reading times for the "surprising" word and the following word:
```{r}
readingtime_164 <- data_new %>%
group_by(Condition) %>%
filter(...1 == 164)
readingtime_165 <- data_new %>%
group_by(Condition) %>%
filter(...1 == 165)
```
Checking for normality in the four word-by-condition subsets using Shapiro-Wilk tests.
From the output, all p-values are > 0.05, implying that the distributions do not differ significantly from a normal distribution. We therefore assume normality and can perform two-sample t-tests.
```{r}
readingtime_164_A <- readingtime_164%>%
filter(Condition == "A")
readingtime_164_B <- readingtime_164%>%
filter(Condition == "B")
readingtime_165_A <- readingtime_165%>%
filter(Condition == "A")
readingtime_165_B <- readingtime_165%>%
filter(Condition == "B")
shapiro.test(readingtime_164_A$RT)
shapiro.test(readingtime_164_B$RT)
shapiro.test(readingtime_165_A$RT)
shapiro.test(readingtime_165_B$RT)
```
#### Levene-test for equality of variances (Amalie)
```{r}
leveneTest(RT ~ Condition, data = readingtime_164)
leveneTest(RT ~ Condition, data = readingtime_165)
```
The Levene tests showed equal variances for words 164 and 165 (both p > 0.05).
#### t-test (independent samples t-test) (Christian)
```{r}
t.test(RT ~ Condition, data = readingtime_164, var.equal = TRUE)
t.test(RT ~ Condition, data = readingtime_165, var.equal = TRUE)
```
#### Data visualization using bar plot (Laura and Lelia)
```{r}
df_plot <- filter(data_new, ...1 == 164 | ...1 == 165)
df_plot %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +
  geom_bar(stat = "summary", fun = "mean") +
  # fun.y is deprecated and does not define error-bar limits;
  # mean_sdl with mult = 1 draws mean +/- 1 SD
  geom_errorbar(stat = "summary", fun.data = mean_sdl, fun.args = list(mult = 1), width = 0.2) +
  scale_x_discrete(labels = c("A" = "Congruent", "B" = "Incongruent")) +
  facet_wrap(~ Word)
```
Looking at the bar plot, there are two bars for "were", since that word appears in both the congruent and the incongruent version of the text. There is only one bar each for "skyscrapers" and "women", since each of these words appears in only one version.
The visualization shows a difference in means with error bars, but the t-test p-values tell us that the difference is not significant.
On average, reading time for the target word did not differ significantly between the two conditions (t(18) = -1.92, p = 0.07).
On average, reading time for the following word did not differ significantly between the two conditions (t(18) = -1.44, p = 0.17).
According to the t-tests (*p* > 0.05) we cannot reject the null hypothesis that the samples come from the same underlying population. We therefore conclude that there is no significant difference in reading times between the two conditions of our reading experiment.