Assignment 0
---
title: "Assignment 1 - Data Cleaning - Instructions"
author: NATASHA K: JØRGENSEN
date: "[FILL IN THE DATE]"
output: html_document
---
# Brushing up your code skills
In this first part of the assignment we will brush up your programming skills, and make you familiar with the data sets you will be analyzing for assignment 1.
In this warm-up assignment you will:
1) Create a Github (or gitlab) account, link it to your RStudio, and create a new repository/project
2) Use small nifty lines of code to transform several data sets into just one. The final data set will contain only the variables that are needed for the analysis in the next parts of the assignment
3) Warm up your tidyverse skills (especially the sub-packages stringr and dplyr), which you will find handy for later assignments.
N.B: Usually you'll also have to doc/pdf with a text. Not for Assignment 0.
## Learning objectives:
- Become comfortable with tidyverse (and R in general)
- Test out the git integration with RStudio
- Build expertise in data wrangling (which will be used in future assignments)
## 0. First an introduction on the data
# Language development in Autism Spectrum Disorder (ASD)
Reference to the study: https://www.ncbi.nlm.nih.gov/pubmed/30396129
Background: Autism Spectrum Disorder (ASD) is often related to language impairment, and language impairment strongly affects the patients ability to function socially (maintaining a social network, thriving at work, etc.). It is therefore crucial to understand how language abilities develop in children with ASD, and which factors affect them (to figure out e.g. how a child will develop in the future and whether there is a need for language therapy).
However, language impairment is always quantified by relying on the parent, teacher or clinician subjective judgment of the child, and measured very sparsely (e.g. at 3 years of age and again at 6).
In this study we videotaped circa *30 kids with ASD* and *circa 30 comparison kids* (matched by linguistic performance at visit 1) for ca. 30 minutes of naturalistic interactions with a parent. We repeated the data collection *6 times per kid*, with 4 months between each visit.
We transcribed the data and counted:
i) the amount of words that each kid uses in each video. Same for the parent.
ii) the amount of unique words that each kid uses in each video. Same for the parent.
iii) the amount of morphemes per utterance (Mean Length of Utterance) displayed by each child in each video. Same for the parent.
Different researchers involved in the project provide you with different data sets:
1) demographic and clinical data about the children (recorded by a clinical psychologist)
2) length of utterance data (calculated by a linguist)
3) amount of unique and total words used (calculated by a fumbling jack-of-all-trade, let's call him RF)
Your job in this assignment is to double check the data and make sure that it is ready for the analysis proper (Assignment 2), in which we will try to understand how the children's language develops as they grow as a function of cognitive and social factors and which are the "cues" suggesting a likely future language impairment.
## 1. Let's get started on GitHub
In the assignments you will be asked to upload your code on Github and the GitHub repositories will be part of the portfolio, therefore all students must make an account and link it to their RStudio (you'll thank us later for this!).
Follow the link to one of the tutorials indicated in the syllabus:
* Recommended: https://happygitwithr.com/
* Alternative (if the previous doesn't work): https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN
* Alternative (if the previous doesn't work): https://docs.google.com/document/d/1WvApy4ayQcZaLRpD6bvAqhWncUaPmmRimT016-PrLBk/mobilebasic
N.B. Create a GitHub repository for the Assignment 1 and link it to a project on your RStudio.
## 2. Now let's take dirty dirty data sets and make them into a tidy one
If you're not in a project in Rstudio, make sure to set your working directory here.
If you created an RStudio project, then your working directory (the directory with your data and code for these assignments) is the project directory.
```{r}
getwd()
pacman::p_load(tidyverse,
janitor,
here,
knitr,
namespace,
readr,
stringr,
dplyr)
setwd("C:/Users/admin/Documents/Methods-3")
here()
```
Load the three data sets, after downloading them from dropbox and saving them in your working directory:
* Demographic data for the participants: https://www.dropbox.com/s/w15pou9wstgc8fe/demo_train.csv?dl=0
* Length of utterance data: https://www.dropbox.com/s/usyauqm37a76of6/LU_train.csv?dl=0
* Word data: https://www.dropbox.com/s/8ng1civpl2aux58/token_train.csv?dl=0
```{r}
lu_train <- read.csv("LU_train.csv")
demo_train <- read.csv("demo_train.csv", stringsAsFactors = T)
token_train <- read.csv("token_train.csv", stringsAsFactors = T)
```
Individual variability in group level
Individual variability based on certain population variables
Multilevel models and mixed effect models
- multilevel is better :))) - not random, there's just different levels
Visualize effect of difference between levels
Explore the 3 data sets (e.g. visualize them, summarize them, etc.). You will see that the data is messy, since the psychologist collected the demographic data, the linguist analyzed the length of utterance in May 2014 and the fumbling jack-of-all-trades analyzed the words several months later.
In particular:
- the same variables might have different names (e.g. participant and visit identifiers)
- the same variables might report the values in different ways (e.g. participant and visit IDs)
Welcome to real world of messy data :-)
```{r}
head(lu_train)
head(demo_train)
head(token_train)
```
Before being able to combine the data sets we need to make sure the relevant variables have the same names and the same kind of values.
So:
*2a)* Identify which variable names do not match (that is are spelled differently) and find a way to transform variable names.
Pay particular attention to the variables indicating participant and visit.
Tip: look through the chapter on data transformation in R for data science (http://r4ds.had.co.nz). Alternatively you can look into the package dplyr (part of tidyverse), or google "how to rename variables in R". Or check the janitor R package. There are always multiple ways of solving any problem and no absolute best method.
```{r}
demo_train <- demo_train %>%
rename (SUBJ = Child.ID)
upper_all <- function(a,b,c){
names(a) <- toupper(names(a))
names(b) <- toupper(names(b))
names(c) <- toupper(names(c))
}
upper_all <- function(a,b,c){
df.list <- list(a,b,c)
toupper(names(df.list))
}
upper_all(demo_train, lu_train, token_train)
#The function above did not work, so I made it by hand .. :/
names(demo_train) <- toupper(names(demo_train))
names(lu_train) <- toupper(names(lu_train))
names(token_train) <- toupper(names(token_train))
#Making the visit column into factors
demo_train$VISIT <- as.factor(demo_train$VISIT)
LU_train$VISIT <- as.factor(LU_train$VISIT)
token_train$VISIT <- as.factor(token_train$VISIT)
```
2b. Find a way to homogenize the way "visit" is reported (visit1 vs. 1).
Tip: The stringr package is what you need. *str_extract()* will allow you to extract only the digit (number) from a string, by using the regular expression \\d.
```{r}
#with str_extract
lu_train$VISIT <- str_extract(lu_train$VISIT, "\\d+") #extract only the numerical value from the string "visit1."
token_train$VISIT <- str_extract(token_train$VISIT, "\\d+")
#with parse numbers
LU_train$VISIT <- parse_number(LU_train$VISIT)
token_train$VISIT <- parse_number(token_train$VISIT)
```
2c. We also need to make a small adjustment to the content of the Child.ID column in the demographic data. Within this column, names that are not abbreviations do not end with "." (i.e. Adam), which is the case in the other two data sets (i.e. Adam.). If The content of the two variables isn't identical the rows will not be merged.
A neat way to solve the problem is simply to remove all "." in all data sets.
Tip: stringr is helpful again. Look up str_replace_all
Tip: You can either have one line of code for each child name that is to be changed (easier, more typing) or specify the pattern that you want to match (more complicated: look up "regular expressions", but less typing)
```{r}
#str_replace_all
lu_train$SUBJ <- str_replace_all(lu_train$SUBJ, "\\.","")
token_train$SUBJ <- str_replace_all(token_train$SUBJ, "\\.","")
demo_train$SUBJ <- str_replace_all(demo_train$SUBJ, "\\.","")
#Doing the same but with gsub
demo_train$SUBJ <- gsub("\\.","", demo_train$SUBJ)
```
2d. Now that the nitty gritty details of the different data sets are fixed, we want to make a subset of each data set only containig the variables that we wish to use in the final data set.
For this we use the tidyverse package dplyr, which contains the function select().
The variables we need are:
* Child.ID,
* Visit,
* Diagnosis,
* Ethnicity,
* Gender,
* Age,
* ADOS,
* MullenRaw,
* ExpressiveLangRaw,
* Socialization
* MOT_MLU,
* CHI_MLU,
* types_MOT,
* types_CHI,
* tokens_MOT,
* tokens_CHI.
Most variables should make sense, here the less intuitive ones.
* ADOS (Autism Diagnostic Observation Schedule) indicates the severity of the autistic symptoms (the higher the score, the worse the symptoms). Ref: https://link.springer.com/article/10.1023/A:1005592401947
* MLU stands for mean length of utterance (usually a proxy for syntactic complexity)
* types stands for unique words (e.g. even if "doggie" is used 100 times it only counts for 1)
* tokens stands for overall amount of words (if "doggie" is used 100 times it counts for 100)
* MullenRaw indicates non verbal IQ, as measured by Mullen Scales of Early Learning (MSEL https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-1698-3_596)
* ExpressiveLangRaw indicates verbal IQ, as measured by MSEL
* Socialization indicates social interaction skills and social responsiveness, as measured by Vineland (https://cloudfront.ualberta.ca/-/media/ualberta/faculties-and-programs/centres-institutes/community-university-partnership/resources/tools---assessment/vinelandjune-2012.pdf)
Feel free to rename the variables into something you can remember (i.e. nonVerbalIQ, verbalIQ)
```{r}
lu_train_sub <- lu_train %>%
select(SUBJ,
VISIT,
MOT_MLU,
CHI_MLU)
demo_train_sub <- demo_train %>%
select(SUBJ,
VISIT,
ETHNICITY,
DIAGNOSIS,
GENDER,
AGE,
ADOS,
MULLENRAW,
EXPRESSIVELANGRAW)
token_train_sub <- subset(token_train, select = -8)
```
2e. Finally we are ready to merge all the data sets into just one.
Some things to pay attention to:
* make sure to check that the merge has included all relevant data (e.g. by comparing the number of rows)
* make sure to understand whether (and if so why) there are NAs in the data set (e.g. some measures were not taken at all visits, some recordings were lost or permission to use was withdrawn)
```{r}
merged <- inner_join(lu_train_sub, token_train_sub)
merged <- inner_join(merged, demo_train_sub)
```
2f. Only using clinical measures from Visit 1
In order for our models to be useful, we want to minimize the need to actually test children as they develop. In other words, we would like to be able to understand and predict the children's linguistic development after only having tested them once. Therefore we need to make sure that our ADOS, MullenRaw, ExpressiveLangRaw and Socialization variables are reporting (for all visits) only the scores from visit 1.
A possible way to do so:
* create a new data set with only visit 1, child id and the 4 relevant clinical variables to be merged with the old dataset
* rename the clinical variables (e.g. ADOS to ADOS1) and remove the visit (so that the new clinical variables are reported for all 6 visits)
* merge the new data set with the old
```{r}
```
2g. Final touches
Now we want to
* anonymise our participants (they are real children!).
* make sure the variables have sensible values. E.g. right now gender is marked 1 and 2, but in two weeks you will not be able to remember, which gender were connected to which number, so change the values from 1 and 2 to Female and Male in the gender variable (calling Female F would create issues, since F is also used for FALSE). For the same reason, you should also change the values of Diagnosis from A and B to ASD (autism spectrum disorder) and TD (typically developing). Tip: Try taking a look at ifelse(), or google "how to rename levels in R".
* Save the data set using into a csv file. Hint: look into write.csv()
```{r}
```