> # INBO CODING CLUB
24 January 2019
Welcome! Yes very welcome
## CHALLENGE 1
List the issues you encountered during tidying the data:
* A lot of "brol" (junk) before and after the data
* Missing data in columns "Sex" and "Weight", rows 3, 6, 16, 18
* "#####" in column "Weight", row 6
* A "note" below the data (colors)
> [Ine: Hi, take care with the species called "NA": make sure that R does not interpret it as a missing value. I think you can do this by passing `na = ""` to `read_delim()`, so that only empty cells are treated as missing values and cells containing the text "NA" are not. This works.]
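
Ine's tip can be tried on a small example. This is a minimal sketch, assuming a hypothetical semicolon-delimited mini file written to a temporary location (it is not the survey file from the challenge); it compares the default behaviour of `read_delim()` with `na = ""`.

```r
library(readr)

# Hypothetical mini file: species "NA" in row 1, empty weight in row 2,
# genuinely empty species in row 3
tmp <- tempfile(fileext = ".csv")
writeLines(c("species;weight", "NA;10", "DM;", ";12"), tmp)

# Default: both empty cells and the text "NA" are treated as missing
default_read <- read_delim(tmp, delim = ";", col_types = "cd")

# na = "": only empty cells are treated as missing, the text "NA" is kept
survey <- read_delim(tmp, delim = ";", na = "", col_types = "cd")

# Number of missing species under each setting
sum(is.na(default_read$species))
sum(is.na(survey$species))
```

Passing `col_types` explicitly also keeps the species column as character, whatever values it contains.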
## CHALLENGE 2
List the issues you encountered during tidying the data:
* Two tables in the same Excel sheet
* plot 1 has 4 variables, plot 2 only 3
* weight column has unit in plot 2
* "species" and "sex" are merged in 1 column in plot 2
* Dates in the first table are read in as a factor variable
* Dates in plot 2 could be either M/D/Y or D/M/Y?
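One way to settle the M/D/Y versus D/M/Y question is to check which position ever holds a value above 12, since such a value can only be a day. The sketch below uses made-up example dates, not the values from the Excel sheet, and only base R.

```r
# Hypothetical example dates (assumption: slash-separated, four-digit year)
dates <- c("1/24/2019", "2/13/2019", "3/4/2019")

# Split each date into its three components
parts <- do.call(rbind, strsplit(dates, "/"))
first <- as.integer(parts[, 1])
second <- as.integer(parts[, 2])

# A component above 12 cannot be a month
fits_mdy <- all(first <= 12)
fits_dmy <- all(second <= 12)

# Parse with the only format that fits; if both fit, the data alone cannot decide
if (fits_mdy && !fits_dmy) {
  parsed <- as.Date(dates, format = "%m/%d/%Y")
} else if (fits_dmy && !fits_mdy) {
  parsed <- as.Date(dates, format = "%d/%m/%Y")
} else {
  parsed <- NULL # ambiguous or inconsistent: check with the data provider
}
```

If every day in a column happens to be 12 or lower, both formats fit and the ambiguity has to be resolved from the metadata instead.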
Sander's code suggestion:
```r
library(tidyverse)
library(readxl) # read_excel() is not attached by library(tidyverse)

X20190124_survey_part2 <- read_excel("~/R_coding_club/data/20190124_survey_part2.xlsx", skip = 1)

# Plot 1: the first table in the sheet
plot1 <- X20190124_survey_part2 %>%
  select(`Date collected`, Species, Sex, Weight_in_gr = `Weight (g)`) %>%
  filter(!is.na(`Date collected`)) %>%
  mutate(plot_id = 1) %>%
  mutate(Date = as.Date(`Date collected`, format = "%m/%d/%Y")) %>%
  select(-`Date collected`)

# Plot 2: the second table; strip the unit from the weight, split species and sex
plot2 <- X20190124_survey_part2 %>%
  select(Date = `Date collected__1`, species_sex, wgt) %>%
  mutate(Weight_in_gr = as.numeric(gsub(pattern = "g", replacement = "", x = wgt))) %>%
  mutate(Species = substr(x = species_sex, 1, 2)) %>%
  mutate(Sex = substr(x = species_sex, 4, 4)) %>%
  mutate(plot_id = 2) %>%
  select(-wgt, -species_sex)

# Stack both plots into one tidy table
X20190124_survey_part2_tidy <- rbind(plot1, plot2)
```
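As an alternative to the two `substr()` calls above, `tidyr::separate()` can split the merged column in one step. This is only a sketch on made-up values: the `_` separator and the example codes are assumptions, since the actual content of `species_sex` is not shown here.

```r
library(dplyr)
library(tidyr)

# Hypothetical values; assumes species and sex are separated by "_"
plot2_toy <- tibble(
  species_sex = c("DM_M", "PF_F", "DO_M"),
  plot_id = 2
)

# Split the merged column into two columns at the separator
plot2_split <- plot2_toy %>%
  separate(species_sex, into = c("Species", "Sex"), sep = "_")
```

Unlike `substr()`, this does not depend on every species code having exactly two characters.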
## Intermezzo
You can try reading and working with the tidy version of the data yourself:
```r
library(readr)
library(tidyverse)
survey <- read_delim(
  "../data/20190124_survey_data_spreadsheet_tidy.csv",
  delim = ";"
)
test <- survey %>%
  group_by(sex, species) %>%
  summarise(
    median_crap = median(weight_in_g, na.rm = TRUE),
    mean_crap = mean(weight_in_g, na.rm = TRUE)
  ) %>%
  ungroup()
```
## CHALLENGE 3
### Share your code snippet
If you want to share your code snippet, copy and paste it inside a block delimited by three backticks (```):
As an **example**:
```r
library(tidyverse)
```
Sander's code:
```r
library(tidyverse)
main_experiment_tidy <- main_experiment %>%
  gather(key = "Experiment", value = "Optical_density", 4:6) %>%
  mutate(Experiment = gsub(pattern = "OD_", replacement = "", x = Experiment)) %>%
  mutate(Experiment = gsub(pattern = "h", replacement = "", x = Experiment))
```
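Newer versions of tidyr (1.0.0 and later) offer `pivot_longer()` in place of `gather()`. The sketch below is not Sander's method but the same reshaping written with that function, on a made-up stand-in for `main_experiment`; the column names and values are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for main_experiment: three descriptive columns,
# followed by three optical density columns
main_experiment_toy <- tibble(
  AB_r = c("r1", "r2"),
  Bacterial_genotype = c("g1", "g2"),
  Phage_t = c("p1", "p2"),
  OD_0h = c(0.01, 0.02),
  OD_20h = c(0.30, 0.35),
  OD_72h = c(0.80, 0.90)
)

# Wide to long; names_prefix strips "OD_" from the former column names
long <- main_experiment_toy %>%
  pivot_longer(
    cols = starts_with("OD_"),
    names_to = "Experiment",
    names_prefix = "OD_",
    values_to = "Optical_density"
  )
```

Selecting the columns by name with `starts_with("OD_")` rather than by position (`4:6`) keeps the code working if the column order changes.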
Stien and Marijke
```r
gather(X20190124_dryad_arias_hall_v3, "OT", "OD", 4:6)
```
Joost
```r
library(tidyverse)

dataset <- read_delim("../data/20190124_dryad_arias_hall_v3.csv", delim = ",")
# !!! add a row ID to each input row
# combinations of AB_r, bacterial_genotype & phage_t are not unique
# applying gather will lose the information of row identity if a row ID is not added explicitly
dataset <- dataset %>%
  mutate(ID = 1:nrow(.))
# gather, also taking into account that survival and phage_r only have data for hour = 72
# (the joins use all shared columns, including ID and hour)
clean_data <- dataset %>%
  select(-Survival_72h, -PhageR_72h) %>%
  gather("hour", "OD", OD_0h, OD_20h, OD_72h) %>%
  mutate(hour = str_replace(hour, "OD_", "")) %>%
  left_join(dataset %>%
    select(-OD_0h, -OD_20h, -OD_72h, -PhageR_72h) %>%
    gather("hour", "Survival", Survival_72h) %>%
    mutate(hour = str_replace(hour, "Survival_", ""))) %>%
  left_join(dataset %>%
    select(-OD_0h, -OD_20h, -OD_72h, -Survival_72h) %>%
    gather("hour", "PhageR", PhageR_72h) %>%
    mutate(hour = str_replace(hour, "PhageR_", "")))
```
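Joost's point about the row ID can be illustrated on a tiny example. This is a sketch with made-up values, assuming two rows that share the same `AB_r` value: without an ID, the long table no longer shows which measurements belonged to the same original row.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two rows with an identical key value
wide <- tibble(
  AB_r = c("A", "A"),
  OD_0h = c(0.1, 0.2),
  OD_20h = c(0.5, 0.6)
)

# Without ID: each (AB_r, hour) combination occurs twice after gathering
long_no_id <- wide %>%
  gather("hour", "OD", OD_0h, OD_20h)
n_ambiguous <- long_no_id %>%
  count(AB_r, hour) %>%
  filter(n > 1) %>%
  nrow()

# With ID: each (ID, hour) combination identifies exactly one measurement
long_with_id <- wide %>%
  mutate(ID = row_number()) %>%
  gather("hour", "OD", OD_0h, OD_20h)
n_ambiguous_id <- long_with_id %>%
  count(ID, hour) %>%
  filter(n > 1) %>%
  nrow()
```

Here `n_ambiguous` counts the key combinations that occur more than once, which is what makes a later join or spread unreliable.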
Jeroen's clean-up to a numeric variable
```r
d2$OD <- d2$OD %>%
  str_replace("OD_", "") %>%
  str_replace("h", "") %>%
  as.numeric()
```
(*you can copy and paste this example and add your code further down, but do not fill in your code in this section*)
Remark from Frank VDM: explain the structure of projects, why to use projects instead of loose files, how to structure the parts of a project, etc.
Figure illustrating gather: https://datacarpentry.org/R-ecology-lesson/img/gather_data_R.png
A solution to further clean the column `experiment_time_h` of the data frame `main_experiment_tidy`:
```r
# remove "OD_" before hour and "h" after
main_experiment_tidy_cleaned <- main_experiment_tidy %>%
  mutate(experiment_time_h = str_remove(experiment_time_h, pattern = "OD_")) %>%
  mutate(experiment_time_h = str_remove(experiment_time_h, pattern = "h"))
# convert hour from character to integer
class(main_experiment_tidy_cleaned$experiment_time_h) <- "integer"
# check
distinct(main_experiment_tidy_cleaned, experiment_time_h)
```
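A shorter route to the same result is `readr::parse_number()`, which extracts the number from a string in one step. This is a sketch on a made-up vector, not the solution used above; the values are assumed to look like the gathered column names.

```r
library(readr)

# Hypothetical values, as they would look right after gathering
experiment_time_raw <- c("OD_0h", "OD_20h", "OD_72h")

# parse_number() drops the surrounding text and returns a double;
# as.integer() then converts it to an integer vector
experiment_time_h <- as.integer(parse_number(experiment_time_raw))
```

Inside a pipeline the same call fits in a single `mutate()`, which avoids assigning to `class()` afterwards.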