# INBO CODING CLUB 25 April 2024 Welcome! ## Share your code snippet If you want to share your code snippet, copy paste your snippet within a section of three backticks (```): As an **example**: ``` library(tidyverse) ``` (*you can copy paste this example and add your code further down*) ## Yellow sticky notes No yellow sticky notes online. Put your name + " | " and add a "*" each time you solve a challenge (see below). ## Participants Name | Challenges --- | --- Damiano Oldoni | *** Ward Langeraert 🐌 | *** Hans Van Calster | *** Luc De Bruyn | Adriaan Seynaeve | * Pieter Huybrechts | *** Sanne Govaert | ** Siebe Indestege | * Tina Tuerlings | Jolien Van Malderen | * Lies Teunen | Marieke Desender| Emma Cartuyvels| * ## Challenge 1 ### Damiano's solution (example) Copy paste this section to show your solutions. ```r # dummy code print("This is how to insert code.") ``` ### Pieter's solution ```r ## Display first 8 rows; display the last 8 rows. slice_head(obs, n = 8) ## Select columns observationLevel, eventStart, eventEnd and scientificName slice_tail(obs, n = 8) ## Display the distinct values of observationLevel as a data.frame with one ## column, observationLevel. How many different values of observationLevel are ## there? distinct(select(obs, observationLevel)) ### How to remove observations with observationLevel = "media"? How to remove filter(obs, observationLevel != "media") ### observations with observationLevel = "media" and empty scientificName? filter(obs, observationLevel == "media" & !is.na(scientificName)) ### Add a new column called is_classified_by_human which is TRUE if ### classificationMethod is equal to "human". Add a new column month with the ### month of the observation (eventStart). Tip: use lubridate's month() and ### year() functions. mutate(obs, is_classified_human = classificationMethod == "human", month = lubridate::month(eventStart)) %>% glimpse() ## Move the column count after behavior. relocate(obs, count, .after = behavior) %>% glimpse() ## Move all columns starting with "individual" before observationLevel relocate(obs, starts_with("individual"), .before = observationLevel) %>% glimpse() ## Is the following statement true of false: Observations with observationLevel ## = "event" have no mediaID, i.e. mediaID is NA. setequal( filter(obs, observationLevel == "event" & is.na(mediaID)), filter(obs, observationLevel == "event") ) ``` ### Hans ```r ## CHALLENGE 1 obs <- readr::read_csv("./data/20240425/20240425_observations.csv") # Preview glimpse(obs) # Display first 8 rows; display the last 8 rows. head(obs, 8) tail(obs, 8) # Select columns observationLevel, eventStart, eventEnd and scientificName obs %>% select(observationLevel, eventStart, eventEnd, scientificName) # Display the distinct values of observationLevel as a data.frame with one column, observationLevel. obs %>% distinct(observationLevel) # How many different values of observationLevel are there? obs %>% count(observationLevel) # How to remove observations with observationLevel = "media"? How to remove observations with observationLevel = "media" and empty scientificName? obs %>% filter(observationLevel != "media") # How to remove observations with observationLevel = "media" and empty scientificName? obs %>% filter( observationLevel != "media", !is.na(scientificName) ) # Add a new column called is_classified_by_human which is TRUE if classificationMethod is equal to "human". Add a new column month with the month of the observation (eventStart). Tip: use lubridate's month() and year() functions. obs %>% mutate( is_classified_by_human = classificationMethod == "human", month = lubridate::month(eventStart), year = lubridate::year(eventStart) ) # Move the column count after behavior. obs %>% relocate(count, .before = behavior) # Move all columns starting with "individual" before observationLevel obs %>% relocate( starts_with("individual"), .before = observationLevel ) # Is the following statement true of false: Observations with observationLevel = "event" have no mediaID, i.e. mediaID is NA. obs %>% filter( observationLevel == "event" ) %>% select(mediaID) %>% pull() %>% is.na() %>% all() ``` ### Ward's solution ```r ## CHALLENGE 1 # load data obs <- readr::read_csv("./data/20240425/20240425_observations.csv") # preview data glimpse(obs) # 1. Display first 8 rows; display the last 8 rows. slice_head(obs, n = 8) # first slice_tail(obs, n = 8) # last # 2. Select columns observationLevel, eventStart, eventEnd and scientificName obs %>% select("observationLevel", "eventStart", "eventEnd", "scientificName") # 3. Display the distinct values of observationLevel as a data.frame with one # column, observationLevel. How many different values of observationLevel are # there? obs %>% distinct(.data$observationLevel) # how many? obs %>% distinct(.data$observationLevel) %>% nrow() # 4. How to remove observations with observationLevel = "media"? How to remove # observations with observationLevel = "media" and empty scientificName? obs %>% filter(.data$observationLevel != "media") obs %>% filter(.data$observationLevel != "media", !is.na(.data$scientificName)) # 5. Add a new column called is_classified_by_human which is TRUE if # classificationMethod is equal to "human". Add a new column month with the # month of the observation (eventStart). Tip: use lubridate's month() and year() # functions. obs %>% mutate(is_classified_by_human = .data$classificationMethod == "human", month = month(.data$eventStart)) # 6. Move the column count after behavior. obs %>% relocate("count", .after = "behavior") # 7. Move all columns starting with "individual." before observationLevel obs %>% relocate(starts_with("individual"), .before = "observationLevel") # 8. Is the following statement true of false: Observations with # observationLevel = "event" have no mediaID, i.e. mediaID is NA. media_events <- obs %>% filter(.data$observationLevel == "event") %>% pull("mediaID") %>% # are they all NA? is.na() %>% all() ``` ### Emma's solution ```r # 1 obs %>% slice_head(n = 8) obs %>% slice_tail(n = 8) # 2 obs %>% select(observationLevel, eventStart, eventEnd, scientificName) # 3 obs %>% distinct(observationLevel) # 4 obs %>% filter(observationLevel != "media") obs %>% filter(observationLevel != "media" & !is.na(scientificName)) # 5 obs %>% mutate(is_classified_by_human = ifelse( classificationMethod == "human", TRUE, FALSE ), month = month(eventStart)) # 6 obs %>% relocate(count, .after = behavior) # 7 obs %>% relocate(starts_with("individual"), .before = observationLevel) # 8 obs %>% select(observationLevel, mediaID) %>% filter(observationLevel == "event") %>% distinct() ``` ### Siebe's solution ```r # 1 head(obs,8) tail(obs,8) # 2 obs %>% select(observationLevel,eventStart,eventEnd,scientificName) # 3 obs %>% distinct(observationLevel) # 4 obs %>% filter(observationLevel != "media") obs %>% filter(observationLevel != "media" & is.na(scientificName)) # 5 obs <- obs %>% mutate(is_classified_by_human = case_when(classificationMethod == "human" ~ "TRUE", classificationMethod != "human" ~ "FALSE"), month = month(eventStart) ) # 6 obs <- obs %>% relocate(count, .after = behavior) # 7 taxon = individual obs <- obs %>% relocate(individualID,individualPositionRadius, individualPositionAngle,individualSpeed, .before = observationLevel ) # 8 obs %>% filter(observationLevel == "event") %>% distinct(mediaID) #statement = true ``` ### Sanne's solution ```r # 1.1 obs %>% slice_head(n = 8) obs %>% slice_tail(n = 8) # 1.2 obs %>% select(observationLevel, eventStart, eventEnd) # 1.3 obs %>% distinct(observationLevel) obs %>% distinct(observationLevel) %>% nrow() # 1.4 obs %>% filter(observationLevel != "media") obs %>% filter(observationLevel != "media" & !is.na(scientificName)) # 1.5 obs %>% mutate( is_classified_by = if_else(classificationMethod == "human", TRUE, FALSE), month = paste(month(eventStart), year(eventStart)), ) # 1.6 obs %>% relocate(count, .after = behavior) # 1.7 obs %>% relocate(starts_with("individual"), .before = observationLevel) %>% names() # 1.8 obs %>% filter(observationLevel == "event") %>% distinct(mediaID) ``` ## Challenge 2 Hans ``` ## CHALLENGE 2 - Summaries # How many observations are there per deploymentID? Show it as a dataframe with two columns: deploymentID and n_obs. obs %>% count(deploymentID, name = "n_obs") # How many observations are there per deploymentID and observationLevel? obs %>% count(deploymentID, observationLevel, name = "n_obs") # Order the previous output per n_obs (descending order). obs %>% count(deploymentID, observationLevel, name = "n_obs") %>% arrange(desc(n_obs)) # Create a summary dataframe with the number of media-based observations (observationLevel = "media"), the number of event-based observations (observationLevel = "event") and the total number of observations per deploymentID. Notice that observationLevel is always filled in. The result should be a dataframe with 4 columns: deploymentID, n_obs_event, n_obs_media and n_obs. Tip: one way to solve this is by combining dplyr with tidyr. obs %>% count(deploymentID, observationLevel, name = "n_obs") %>% pivot_wider( id_cols = deploymentID, names_from = observationLevel, names_prefix = "n_obs_", values_from = n_obs) %>% mutate( n_obs = n_obs_event + n_obs_media ) #Pieter obs %>% group_by(deploymentID) %>% summarise(n_obs_media = sum(observationLevel == "media"), n_obs_event = sum(observationLevel == "event"), n_obs = n()) # For each month, year and deploymentID, return the eventStart of the first and the last event-based observation (observationLevel = event) if there are 3 or more event-based observations. Call these two columns first_event_obs and last_event_obs respectively. obs %>% mutate( month = lubridate::month(eventStart), year = lubridate::year(eventStart) ) %>% group_by( deploymentID, month, year ) %>% filter( observationLevel == "event", n() >= 3 ) %>% summarise( first_event_obs = first(eventStart, order_by = eventStart), last_event_obs = last(eventStart, order_by = eventStart), .groups = "drop" ) # How can you return the same dataframe as in exercice 5 but now limiting us to the months with the highest number of event-based observations for each year and deploymentID? obs %>% mutate( month = lubridate::month(eventStart), year = lubridate::year(eventStart) ) %>% group_by( deploymentID, month, year ) %>% mutate( n_obs = n() ) %>% group_by( deploymentID, year ) %>% filter( observationLevel == "event", n_obs == max(n_obs) ) %>% group_by( deploymentID, month, year ) %>% summarise( first_event_obs = first(eventStart, order_by = eventStart), last_event_obs = last(eventStart, order_by = eventStart), .groups = "drop" ) ``` ### Pieter's solution ```R ### How many observations are there per deploymentID? Show it as a dataframe ### with two columns: deploymentID and n_obs. obs %>% group_by(deploymentID) %>% tally(name = "n_obs") count(obs, deploymentID, name = "n_obs") ### How many observations are there per deploymentID and observationLevel? obs %>% group_by(deploymentID, observationLevel) %>% tally() count(obs, deploymentID, observationLevel) ### Order the previous output per n_obs (descending order). count(obs, deploymentID, observationLevel) %>% group_by(deploymentID) %>% summarise(n_obs = sum(n)) obs %>% group_by(deploymentID, observationLevel) %>% tally(name = "n_obs") %>% arrange(-n_obs) ###Create a summary dataframe with the number of media-based observations ###(observationLevel = "media"), the number of event-based observations ###(observationLevel = "event") and the total number of observations per ###deploymentID. Notice that observationLevel is always filled in. The result ###should be a dataframe with 4 columns: deploymentID, n_obs_event, n_obs_media ###and n_obs. Tip: one way to solve this is by combining dplyr with tidyr. obs %>% mutate(is_media = observationLevel == "media", is_event = observationLevel == "event") %>% group_by(deploymentID) %>% summarise(n_obs_media = sum(is_media), n_obs_event = sum(is_event), n_obs = n()) # For each month, year and deploymentID, return the eventStart of the first and # the last event-based observation (observationLevel = event) if there are 3 or # more event-based observations. Call these two columns first_event_obs and # last_event_obs respectively. obs_first_last <- obs %>% filter(observationLevel == "event") %>% add_count(deploymentID) %>% filter(n >= 3) %>% mutate(year = lubridate::year(eventStart), month = lubridate::month(eventStart)) %>% group_by(deploymentID, year, month) %>% summarise( first_event_obs = head(eventStart, 1), last_event_obs = tail(eventStart, 1) ) obs_first_last # How can you return the same dataframe as in exercice 5 but now limiting us to # the months with the highest number of event-based observations for each year # and deploymentID? obs %>% filter(observationLevel == "event") %>% add_count(deploymentID) %>% filter(n >= 3) %>% mutate(year = lubridate::year(eventStart), month = lubridate::month(eventStart)) %>% group_by(deploymentID, year) %>% filter(n == max(n)) %>% group_by(deploymentID, year, month, .add = FALSE) %>% summarise( first_event_obs = head(eventStart, 1), last_event_obs = tail(eventStart, 1) ) ``` ### Ward's solution ```r # 1. How many observations are there per deploymentID? Show it as a dataframe # with two columns: deploymentID and n_obs. obs %>% count(.data$deploymentID, name = "n_obs") # 2. How many observations are there per deploymentID and observationLevel? count_deploy_obslevel <- obs %>% count(.data$deploymentID, .data$observationLevel, name = "n_obs") count_deploy_obslevel # 3. Order the previous output per n_obs (descending order). count_deploy_obslevel %>% arrange(desc(.data$n_obs)) # 4. Create a summary dataframe with the number of media-based observations # (observationLevel = "media"), the number of event-based observations # (observationLevel = "event") and the total number of observations per # deploymentID. Notice that observationLevel is always filled in. The result # should be a dataframe with 4 columns: deploymentID, n_obs_event, n_obs_media # and n_obs. Tip: one way to solve this is by combining dplyr with tidyr. obs %>% # count number of observations per deployment ID count(.data$deploymentID, .data$observationLevel) %>% # set to wide format pivot_wider(id_cols = "deploymentID", names_from = "observationLevel", values_from = "n", names_prefix = "n_obs_") %>% # sum observations across 2 columns rowwise() %>% mutate(n_obs = sum(c_across(c("n_obs_event", "n_obs_media")))) # 5. For each month, year and deploymentID, return the eventStart of the first # and the last event-based observation (observationLevel = event) if there are # 3 or more event-based observations. Call these two columns first_event_obs and # last_event_obs respectively. obs %>% # get month and year using lubridate mutate(month = month(.data$eventStart), year = year(.data$eventStart)) %>% # only event-based observations filter(.data$observationLevel == "event") %>% # count number of observations per month, year, deploymentID group_by(.data$month, .data$year, .data$deploymentID) %>% mutate(n_obs = n()) %>% ungroup() %>% # only if there are 3 or more observations filter(.data$n_obs >= 3) %>% # get first and last observations per month, year, deploymentID group_by(.data$month, .data$year, .data$deploymentID) %>% summarise(first_event_obs = min(.data$eventStart), last_event_obs = max(.data$eventStart), .groups = "drop") # 6. How can you return the same dataframe as in exercice 5 but now limiting us # to the months with the highest number of event-based observations for each # year and deploymentID? obs %>% # get month and year using lubridate mutate(month = month(.data$eventStart), year = year(.data$eventStart)) %>% # only event-based observations filter(.data$observationLevel == "event") %>% # count number of observations per month, year, deploymentID group_by(.data$month, .data$year, .data$deploymentID) %>% mutate(n_obs = n()) %>% ungroup() %>% # only if there are 3 or more observations filter(.data$n_obs >= 3) %>% # keep only the months with the highest number of event-based observations group_by(.data$year, .data$deploymentID) %>% slice_max(.data$n_obs) %>% # get first and last observations per month, year, deploymentID group_by(.data$month, .data$year, .data$deploymentID) %>% summarise(first_event_obs = min(.data$eventStart), last_event_obs = max(.data$eventStart), .groups = "drop") ``` ### Sanne's solution ```r # 2.1 How many observations are there per deploymentID? Show it as a dataframe # with two columns: deploymentID and n_obs. obs %>% count(deploymentID, name = "n_obs") # 2.2 How many observations are there per deploymentID and observationLevel? obs %>% count(deploymentID, observationLevel) # 2.3 Order the previous output per n_obs (descending order). obs %>% count(deploymentID, observationLevel, name = "n_obs") %>% arrange(desc(n_obs)) # 2.4 Create a summary dataframe with the number of media-based observations # (observationLevel = "media"), the number of event-based observations # (observationLevel = "event") and the total number of observations per # deploymentID. Notice that observationLevel is always filled in. The result # should be a dataframe with 4 columns: deploymentID, n_obs_event, n_obs_media # and n_obs. Tip: one way to solve this is by combining dplyr with tidyr. obs %>% count(deploymentID, observationLevel, name = "n_obs") %>% summarise(n_obs = n()) %>% pivot_wider( names_from = observationLevel, values_from = n_obs, names_glue = "n_obs_{observationLevel}") %>% mutate( n_obs = n_obs_event + n_obs_media ) # 2.5 For each month, year and deploymentID, return the eventStart of the first # and the last event-based observation (observationLevel = event) if there are 3 # or more event-based observations. Call these two columns first_event_obs and # last_event_obs respectively. groupedobs <- obs %>% mutate( month = month(eventStart), year = year(eventStart) ) %>% group_by(month, year, deploymentID) %>% filter(observationLevel == "event" & n() >= 3) firstlast <- groupedobs %>% summarize( first_event_obs = first(eventStart, order_by = eventStart), last_event_obs = last(eventStart, order_by = eventStart) ) %>% ungroup() firstlast # 2.6 How can you return the same dataframe as in exercice 5 but now limiting us # to the months with the highest number of event-based observations for each year # and deploymentID? groupedobs %>% count() %>% group_by(year, deploymentID) %>% filter(n == max(n)) %>% ungroup() %>% left_join(firstlast) %>% select(-n) ``` ### Marieke's solutions ```r chal2_obs4<-obs %>% group_by(deploymentID) %>% summarise(., n_obs_media=n_distinct(observationID[observationLevel=="media"]), n_obs_event=n_distinct(observationID[observationLevel=="event"]), n_obs=n_distinct(observationID)) ``` ### Emma's solution ``` # 5 obs %>% filter(observationLevel == "event") %>% mutate(month = month(eventStart), year = year(eventStart)) %>% mutate(n_obs = n(), .by = c(month, year, deploymentID)) %>% filter(n_obs > 2) %>% summarise(first_event_obs = min(eventStart), last_event_obs = max(eventStart), .by = c(month, year, deploymentID)) %>% arrange(year, month) ``` ## CHALLENGE 3 - Two-table verbs ```r media <- read_csv("./data/20240425/20240425_media.csv", na = "") # Preview glimpse(media) # How to add the media information stored in media to the observations? obs %>% left_join( media, by = join_by(deploymentID, mediaID) ) # How are the columns ordered? Are the columns from observations on the left? Try to put them on the right. media %>% right_join( obs, by = join_by(deploymentID, mediaID) ) # Are there media not in observations, i.e. are there media that are not linked to any observation? media %>% anti_join( obs, by = join_by(deploymentID, mediaID) ) %>% nrow() > 0 # NO # Some observations have a missing value for column mediaID. Get rid of them while joining. obs %>% inner_join( media, by = join_by(deploymentID, mediaID) ) # As deploymentID is present in both dataframes, it gets duplicated and the suffixes ".x" and ".y" are added. How to change the suffixes to "_obs" and "_media" while performing exercise 1? obs %>% inner_join( media, by = join_by(mediaID), suffix = c("_obs", "_media") ) # How to add the suffix only for the column deploymentID in media? obs %>% inner_join( media, by = join_by(mediaID), suffix = c("", "_media") ) # How can you avoid having this column twice? obs %>% inner_join( media, by = join_by(mediaID, deploymentID)) ``` ### Pieter's solution ```r ## How to add the media information stored in media to the observations? left_join(obs, media, by = join_by("mediaID")) %>% glimpse() ## How are the columns ordered? Are the columns from observations on the left? ## Try to put them on the right. right_join(media, obs, by = join_by("mediaID")) %>% glimpse() ## Are there media not in observations, i.e. are there media that are not linked ## to any observation? anti_join(media, obs, by = join_by("mediaID")) ## Some observations have a missing value for column mediaID. Get rid of them ## while joining. inner_join(obs, media, by = join_by("mediaID")) ## As deploymentID is present in both dataframes, it gets duplicated and the ## suffixes ".x" and ".y" are added. How to change the suffixes to "_obs" and ## "_media" while performing exercise 1? left_join(obs, media, by = join_by("mediaID"), suffix = c("_obs", "_media")) %>% glimpse() ## How to add the suffix only for the column deploymentID in media? mutate(media, deploymentID_media = deploymentID, .keep = "unused") %>% right_join(obs, by = join_by("mediaID")) %>% glimpse() ## How can you avoid to have this column twice? obs %>% left_join(select(media, -deploymentID), by = join_by("mediaID")) %>% glimpse()` ``` ### Ward's solution ```r # load data media <- read_csv("./data/20240425/20240425_media.csv", na = "") # preview data glimpse(media) # 1. How to add the media information stored in media to the observations? obs %>% full_join(media, by = join_by("mediaID")) # 2. How are the columns ordered? Are the columns from observations on the left? # Try to put them on the right. media %>% full_join(obs, by = join_by("mediaID")) # 3. Are there media not in observations, i.e. are there media that are not # linked to any observation? media %>% anti_join(obs, by = join_by("mediaID")) # 4. Some observations have a missing value for column mediaID. Get rid of them # while joining. obs %>% right_join(media, by = join_by("mediaID")) # 5. As deploymentID is present in both dataframes, it gets duplicated and the # suffixes ".x" and ".y" are added. How to change the suffixes to "_obs" and # "_media" while performing exercise 1? obs %>% full_join(media, by = join_by("mediaID"), suffix = c("_obs", "_media")) # 6. How to add the suffix only for the column deploymentID in media? obs %>% full_join(media, by = join_by("mediaID"), suffix = c("", "_media")) # 7. How can you avoid having this column twice? obs %>% full_join(media, by = join_by("mediaID", "deploymentID")) ``` ## Bonus 1 Hans ``` #After reading 20240425_ias_plants.txt dataset (code provided), make it tidy. Please notice that two vernacular names of the same language are given for some species, e.g. "gewone gunnera;reuzenrabarber". They should be split as two rows. ias %>% pivot_longer( cols = c(english_name, dutch_name, french_name), names_to = c("language", "drop_me"), names_sep = "_", values_to = "name") %>% select(-drop_me) %>% separate_wider_delim( cols = name, names = c("name1", "name2"), delim = ";", too_few = "align_start" ) %>% pivot_longer( cols = starts_with("name"), names_to = "name_alternative", values_to = "name", values_drop_na = TRUE ) ``` ## Bonus 2 Hans ``` a %>% slice( order(as.numeric(my_favorite_number_string)) ) ``` ## Bonus 3 Hans ``` result <- si_data %>% group_by(year, eea_cell_code) %>% summarise(species_list = list(as.character(speciesKey))) %>% mutate( accumulated_species_list = accumulate( species_list, union, .simplify = FALSE), cumulative_n_species = map_dbl(accumulated_species_list, length) ) %>% ungroup() library(ggplot2) result %>% ggplot() + geom_line(aes(x = year, y = cumulative_n_species, group = eea_cell_code), alpha = 0.3) ```