owned this note
owned this note
Published
Linked with GitHub
Data Carpentry Workshop
Apr 16-17 2018
NAS Gulf Research Program
===
Super Welcome! Here we can collaboratively take notes and access links and other resources as the workshop progresses. After the workshop this page will be published and made permanently available to participants. Please feel free to contribute!
> Check in: does everyone have a laptop with R, RStudio, and SQLite installed? You will also need a spreadsheet editor.
> NOTE: the SQLite installation instructions on the workshop website are outdated. You won't need it until tomorrow, but see below for the updated instructions.
[Workshop Website](https://bpteague.github.io/2018-04-16-nas/)
[Code of Conduct](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html)
Before we start, three principles:
1. This is your workshop.
2. This is a safe space, especially for ignorance.
3. Anyone can learn programming if they try.
# Day 1: Organizing data and intro to R
## Data Organization in Spreadsheets
Download spreadsheet [here](https://ndownloader.figshare.com/files/2252083).
Why use spreadsheets?
- Organizing data (**?**)
- Simple calculations
- Plotting (**simple**)
- **Sorting**
- **Data entry**
- Sharing data
- Quality assurance / quality control
- Making tables
Why are spreadsheets frustrating?
- Not flexible
- Customizing plots
- lots of steps
- Calculations & equations
- Too "smart"
- Big data sets
What's wrong with this (truly horrible, awful, no good, very bad) spreadsheet?
- Inconsistent data format
- Inconsistent *date* format
- Inconsistent column names/numbers
- Text and numbers combined - variables
- Missing data -- handled inconsistently
- (what do blank cells mean?)
- Different structures in different sheets
- Lack of units (for some of them)
- Metadata is encoded inconsistently
Let's make this better. Columns:
| date | plot | species | sex | notes |
Think about organizing before you even collect the data. Maybe give the field assistant a template with column letters so that you could organize the data in the way you wanted. Usually you want the variables to columns and the observations to be rows. However, you may want to flip this so that each observation is a column and the variables are rows.
What if someone in 10 or 20 years wanted to use this data?
- do they have excel?
- what is their operating system?
- can they read excel?
Save in .csv (comma separated values)
.csv is a text file that has the organization of something like a spreadsheet but is just text.
Activity to try on your own - opening an Excel spreadsheet with a text editor (Example of text editor: Notepad++)
Once you have data entered you can save it as .tsv, .txt. or .csv files.
Potential issue using .csv - What if you have commas in your cells? The computer will read the comma as a separator so that it wil add a new cell. Example, if you were making a table that had comments from a website and the comments had a comma in them, it would make a new cell when it finds a comma. To get around this you can put the data that contains commas in quotes (either single or double quotes will work)
Note about sharing data with European colleagues: They are more likely to use commas instead of decimals. Using .csv may be difficult, .tsv (tab separated values) may be perferred.
You can also save as a .txt file.
Has anyone ever overwritten data? Of course, everyone has!
How can you make sure that you don't overwrite your data?
- Make copies.
- Dropbox or google drive
- Git/Mercurial
Find ways to have new documents with new names or find another way to organize.
Flaws of GUI (graphical user interface)
* doesn't document what you have done to data
Document what you have done to the data
* digital lab notebooks
* can write paper notes
* .txt document
* read me file
* keep a raw file, a .readme file, and then a working file
* lab notebook
Work with data repositories to understand requirements for data submission prior to data collection
Null values
* never use a 0
* don't use an extreme value like -999
* N/A avoid
* Null - can be a good option
* NA can be a good option
* think about it before you start and be consistent
Variable names
* avoid a white space
Data themselves
* avoid blanks and empty spaces
Dates
* if formatting as a number the date will be presented as the number of days since a specific date, but that specific date varies depending on the operating system
* How to find year of a cell? Use formula =year()
* good idea: save the day, month, and year in separate cells
* another idea: store as an integer using YYYYMMDD (example: 20170130) BUT if you know you are going to sort by something like year or month than using separate cells might be a better solution
## Intro to R
Why did you come to this workshop?
* I know MatLab but I would like to learn more about R because R is open source and my employer is more likely to want me to use it
* I use MatLab for my analysis but my colleague uses R and python, and I would like to know the advantages of using R
* I am a social scientist and I would like to use R to communicate
* I know R already but I am self taught so I may not do everything in the most efficient way
* Learn about R to prepare for other classes
R advantages
* it is good at handling big datasets (one tool, but not the only one) (big data = 1 million or more rows)
* free
* widely available
* commonly used
* graphics and visualizations
* reproducability
Managing Expectations
* You will not walk out of this workshop a R expert
Goal
* A novice is someone who doesn't know what question to ask, the beginner knows what question to ask
* This workshop will get everyone to a beginner level
* We will show you where to go for help
R is text based
R-studio is a wrapper that lets you manage files
We are going to create a work space
Good Reproducible Research Practices
* saving data in non-propreitary formats
* document what you have done (example: readme file)
* consistent folder structure (exmaple: raw data, working, plots, scripts functions)
Use syntax <- to set a variable
Must select RUN to run your command in R studio, do not press enter to run the command
### Challenge 1:
What do the variables `mass`, `age` and `mass_index` contain after you execute each of these commands?
```
mass <- 47.5 # mass?
age <- 122 # age?
mass <- mass * 2.0 # mass?
age <- age - 20 # age?
mass_index <- mass/age # mass_index?
```
You can use functions in R.
You can assign the output of the function to a variable by using the syntax s <- sqrt ()
An argument is the thing inside of the paranthesis in the function. Example sqrt (2.5) - function is sqrt and argument is 2.5
round is a funtion to round a number.
Arguments cacn have names. For example for round you can use the argument digit to specify how many decimal places to round to. Example round (3.14159, digits = 3) would round the number 3.14159 to 3 decimal places
How to find help?
* in R studio you can start typing the function name and it will explain the function
* can press F1 for more help and a description will display in the bottom right panel
* example: to round you can use ceiling function to round up, floor fucntion to round down, truncate function to remove decimals, round to do a simple rounding
* in help you can search by text
### Challenge 2:
What does the `c()` function do?
To determine what c does you can type ?c into the console and it will search for function c
c combines arguments
When you run c you get a vector. A vector is an ordered list of things.
Vectors can contain numbers or Boolean values
Boolean values are true or false
R knows what a Boolean (logical) value is
True and False are evaluated as T and F
You cannot label a variable True or False or T or F
Need to use quotes around text in arguments (example animals <- c("dog", "cat", "mouse", "moose"))
You can determine what kind of data a vector has in it.
You use the class function to determine what type of data is in a vector. Example class(animals) return characters because it is a vector that is text characters; example class(is_large) returns logical because the vector is true or false.
### Challenge 3:
What will happen in each of these examples? (hint: use `class()` to check the data type of your objects):
```
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
```
Logical is a restricted data type. (True or False)
Numerical data are numbers, there are more things we can do with numbers. You can turn logical vectors into numbers (1 and 0).
Can express more things in character or text format, but can do fewer things with these. For example, one + two + three does not return an answer, but 1+2+3 returns 6.
# Other places to get help
* other people!!!
* stack overflow (online community of programmers who ask and answer questions) - Check to see if your question has been asked before!
* how to ask a good question? include the code you are using and what you are trying to do
* R-help - there is a mailing list for R-help. These people expect you to ask a good question.
# Afternoon Session - April 16 2018
What do we remember from this morning?
* characters win over numbers, and numbers win over logic
* can use a ?? to search for things in R Studio
* a single ? to search for things lets you find more specific information
* google your question first!
* ask questions on forums, such as stack over flow
* what is a vector? a list of things - they can be numerical, logical, or characters
* what happens if you try to create a mixed vector? all of things in a vector have to have the same data type
* will R give you a warning if all the data in a vector are not the same type? No! R expects you to know what type the data are.
* the most important collaborator is you 6 months from now
One thing that can help you remember what you have done is properly documenting your code. You can include comments to R code by using a hashtag/pound sign. Anything after the hashtag/pound sign the interpreter ignores.
You can find elements in a vector by using a square bracket.
Can use indexing to get information you want about a vector.
na.rm = TRUE will remove values coded NA
can use a pipe (|) (it may be above the enter key with the forward slash) to indicate an "or" statement
Use an ampersand (&) for and operations
example: weight is greater than 22 OR less than 11 [weight_g > 22 | weight_g < 11]
### Challenge 4
1. Starting with
```
weight_g <- c(10, 15, 20, 25, 30, 35)
```
write three expressions that subset `weight_g` to return
```
[15, 20, 25]
```
2. Starting with
```
lengths <- c(10, 24, NA, 18, NA, 20)
```
compute the median of lengths.
3. What does `weight_g[ c(TRUE, FALSE) ]` do? why?
```
```
use command structure to investigate the structure of the surveys spreadsheet str(surveys)
int = a special kind of number that doesn't have decimals
can use the function class to figure
e.g. class(surveys)
data.frame
data.frame is a data object in R that is very successful
data.frame is a data object where columns are variables and the rows are observations
We can call different functions to learn about the size and shape of the data frame (e.g. the table)
* dim(surveys) gives you the dimensions of the table (e.g. the rows and the columns)
* dim returns integers in a vector
* to get the number of rows using of the dim function dim(surveys)[1]
* there is also a special function nrow that returns the number of rows nrow(surveys)
* ncol returns the number of columns ncol(surveys)
* head will display the first six rows head(surveys)
* tail will show the last six rows tail(surveys)
* can get the names of the columns by using function names(surveys)
* rows can have names but often are identified by row number
* can use rownames(surveys) to return a list of row names
* by default, if you try to run a function on data that returns more than 1000 results it will only display the first 1000 results by default
* how to get the first value in the first row out of surveys? surveys[1,1]
* to get the number out of the 4th row and 12th column surveys[4,12]
* to get the entirity of the 4th row surveys[4,]
* if you don't specify the dimension you want to index you will get all the data for that dimension
* what if you want row 4 columns 1-4? use colon to indicate from 1 through 4, example: surveys[4, 1:4]
* you could also use the c function to obtain row 4 columns 1,2, and 4 example surveys[4, c(1,2,4)]
* you can also refer to columns by their name! (thank goodness!)
* example, taxa column surveys["taxa"] returns the column name taxa
* surveys [, taxa] - would return all the rows for the taxa column
* can also use two brackets in the syntax surveys[["taxa"]] will also return all the rows for the column
* can also use the dollar sign to get back all the rows for a single column example surveys$taxa
* even though all of these above syntax return the same results, they are asking R to do different things
### Challenge
1. Create a data.frame (surveys_200) containing only the observations from row 200 of the surveys dataset.
2. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
3. You can also get a column out of a data frame via its name. Here are four ways:
```
surveys["species_id"]
surveys[, "species_id"]
surveys[["species_id"]]
surveys$species_id
```
Do these four commands all return the same thing?
Post-challenge wrap up!
How do I get the 200 row from surveys?
surveys[200, ]
How do I save the 200 row from surveys as a new object called surveys_200 data.frame?
surveys_200 <- surveys[200,]
How did you do the nrow() to figure out number 2 above in multiple rows?
nrow(surveys)/2
surveys[17393,]
How did you do the nrow() to figure out number 2 above in one command?
surveys_middle <- surveys[nrows(surveys) / 2, ]
How is R evaluating the order of operations in the command above?
(2 / (3+6))
TAKE AWAY - inside out composing of functions
TAKE AWAY - having two different commands might make it easier to read your functions
In number three above, do they all return the same thing?
No - surveys["species_id"] returns a data.frame. The rest return a factor. What's a factor? Stay tuned! we will find out later on!
What is a factor?
Some data is obviously numeric - e.g. record_id, month, day, year
But species ID is not numeric. It appears to be an abbreviation. Example: NL
What is the difference between numeric and categorical data?
* With numbers you can do math!
* With numbers there are usually an infinite number
* Categories may give you context
* Categories often have a limited number of options
What do you do with categories vs numbers?
* ANOVA would be something you do on categories
* generally categories can be counted (e.g. how many categories are in this dataset)
* categories can be assigned attributes (generally)
R represents categories as "factors"
If we use the function `surveys$species_id` you get the first 1000 results plus it says there are 48 levels.
So each species ID is a level
if you use the function `levels(surveys$species_id)` you can get a list of all the species_id
if you use the function nlevels you get a count of the number of levels for a factor in the dataset `nlevels(survey$species_id)`
Note that levels are ordered
`surveys$sex`
`plot(surveys$sex)`
It plots three levels of the thing
`levels(surveys$sex)` will return the levels of the category sex (blanks, F, and M)
This is where metadata is useful; blanks are juveniles that you could not sex.
R was specifically built to deal with categorical data. In other programming languages this is not the case. When you put in categories they are assigned numbers.
To make the data less ambigious. Let's encode the juveniles.
```
levels(survey$sex) <- c("Juvenile", "F", "M")
plot(survey$sex)
```
Order of categories is fixed when you put then in R in alphabetical order.
How to reorder the categories?
If I want to reorder the category I create a factor using the function name factor and create a new factor `sex <- factor(surveys$sex, levels = c("M", "F", "Juvenile"))`
Sometimes you don't want to store information, like notes, as categories.
There are a number of functions that allow you to read data in a tabular format.
stringAsFactors is a function you can use to let R know if the data should be read in as a factor (category) or as a text string.
example
```
surveys <- read.csv("data/portal_data_joined.csv",
stringAsFactors = FALSE)
```
can use the str function to seee if these are strings run `str(surveys)`
How to change something back to a factor?
USe the factor command! `as.factor(surveys$genus)`
Will convert the factor column as a factor.
How to overwrite the old column with the new column?
```
surveys$genus <- factor(surveys$genus)
```
To confirm it did what you wanted you can read the table by entering surveys$genus
How to convert data into a character type use function as.character
```
as.character(surveys$genus)
```
This would convert the factor back into text string.
Can use as.numeric to convert to numbers
Should I use .numeric to encode a year?
There may be reasons to convert year to a number and reasons not to. For example, could use factor to convert years to categories.
Another example would be plot id - it is a number, but it doesn't mean anything. It was arbitrarily assigned. Can convert to a factor using as.factor function.
To convert year to a function
```
years <- as.factor(surveys$year)
years
as.numeric(years)
as.character(years)
```
Categories are not encoded as text. Under the hood it is a list of indices.
Can use as.characters to transform back into years.
Can then use `as.numeric(as.characters(years))` to convert back into year numbers.
### Challenge 6
We have seen how data frames are created when using `read.csv()`, but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
```
animal_data <- data.frame(
animal = c(dog, cat, sea cucumber, sea urchin),
feel = c("furry", "squishy", "spiny"),
weight = c(45, 8 1.1, 0.8)
)
```
Can you predict the class for each of the columns in the following example? How can you check your guesses?
Challenge 6 Wrap Up!
What is a data frame? It is columns and rows of data!
What contraints do I have on the rows and columns? Without which a data frame is not a data frame?
The columns all have to be the same length
Every entry in a column has the same data type.
If I am using c to combine I get that for free.
If those are my contraints, what was wrong with the line animal= above? There were no quotation marks around the animal names.
What is wrong with the line that starts with feel = , you need to add a feel for the cat - there have to be an equal number of feels for animals. For the lie weight = you need to add a comma after 8.
To find the class of feel?
`class(animal_data$feel)`
To find all the classes
`str(animal_data)`
To set animal as a character instead of factor us as.character
```
animal_data$animal <- as.character(animal_data$animal)`
```
We are going to load a library (also known as a package) called lubridate to help us handle dates!
```
`library(lubridate)`
```
That only works if you have it installed. To install a package, you can use `install.packages()`, e.g. `install.packages('lubridate').`
Packages are sets of scripts you can use, which often share a goal or function. For example, `lubridate` is a package that makes 'dealing with dates a little easier'. Some packages come automatically installed with R, and some automatically load when you open R (e.g. `base`, `stats`). You can customize which packages automatically load.
Anyone can create a package, even you!
# Day 2: Plotting and analyzing data with R; SQL databases
You will need to install these packages.
```
install.packages(c("tidyverse", "RSQLite"))
```
## UPDATED software installation instructions
- Download and install SQLiteBrowser from here: http://sqlitebrowser.org/
- Download the data set:
- Browse to https://figshare.com/articles/Portal_Project_Teaching_Database/1314459
- Click "Download All"
- Unzip the .ZIP file somewhere that you'll be able to find it.
To clear out your environment
rm(x)
To repopulate your environment - You can open the script file you saved yesterday and you could rerun that script and get all the environment back.
Best Practice: Always work from a script and create intermediate objects along the way
R scripts have the extensions .R
get(wd) is a function to find out where you are (It means get(workdirectory)) and will give you the path of the directory you are working in.
To read in data, use the function read.csv. You need to name the object that the data is being read in to. In the example below we are getting (<-) the file portal_data_joined.csv and we are creating the object called surveys.
Example: `surveys <- read.csv(/data/portal_data_joined.csv)`
## Data manipulation with `dplyr`
> But first! Open a new R script and save it. In this script, use `read.csv()` to load the surveys data into your environment.
>
To load dplyr use function `library(dplyr)`
After you load a library you might get a messaging about masking. This means that an old function that was controlled by an old library (or default) will now be using the function of the newly loaded library.
How to get row 5 of the surveys?
```
surveys[5,]
```
How to get the column plot_id?
```
surveys["plot_id"]
```
To use the dplyr grammar we will use the select function. Select gives you a column(s) by name.
```
select(surveys, plot_id)
```
In dplyr you don't need to put quotes around the variable name in the dplyr function.
How to select multiple columns?
```
select(surveys, plot_id, species_id, weight)
```
How to select a row? use the filter function to find a row that meets the requirements of a certain condition!
For example to find all the rows that have the year 1995:
```
filter(surveys, year==1995)
```
To find all the surveys where year is 1995 and the sex is female:
```
filter(surveys, year ==1995, sex =="F")
```
What if you want to look for columns and rows at the same time?
Can do this in two steps-
```
surveys_plot_id <- select(surveys, plot_id, year)
filter(surveys_plot_id, year==1995
```
Can also use nested functions:
`select(filter(surveys, year ==1995), plot_id)`
Can also use pipes:
A pipe %>% flows data from one thing (one function) to the next thing (another function). So the output from one function becomes the input for the next function.
Use the short cut control-shift-m
Best practice: start a new line after a pipe
To get a list of species id, the sex and weight of all the animals with weight less than 5:
```
surveys %>%
filter(weight < 5)
select(species_id, sex, weight)
```
If we want to save this output in a new data.frame you can create a new object.
```
surveys_light <- surveys %>%
filter(weight < 5)
select(species_id, sex, weight)
```
When there is a get function R evaluates everything on the right of the get first.
> ### Challenge 1
> Using pipes, subset the `survey` data to include individuals collected before 1995 and retain only the columns `year`, `sex`, and `weight`.
>
```
survey_prior1995 <-surveys %>%
filter (year < 1995) %>%
select(year, sex, weight)
```
You may want to create a new column in your data based on values in existing columns. In Excel this is where you would enter a formula and drag and drop it.
In dyplr you use the function mutate.
For example, maybe you want to change the weight from grams to kilograms and store it in a column called weight_kg
```
surveys %>%
mutate(weight_kg = weight / 1000)
```
What if I want to make two new columns at the same time? For example I want to know the weight in kg squared.
```
surveys %>%
mutate(weight_kg = weight / 1000,
weight_kg2 = weight_kg ^ 2)
```
What if I want to remove observations that have NA in the weight column? The grammar for not is a exclamation point !
```
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000,
weight_kg2 = weight_kg ^ 2)
```
> ### Challenge 2
> Create a new data frame from the `surveys` data that meets the following criteria: contains only the `species_id` column and a new column called `hindfoot_half` containing values that are half the `hindfoot_length` values. In this `hindfoot_half` column, there are no `NA`s and all values are less than 30.
>
> **Hint**: think about how the commands should be ordered to produce this data frame!
>
```
surveys_hindfoothalf <- surveys %>%
mutate(hindfoot_half = hindfoot_length / 2) %>%
filter(!is.na(hindfoot_half), hindfoot_half <30)
select(species_id, hindfoot_half)
```
Review Dplyr Functions
filter = choose rows
select = choose columns
mutate = creates new column using existing columns
group_by = split
summarise = apply and combine (example: count)
arrange = order table by variable
A common analysis for this type of data would be to want to find the mean weight for each species for each year. Dplyr allows you to do this easily with a few lines of code! This is often called a **split apply combine**.
Split Apply Combine workflow
split data - means to organize it into groups. Use the group_by() function.
Example: get the mean weight by sex
```
surveys %>%
group_by(sex) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
```
By running this command you get a table that is called a Tibble!
What if we want to use more than one group variable? for example grouping by both species_id and sex
```
surveys %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
```
This command gets you a lot of results that include NaN which means not a number. We need to remove NAs earlier.
```
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight))
```
By default a Tibble will only print 10 rows, but we can use hte function print (n = #) to tell R how many rows to print.
```
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight)) %>%
print (n = 20)
```
What if we wanted to summarize multiple variables? and we wanted minimum weight as well as mean weight.
```
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight)) %>%
print (n = 20)
```
What if I want to sort by the minimum weight? I want to use arrange function.
```
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight) %>%
print (n = 20)
```
To arrrange in descending order
```
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(min_weight)) %>%
print (n = 20)
```
One more function is useful. Sometimes you want to know the sample size of each level of a factor. Use the count function.
```
surveys %>%
count(sex)
```
> ### Challenge 3
> 1. How many individuals were caught in each plot_type surveyed?
```
surveys %>%
count(plot_type)
```
> 2. Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id). Also add the number of observations (hint: see ?n).
```
surveys %>%
filter(!is.na(hindfoot_length))
group_by(species_id) %>%
summarize(mean = mean(hindfoot_length),
min = min(hindfoot_length),
max = max(hindfoot_length))
```
> 3. What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight.
Do this activity on your own! The answer will be shared in the syllabus.
what is tidyr
To this point we have discussed tables as columns of variables and rows of observations, this is called wide data in R.
BUT there is another way the data can be represented that is called long in R:
| var | obs | value|
| -------- | -------- | -------- |
| variable1 | observation1 | value|
| variable2 | observation1 | value |
There are two functions in tidyr that we can use to change the representation of the data. spread and gather.
spread() makes long data wide
To use spread() we need:
* the data
* the key column: the variable whose value will become the new column names
* the value column: the variable whose values will fill the new column variables
We are going to use dplyr to make a table that we are then going to make wide.
1st - Make a table of the mean weight of each speces in each plot using filter, group_by, and summarize and save it
```
surveys_mw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, plot_id) %>%
summarize(mean_weight = mean(weight))
```
This returns a table that could be considered a long data table.
Now we are going to make the table wide.
We would like each species to have a column for weight by plot.
```
surveys_mw_wide <- surveys_mw %>%
spread(key = species_id, value = mean_weight)
```
We can use the function gather() to make the wide table long.
The gather() function needs 4 things
* the data
* the key column variable we wish to create from column names
* the values column variable we wish to create from cells
* the names of the columns that we use to fill the key variable or to drop
To drop a variable we can use a minus sign.
```
surveys_mw_wide %>%
gather(key = species, value = mean_weight, -plot_id)
````
> ### Challenge 4
> 1. Spread the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.
> I like [this figure for visualizing spread](http://www.datacarpentry.org/R-ecology-lesson/img/spread_data_R.png).
>
```
tblnew <- surveys %>%
group_by(year, plot_id) %>%
summarize(NDG = n_distinct(genus)) %>%
spread(key=year, value = ndg)
```
> 2. Now take that data frame and gather() it again, so each row is a unique plot_id by year combination.
> I like [this figure for visualizing gather](http://www.datacarpentry.org/R-ecology-lesson/img/gather_data_R.png).
> ### We briefly reviewed these first two in class. Q3 & Q4 are challenges for later, with answers to be shared by tomorrow at the latest
```
tblnew %>%
gather(key=years, value = ngenera, -plot_id)
```
We use read.csv() to import table and there's also read_csv()
To export we can use write.csv or write_csv()
> 3. The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use gather() to create a dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being gathered.
> 4. With this new data set, calculate the average of each measurement in each year for each different plot_type. Then spread() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the key and value columns for spread().
# LUNCH!
Afternoon
%in% can be used as an equal or statement but across tables.
Example: we want to find species that match species_id in species_count table from the surveys_complete table.
```
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
dim(surveys_complete)
```
write.csv can be used to export the data. You need ot provide the data that you want to export and where you want it to go.
Example:` write.csv(surveys_complete, "data_output/surveys_complete_abundant.csv")`
visualize
`library(ggplot2)`
ggplot wants something to map with. It wants aesthetics or aes(). This is where we select variables and specify how to plot them.
Example: plotting by species length in a scatter plot
`ggplot(data = surveys_complete, aes(x=weight, y=hindfoot_length))`
ggplot also wants a geometry or geom. It wants to know how to graphically represent the data (e.g. lilnes, points, bars)
```
ggplot(data = surveys_complete, aes(x=weight, y=hindfoot_length)) +
geom_point()
```
can also create an object for a plot
```
firstscatter <- `ggplot(data = surveys_complete, aes(x=weight, y=hindfoot_length)) +
```
We can change the plot. We can change the color of the points, the size of the points, and the transparency of the points for example.
To change something based on the data values you must do that within the aes call.
For example, to change color based on species_id
```
firstscatter + geom_point(size = 0.5, alpha = 0.5, aes(color = species_id))
```
To make a box plot of the weight by species id use the boxplot geometry
```
ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot()
```
You can use geom_jitter to look at spread or distribution of the data.
```
ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot()+
geom_jitter(alpha = 0.25, color = "violet")
```
To change what is presented on top, you can change the order
```
ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.25, color = "violet") +
geom_boxplot()
```
There is a reference website for ggplot2. ggplot2.tidyverse.org/reference/
You can change the scale of your axes, you can change the labels, etc.
Faceting - splitting data up into groups. For example, you may want to make a different plot for each species.
facet_wrap() will fit the groups into approximate grids
example
```
facet_wrap( ~ species_id)
```
facet_grid forces an actual grid
To save a plot to file use ggsave()
example:
```
ggsave("fig_output/toomanyfacets.pdf", plot=tooManyFACETS, device = "pdf", units = "in", width = 10, height = 7)
```
**Helpful Resources**
Tabular data are everywhere!
There are a significant number of packages specifically for R to do specific things.
**The Comprehensive R Archive Network**
https://cran.r-project.org/
- you can find packages here for specific things
- if you use the Task Views on the left menu, you get a list of things you might want to use R for (example: Machine Learning)
- there are Vignettes for some of the packages that show you how to use the package and evaluate if it will be useful for what you want to do
Reference websites for specific packages such as ggplot2.tidyverse.org
The company that makes R studio makes Cheat Sheets! These are brilliant
Example: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
All Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/
# **SQL**
Why use SQL?
- large data sets (example: every tweet ever twittered)
- relational data
- long-term data
There are different SQL database programs - examples: Oracle, MySQL, Microsoft SQL. Each one will be slightly different.
When you define a new table you have to tell SQL what type of data is in each column. This will tell the program what you can do with the data. For example setting the data to an integer vs text.
SQL is not case sensitive. Usually programmers write SQL commands in all caps, but this is not required but is a best practice.
The asterisk is a wild card .
You can also do calculations with SQL.
Have to end your query with a semi-colon.
You can also use a limited number of functions that we can use to manipulate data.
The ROUND function can be used to ROUND in SQL.
Where to find helpful information?
the internet! example: https://www.w3schools.com/sql/func_sqlserver_round.asp
To comment out text in SQL use a double dash --
use SELECT to get the columns out of a table
use FROM to indicate the table
example
`SELECT species_id, weight FROM surveys;`
will return species_id and weight columns from the survey table
use WHERE to specify contraints of what data rows you want backc
example
```
SELECT species_id, weight
FROM surveys
WHERE species_id = 'DM'
```
In the WHERE clause you can use the syntax
= equals
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not
AND
OR
example:
```
SELECT species_id, weight
FROM surveys
WHERE species_id = 'DM' AND year = 2000;
```
To find day, month, year, species_id, weight for individuals caught on plot #1 that weight more than 75 g
```
SELECT day, month, year, species_id, weight
FROM surveys
WHERE plot_id = 1 AND weight > 75;
```
Note that Day, Month, and Year can be functions in SQL, but in the case above they are column names.
To get results that meet the conditions in a list. For examples to get rows with species_id DM, DS and DO
```
SELECT *
FROM surveys
WHERE species_id IN ('DM', 'DS', 'DO');
```
You can also order the results in SQL using the function ORDER BY
and can be explicit in if you want it to be ascending or descending
Example below will return results by year descending.
```
SELECT *
FROM surveys
WHERE species_id IN ('DM', 'DS', 'DO')
ORDER BY year DESC;
```
Example: year, species_id, weight from the surveys table sorted with the largest weights at the top
```
SELECT year, species_id, weight
FROM surveys
ORDER BY weight DESC;
```
There are a limited number of statistical operations you can do in SQL. It is a structured query language not a statistical langauge.
COUNT can be used to COUNT
for example:
```
SELECT COUNT(plot_id)
FROM surveys;
```
will return the number of entries in the table that have a plot_id
To return the number of distinct plot_ids use DISTINCT function with COUNT
```
SELECT COUNT(DISTINCT(plot_id))
FROM surveys;
````
Can use SUM to find the SUM of a variable
can use MAX to find the Maximum for a variable
Can use MIN to find the Minimum for a variable
Can use AVG to find the average for a variable
examples
SELECT SUM(weight)
SELECT MAX(weight)
SELECT MIN(weight)
SELECT AVG(weight)
### Challenge
-- total weight, avg weight, min and max weights
-- all animals caught in year 2000
-- ..... and weights between 5 and 10 grams (inclusive)
```
SELECT SUM(weight), MIN(weight), MAX(weight), AVG(weight)
FROM surveys
WHERE year = 2000 AND weight >= 5 AND weight<= 10;
```
can use GROUP BY to make a group.
Order of clauses in SQL matters.
SELECT must come first, then FROM, then WHERE, then GROUP BY for example.
### Challenge
Write queries that return:
- How many individuals were counted in each year
- in total
- per species
```
SELECT year, COUNT(species_id)
FROM surveys
GROUP BY year;
```
```
SELECT year, species_id, COUNT(species_id)
FROM surveys
GROUP BY year, species id;
```
- The average weight of each species in each year
```
SELECT year, species_id, AVG(weight)
FROM surveys
GROUP BY year, species_id
```
In SQL missing data is encoded as NULL
In R we used is.na to remove NA
To select NULL values the syntax is IS NULL
```
SELECT species_id
FROM surveys
WHERE sex IS NULL;
```
Additional information about NULL in SQL
How to join tables in a relational database? For example two tables have the variable species_id, one table has the taxon information and the other table has the information about field observation about those species.
Use JOIN and ON clauses to join tables.
EXample
```
SELECT *
FROM surveys
JOIN species ON surveys.species_id =species.species_id;
```
Where surveys.species_id = species.species_id are a table called surveys and the column species id equals the same things as the column called species id in the species table
Alternative syntax when the field is named the same in both tables
```
SELECT surveys.plot_id, species.genus
FROM surveys
JOIN species USING(species_id);
```
### Challenge
- Write a query that returns the genus, species and weight
of every individual captured at the site.
### Challenge
What is the average hind-food length for the male and
female rodent of each species? Is there a male/female difference?
How many specimens of each species were caught in each type of plot?
What is the average weight of each rodent species over each year
in the study? Is there a noticable trend for any species?
```
SELECT surveys.year, species.genus, species.species, AVG(weight)
FROM surveys
JOIN species USING (species_id)
GROUP BY surveys.species_id, surveys.year;
```
To remove NULL values
```
SELECT surveys.year, species.genus, species.species, AVG(weight)
FROM surveys
JOIN species USING (species_id)
WHERE weight IS NOT NULL
GROUP BY surveys.species_id, surveys.year;
```
Can export the table from SQL into csv and then load it into R and use ggplot
OR
You can connect R to SQL
Find out more online at: http://www.datacarpentry.org/R-ecology-lesson/05-r-and-databases.html