# Data Management Workshop
### CSU Evolutionary Ecology 2017
# I. Organizing data and intro to R
Before we start, three principles:
1. This is your workshop.
2. This is a safe space, especially for ignorance.
3. Anyone can learn programming if they try.
> Check in: does everyone have a laptop with R and RStudio installed?
---
# Day 1: Spreadsheets and .csv
Today we will talk about:
* Best practices with data in spreadsheets
* How to export a csv file from Excel
* An introduction to RStudio
* How to import a csv file to R
### Overview
Good **data organization** is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very **useful graphical interfaces** for designing data tables and handling very basic data quality control functions, but are not great at reproducibility, statistical analyses, or figure production.
### Best practices
> **Exercise**
>
> Download [this messy .xls file](https://ndownloader.figshare.com/files/2252083). When you open it, you can see that there are two tabs. Two field assistants conducted ecological surveys of desert animal species, one in 2013 and one in 2014, and they both kept track of the data in their own way. Now you're the person in charge of this project and you want to be able to start analysing the data.
>
> In pairs, discuss what was done wrong, and what you might do to fix it. From this, we'll derive some general principles.
#### Add your notes on what went wrong here:
* Organized in very strange ways
* Units
* Formats change on each tab
* Collapse three boxes into one cohesive box using an additional species code column
* Consider organizing into one sheet: condense 2013 and 2014 tabs, along with "dates" tab
* Don't combine variables within a column
* Plot disappears and species replaces the column
### Okay, here are some guiding principles:
1. **Don't treat a spreadsheet like it is a lab notebook** by relying on context (e.g. notes in the margin, visual cues, or spatial layout of data and fields) to convey information.
As humans, we can (usually) interpret these things, but computers are dumb, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fit together. Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal). This is why it’s extremely important to set up **well-formatted tables** from the outset - before you even start entering data from your very first preliminary experiment.
2. **Don't mess with the raw data.** During data clean up or analyses, it is very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when Reviewer #3 asks for a different analysis, **you must**:
- **create a new file or tab with your cleaned or analyzed data.** Do not modify that original dataset, or you will never know where you started!
- **keep track of the steps you took in your clean up or analysis.** You should track these steps as you would any step in an experiment. You can do this in another text file, or a good option is to create a new tab in your spreadsheet with your notes. This way the notes and data stay together.
3. **Columns for variables and rows for observations**. The rule of thumb, when setting up a datasheet, is **columns = variables**, **rows = observations**, **cells = data** (values). All of the related observations that you would like to compare should be stored in the same file (ideally in the same sheet). You can always make a new column for whatever you would have used to separate observations onto different sheets.
4. **Don't combine multiple pieces of information in one cell**. Sometimes it just seems like one thing, but think if that's the only way you'll want to be able to use or sort those data. For example, maybe you'll want to look at 'species' and 'sex' separately when you analyse the desert animal census data. As we discuss below, you might also want separate columns for date and time components (e.g. year, month, day, hour, etc.).
## Problems with Spreadsheets
Spreadsheets are **good for data entry**, but in reality we **tend to use spreadsheet programs for much more** than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures.
Generating **tables for publications** in a spreadsheet is not optimal - often, when formatting a data table for publication, we’re reporting key summary statistics in a way that is **not really meant to be read as data**, and often involves **special formatting** (merging cells, creating borders, making it pretty). We advise you to do this sort of operation within your document editing software.
The latter two applications, **generating statistics and figures**, should be used with caution: because of the graphical, drag and drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else's), particularly if your stats or figures require you to do more complex calculations. Furthermore, in doing calculations in a spreadsheet, it’s easy to accidentally apply a slightly different formula to multiple adjacent cells. When using a command-line based statistics program like R or SAS, it’s practically impossible to accidentally apply a calculation to one observation in your dataset but not another unless you’re doing it on purpose.
## Dates as data
Dates in spreadsheets are often stored in one column. While this seems the most natural way to record dates, it is not always good practice. A spreadsheet application will display the dates in a seemingly correct way (to the human eye), but how it actually handles and stores the dates may be problematic.
Let's try with a simple challenge.
> Challenge: pulling month, day and year out of dates
>
> - In the `dates` tab of your Excel file you have the data from 2014 plot 3. There's a `Date collected` column.
> - Let’s extract the month, day, and year from the dates into new columns. For this we can use the built-in Excel functions
>
```
=MONTH(A3)
=DAY(A3)
=YEAR(A3)
```
> (Make sure the new column is formatted as a number and not as a date.) You can see that even though you wanted the year to be 2014, Excel automatically interpreted it as 2015, the year the data were entered.
### Dates stored as integers
Excel **stores dates as a number**: it counts the days from a default of December 31, 1899, and thus stores July 2, 2014 as the serial number 41822.
(But wait. That’s the default on my version of Excel. We’ll get into how this can introduce problems down the line later in this lesson. )
This serial number thing can actually be useful in some circumstances. Say you had a sampling plan where you needed to sample every thirty seven days. In another cell, you could type:
```
=B2+37
```
And it would return
```
8-Aug
```
because it understands the date as the number `41822`, and `41822 + 37 = 41859`, which Excel interprets as August 8, 2014. It retains the format (for the most part) of the cell that is being operated upon (unless you did some sort of formatting to the cell before, and then all bets are off).
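For comparison, date arithmetic in R is explicit rather than hidden behind display formats. A minimal sketch, using the same dates as the example above:

```r
# R stores Dates internally as days since 1970-01-01,
# but prints and parses them unambiguously
sample_date <- as.Date("2014-07-02")
next_sample <- sample_date + 37           # add 37 days
format(next_sample, "%Y-%m-%d")           # "2014-08-08"
```

The result prints as an ISO date, not a serial number, so there is no format-dependent ambiguity to track.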
Which brings us to the many different ways Excel provides for displaying dates. The more you learn about this, the more you’ll see the MANY ways that ambiguity creeps into your data depending on the format you chose when entering your date data. If you’re not fully cognizant of which format you’re using, you can end up entering your data in a way that Excel will badly misinterpret.
> **Question**
> What will happen if you save the file in Excel (in `csv` format) and then open the file using a plain text editor?
**Note**: You may notice that when exporting into a text-based format (such as CSV), some versions of Excel will export its internal date integer instead of a useful value (that is, the dates will be represented as integer numbers). This can potentially lead to problems, if you use other software to manipulate the file.
### Preferred date format
Instead, we recommend storing dates as an integer in `YYYYMMDD` format so that they can be sorted, incremented, and displayed with no confusion. Another alternative is to store dates with YEAR, MONTH, and DAY in separate columns, or as YEAR and DAY-OF-YEAR in separate columns (depending on what works best for your analyses).
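Either layout is easy to turn back into a proper date once the data reach R. A sketch with made-up values:

```r
# Parse an integer date stored as YYYYMMDD into a Date object
as.Date(as.character(20140702), format = "%Y%m%d")

# Or rebuild a Date from separate year, month, and day columns
as.Date(paste(2014, 7, 2, sep = "-"))
```

Both calls produce the same `Date` value, `2014-07-02`.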
**Note**: Excel is unable to parse dates from before 1899-12-31, and will thus leave these untouched. If you’re mixing historic data from before and after this date, Excel will translate only the post-1900 dates into its internal format, thus resulting in mixed data. If you’re working with historic data, be extremely careful with your dates!
**Note 2**: Excel also entertains a second date system, the 1904 date system, as the default in Excel for Macintosh. This system will assign a different serial number than the [1900 date system](https://support.microsoft.com/kb/180162). Because of this, [dates must be checked for accuracy when exporting data from Excel](http://datapub.cdlib.org/2014/04/10/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off). This is the most common offset of dates, but be aware that other date systems exist.
## Exporting data from spreadsheets
Storing the data you're going to work with for your analyses in Excel default file format (`*.xls` or `*.xlsx` - depending on the Excel version) is a **bad idea**. Why?
- Because it is a **proprietary format**, it is possible that in the future the technology needed to open the file won’t exist, or will become sufficiently rare that opening it is inconvenient, if not impossible.
- **Other spreadsheet software** may not be able to open the files saved in a proprietary Excel format.
- **Different versions of Excel** may be changed so they handle data differently, leading to inconsistencies.
- Finally, as more **journals and grant agencies** require you to deposit your data in a data repository, note that most **don't accept Excel format**. It needs to be in one of the formats discussed here.
As an example, do you remember how we talked about how Excel stores **dates** earlier? Turns out there are **multiple defaults for different versions of the software**. And you can switch between them all willy-nilly. So, say you’re compiling Excel-stored data from multiple sources. There are dates in each file; Excel interprets them as their own internally consistent serial numbers. When you combine the data, Excel will take the serial number from the file you’re importing and interpret it using the rule set for the version of Excel you’re using. Essentially, you could be adding a huge error to your data, and it wouldn’t necessarily be flagged by any data cleaning methods if your ranges overlap.
Storing data in a **universal**, **open**, **static format** will help deal with this problem. Try **tab-delimited** (.tsv or .txt or .tab) or **CSV** (.csv, more common). CSV files are plain text files where the columns are separated by commas, hence 'comma-separated values' or CSV. The advantage of a CSV over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including a simple **text editor**. Data in a CSV can also be **easily imported** into other formats and environments, such as SQLite and R. We're not tied to a certain version of a certain expensive program when we work with CSV, so it's a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV easily, although they complain and make you feel like you’re doing something wrong along the way.
To save a file you have opened in Excel into the `*.csv` format:
1. From the top menu select 'File' and 'Save as'.
2. In the 'Format' field, from the list, select 'Comma Separated Values' (`*.csv`).
3. Double check the file name and the location where you want to save it and hit 'Save'.
An important note for backwards compatibility: you can open CSVs in Excel!
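The reverse trip is available from R as well: once your data are in a data frame, `write.csv()` exports them as plain text. A minimal sketch (`surveys` here is a made-up stand-in data frame, and the `data_output/` folder is created if it doesn't exist):

```r
# A small stand-in data frame (hypothetical values)
surveys <- data.frame(species_id = c("AB", "AS"),
                      weight     = c(21.5, 18.2))
dir.create("data_output", showWarnings = FALSE)
# row.names = FALSE keeps R from writing an extra, unnamed index column
write.csv(surveys, "data_output/surveys.csv", row.names = FALSE)
```

The resulting file can be opened in any text editor, or back in Excel.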
### Commas as part of data values in `*.csv` files
Comma Separated Value files are very useful for easily exchanging and sharing data. However, there can be problems with this particular format if the data values themselves include commas (,). In that case, the software you use (including Excel) will most likely display the data incorrectly in columns, because the commas that are part of the data values will be interpreted as delimiters.
For example, our data could look like this:
```
species_id,genus,species,taxa
AB,Amphispiza,bilineata,Bird
AH,Ammospermophilus,harrisi,Rodent-not,censused
AS,Ammodramus,savannarum,Bird
```
In record `AH,Ammospermophilus,harrisi,Rodent-not,censused` the value for *taxa* includes a comma (`Rodent-not,censused`). Importing this into Excel will split the value for 'taxa' for this record into two columns. This can propagate to a number of further errors. For example, the "extra" column may be interpreted as a column with many missing values (and without a proper header!). In addition to that, the value 'taxa' for the record in row 3 and all those following it will be incorrect.
### Dealing with commas as part of data values in `*.csv` files
If you want to store your data in `*.csv` and expect that your data may contain commas in their values, you can avoid the problem discussed above by putting the values in quotes (""). For example:
```
species_id,genus,species,taxa
"AB","Amphispiza","bilineata","Bird"
"AH","Ammospermophilus","harrisi","Rodent-not censused"
"AS","Ammodramus","savannarum","Bird"
"BA","Baiomys","taylori","Rodent"
"CB","Campylorhynchus","brunneicapillus","Bird"
"CM","Calamospiza","melanocorys","Bird"
"CQ","Callipepla","squamata","Bird"
"CS","Crotalus","scutalatus","Reptile"
"CT","Cnemidophorus","tigris","Reptile"
"CU","Cnemidophorus","uniparens","Reptile"
```
However, if you are working with an already existing dataset in which the data values are not enclosed in quotes but contain commas as both delimiters and parts of values, you are potentially facing a major problem with **data cleaning**.
If the dataset you're dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes) is not only going to take hours and hours but may also lead you to accidentally introduce many errors.
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context, and we are not able to cover these approaches in detail during this workshop.
An excellent reference, in particular with regard to R scripting, is
> Hadley Wickham, *Tidy Data*, Journal of Statistical Software, Vol. 59, Issue 10, Sep 2014. [http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10)
## Importing *.csv files into R
### Quick intro to R and RStudio
- The term "R" is used to refer to both the programming language and the software that interprets the scripts written using it.
- RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
### Why learn R?
**R does not involve lots of pointing and clicking**
The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that's a good thing! If you want to redo your analysis because you collected more data, you don't have to remember which button you clicked in which order to obtain your results, you just have to run your script again.
Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes. Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
**R code is great for reproducibility:** Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
**R is interdisciplinary and extensible:** With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
**R works on data of all shapes and sizes:** The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won't make much difference to you. R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
**R produces high-quality graphics:** The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.
**R has a large community:** Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/).
**Not only is R free, but it is also open-source and cross-platform:** Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
### Knowing your way around RStudio
Let's start by learning about [RStudio](https://www.rstudio.com/), which is an Integrated Development Environment (IDE) for working with R. We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
RStudio is divided into 4 "Panes": the **Source** for your scripts and documents (top-left, in the default layout), the R **Console** (bottom-left), your **Environment/History** (top-right), and your **Files/Plots/Packages/Help/Viewer** (bottom-right). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
### Getting set up
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the **working directory**. All of the scripts within this folder can then use *relative paths* to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.
RStudio provides a helpful set of tools to do this through its "Projects" interface, which not only creates a working directory for you but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Below, we will go through the steps for creating an "R Project" for this tutorial.
* Start RStudio
* Under the `File` menu, click on `New project`, choose `Existing directory` and select the folder already containing your project or `New directory` and create one.
* Click on `Create project`
### Organizing your working directory
Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for **scripts**, **data**, and **documents**.
- **`data/`** Use this folder to store your raw data and intermediate datasets you may create for the need of a particular analysis. For the sake of transparency and [provenance](https://en.wikipedia.org/wiki/Provenance), you should *always* keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept separate from a `data/processed/tree.survey.csv` file generated by the `scripts/01.preprocess.tree_survey.R` script.
- **`documents/`** This would be a place to keep outlines, drafts, and other text.
- **`scripts/`** This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions.
You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory. For this workshop, we will need a `data/` folder to store our raw data, and we will create later a `data_output/` folder when we learn how to export data as CSV files. **For now, use your regular file system to create a `data/` folder in your project directory, then move the .csv file you would like to import into the `data/` folder.**
### Interacting with R
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or *code*, instructions in R because it is a common language that both the computer and we can understand. We call the instructions *commands* and we tell the computer to follow the instructions by *executing* (also called *running*) those commands.
There are two main ways of interacting with R: by using **the console** or by using **script files** (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press `Enter` to execute those commands, but they will be forgotten when you close the session.
**Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script.** This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.
RStudio allows you to execute commands directly from the script editor by using the <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd> shortcut (on Macs, <kbd>`Cmd`</kbd> + <kbd>`Return`</kbd> will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd>. You can find other keyboard shortcuts in this [RStudio cheatsheet about the RStudio IDE](https://github.com/rstudio/cheatsheets/blob/master/source/pdfs/rstudio-IDE-cheatsheet.pdf).
At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the <kbd>`Ctrl`</kbd> + <kbd>`1`</kbd> and <kbd>`Ctrl`</kbd> + <kbd>`2`</kbd> shortcuts to jump between the script and the console panes.
If R is ready to accept commands, the R console shows a `>` prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd>), R will try to execute it, and when ready, will show the results and come back with a new `>` prompt to wait for new commands.
If R is still waiting for you to type more because the command isn't complete yet, the console will show a `+` prompt. This is because you have not 'closed' a parenthesis or quotation, i.e. you don't have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you finished typing your command, click inside the console window and press `Esc`; this will cancel the incomplete command and return you to the `>` prompt.
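For example, pressing `Enter` after an unbalanced call leaves the console waiting on the `+` prompt until the parenthesis is closed (a console transcript, not a script):

```
> round(3.14159
+ )
[1] 3
```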
You are now ready to load the data using the `read.csv()` function in R:
```{r, purl=FALSE}
coolname <- read.csv('data/YOURFILENAMEHERE.csv')
```
Congrats! Your data are in your working environment.
---
# Day 2: R Basics
Today we will talk about:
* R functions (arguments & options)
* R objects and assignments
* R data types and structures
* Seeking help
You may find this [base R cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf) useful.
### Working Directories
Let’s import the quantitative genetics data file into our R environment. To import the file, first we need to tell our computer where the file is.
One way to manage this is through R projects. If you open an existing project file, RStudio will set your working directory (the local directory on our computer containing the files you need) to the **file path** you specified when you created the project.
A file path is the route **from** either the beginning of your computer's file organization structure (specified by a leading forward slash, '/') or the folder you are currently working within, **to** the file or folder you need, with folder (aka directory) names separated by forward slashes. A path starting from the beginning is called an **absolute path**, and one starting from where you are now is called a **relative path**.
If you are not working within a project, you should set your working directory with the function `setwd()`. This is very important: if you forget this step you'll get an error message saying that the file does not exist. As an example, I can use this absolute path to set my working directory to my desktop (note, your path will be different):
```{r, purl=FALSE}
setwd("/Users/brooklebee/Desktop/")
```
From last week, the data file is located in the `data/` folder inside your project's working directory. Load the data into R using `read.csv()`. Note: `read.csv()` has many arguments that you may find useful when importing your own data in the future. You can check those out using the help operator and the function name: `?read.csv`.
```{r, purl=FALSE}
QG <- read.csv("data/526_QG_dataset.csv", header = TRUE)
```
### Functions and their arguments
Functions are "canned scripts" that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R **packages** (more on that later). A function usually gets one or more inputs called **arguments**. Functions often (but not always) return a *value*. A typical example would be the function `head()`. The input (the argument) is an object like a data frame, and the return value (or output) is the first few (by default, six) lines of that object.
```{r, purl=FALSE}
# Function call
head(QG)
```
Arguments can be anything, from numbers to filenames. Exactly what each argument means differs by each function, and must be looked up in the documentation (see 'Seeking help', below). Some functions take arguments which may either be specified by the user, or, if left out, take on a **default** value: these are called **options**. Options are typically used to alter the way the function operates, such as whether it ignores 'bad values', or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default. Let's try using multiple arguments:
```{r, purl=FALSE}
head(QG, n = 10)
```
Here, we've called `head()` with a second argument, `n = 10`. The default is to return the top six lines, but the `n` argument allows us to specify a different number. To learn more about the arguments to `head()`, we can use `args(head)` or look at the help documentation for this function using `?head`.
If you provide the arguments in the exact same order as they are defined you don't have to name them:
```{r, results='show', purl=FALSE}
head(QG, 10)
```
And if you do name the arguments, you can switch their order:
```{r, results='show', purl=FALSE}
head(n = 10, QG)
```
It's good practice to put the non-optional arguments (like the data frame you're examining) first in your function call, and to specify the names of all optional arguments. If you don't, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you're doing.
### Functions useful for data frames
We already saw how the function `head()` can be useful to check the content of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let's try them out!
* Size:
* `dim()` - returns a vector with the number of rows in the first element, and the number of columns as the second element (the **dim**ensions of the object)
* `nrow()` - returns the number of rows
* `ncol()` - returns the number of columns
* Content:
* `head()` - shows the first 6 rows
* `tail()` - shows the last 6 rows
* Names:
* `names()` - returns the column names (synonym of `colnames()` for `data.frame` objects)
* `rownames()` - returns the row names
* Summary:
* `str()` - structure of the object and information about the class, length, and content of each column
* `summary()` - summary statistics for each column
Note: most of these functions are "generic", they can be used on other types of objects besides `data.frame`.
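A quick tour of these functions, sketched on a tiny stand-in data frame (your `QG` data will give longer output; the values here are made up):

```r
df <- data.frame(id   = 1:8,
                 mass = c(21, 34, 39, 41, 27, 30, 36, 25))
dim(df)          # 8 2
nrow(df)         # 8
ncol(df)         # 2
names(df)        # "id"   "mass"
str(df)          # class, length, and a preview of each column
summary(df)      # min, quartiles, mean, and max for each numeric column
head(df, n = 3)  # first 3 rows
```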
### Assignments and Objects
You can get output from R simply by typing math in the console. However, to do useful and interesting things, we need to assign _values_ to _objects_. To create an object, we need to give it a name followed by the assignment operator `<-`, and the value we want to give it:
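```r
x <- 3
```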
`<-` is the assignment operator. It assigns values on the right to objects on the left. So, after executing `x <- 3`, the value of `x` is `3`. The arrow can be read as 3 **goes into** `x`. For historical reasons, you can often **but not always** use `=` for assignments. Because of the [slight](http://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) [differences](https://web.archive.org/web/20130610005305/https://stat.ethz.ch/pipermail/r-help/2009-March/191462.html) in syntax, it is good practice to stick to `<-` for assignments.
In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> will write ` <- ` in a single keystroke.
Objects can be given any name such as `x`, `current_temperature`, or `subject_id`. You want your object names to be clear and not too long. They cannot start with a number (`2x` is not valid, but `x2` is). R is case sensitive (e.g., `weight_kg` is different from `Weight_kg`). There are some names that cannot be used because they are the names of fundamental functions in R (e.g., `if`, `else`, `for`, see [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`). If in doubt, check the help to see if the name is already in use.
It's also best to avoid dots (`.`) within a variable name as in `my.dataset`. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it's best to avoid them. It is also recommended to use nouns for variable names, and verbs for function names. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides are [Google's](https://google.github.io/styleguide/Rguide.xml), [Jean Fan's](http://jef.works/R-style-guide/) and the [tidyverse's](http://style.tidyverse.org/). The tidyverse's is very comprehensive and may seem overwhelming at first. You can install [**`lintr`**](https://github.com/jimhester/lintr) to automatically check for issues in the styling of your code.
When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:
```{r, purl=FALSE}
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg`
weight_kg # and so does typing the name of the object
```
Now that R has `weight_kg` in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
```{r, purl=FALSE}
2.2 * weight_kg
```
We can also change a variable's value by assigning it a new one:
```{r, purl=FALSE}
weight_kg <- 57.5
2.2 * weight_kg
```
This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a new variable, `weight_lb`:
```{r, purl=FALSE}
weight_lb <- 2.2 * weight_kg
```
and then change `weight_kg` to 100.
```{r, purl=FALSE}
weight_kg <- 100
```
What do you think is the current content of the object `weight_lb`? 126.5 or 220?
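We can check by running the whole sequence again:

```r
weight_kg <- 57.5
weight_lb <- 2.2 * weight_kg  # weight_lb is computed from the current value, 57.5
weight_kg <- 100              # re-assigning weight_kg afterwards...
weight_lb                     # ...does not update weight_lb: it is still 126.5
```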
> ### Challenge
>
> What are the values after each statement in the following?
>
> ```{r, purl=FALSE}
> mass <- 47.5 # mass?
> age <- 122 # age?
> mass <- mass * 2.0 # mass?
> age <- age - 20 # age?
> mass_index <- mass/age # mass_index?
> ```
> What do you think the value of mass will be if you run `mass <- mass * 2.0` again?
### Data types and structures
A **vector** is the most common and basic data type in R. A vector is composed by a series of values, which can be numbers or characters. We can assign a series of values to a vector using the `c()` (for **c**ombine) function. For example we can create a vector of animal weights and assign it to a new object `weight_g`:
```{r, purl=FALSE}
weight_g <- c(50, 60, 65, 82)
weight_g
```
A vector can also contain characters:
```{r, purl=FALSE}
animals <- c("pigeon", "mink", "sculpin")
animals
```
The quotes around the animal names are essential here: they tell R that the character string within is just that, a string of characters. Without the quotes R will assume there are objects called `pigeon`, `mink` and `sculpin`. As these objects don't exist in R's memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. `length()` tells you how many elements are in a particular vector:
```{r, purl=FALSE}
length(weight_g)
length(animals)
```
An important feature of a vector is that all of the elements are the same type of data. The function `class()` indicates the class (the **data type**) of an object:
```{r, purl=FALSE}
class(weight_g)
class(animals)
```
The function `str()` provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
```{r, purl=FALSE}
str(weight_g)
str(animals)
```
You can use the `c()` function to add other elements to your vector:
```{r, purl=FALSE}
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
```
In the first line, we take the original vector `weight_g`, add the value `90` to the end of it, and save the result back into `weight_g`. Then we add the value `30` to the beginning, again saving the result back into `weight_g`.
We can do this over and over again to grow a vector or to assemble a dataset. As we program, this can be useful for adding results that we are collecting or calculating.
We just saw 2 of the 6 main **atomic vector** types (or **data types**) that R uses: `"character"` and `"numeric"`. These are the basic building blocks that all R objects are built from. The other 4 are:
* `"logical"` for `TRUE` and `FALSE` (the boolean data type)
* `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R that it's an integer)
* `"complex"` to represent complex numbers with real and imaginary parts (e.g.,
`1 + 4i`) and that's all we're going to say about them
* `"raw"` that we won't discuss further
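You can check the data type of any value with `class()`:

```r
class(TRUE)    # "logical"
class(2L)      # "integer"
class(2)       # "numeric" (without the L, 2 is numeric)
class(1 + 4i)  # "complex"
```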
> ### Challenge
>
> * We’ve seen that atomic vectors can be of type character, numeric, integer, and logical. But what happens if we try to mix these types in a single vector?
>
> * What will happen in each of these examples? (hint: use `class()` to check the data type of your objects):
>
> ```r
> num_char <- c(1, 2, 3, 'a')
> num_logical <- c(1, 2, 3, TRUE)
> char_logical <- c('a', 'b', 'c', TRUE)
> tricky <- c(1, 2, 3, '4')
> ```
>
> * Why do you think it happens?
> * You've probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class _coercion_. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
Vectors are one of the many **data structures** that R uses. Other important ones are lists (`list`), matrices (`matrix`), data frames (`data.frame`), factors (`factor`) and arrays (`array`).
> **A few more notes on data types**:
>
> - You can use logical vectors to subset other vectors (we'll spend more time on subsetting next week).
```{r, purl=FALSE}
# make a new logical vector with TRUE every time you like the animal in animals
isgood <- c(TRUE, TRUE, FALSE)
# then pull out only the values in the vector where isgood is TRUE
animals[isgood == TRUE]
```
> - Even though character vectors are not numeric, R will still compare them, using alphabetical order.
```{r, purl = FALSE}
class(rownames(QG)) # character
# this comparison still works because R coerces 200 to "200" and compares the strings alphabetically
rownames(QG)[rownames(QG) < 200]
# Do you think R will evaluate this statement as TRUE or FALSE?
"four" > "five"
```
### Seeking help
**Use the built-in RStudio help interface to search for more information on R functions**
One of the fastest ways to get help is to use the RStudio help interface, found in the lower right panel of RStudio. If you type a word such as "mean" into the search bar, RStudio will suggest a number of related topics, and the description is then shown in the display window.
**I know the name of the function I want to use, but I'm not sure how to use it**
If you need help with a specific function, let's say `barplot()`, you can type `?barplot`. If you just need to remind yourself of the names of its arguments, you can use `args(barplot)`.
**I want to use a function that does X, there must be a function for it but I don't know which one.**
If you are looking for a function to do a particular task, you can use the `help.search()` function, which is called by the double question mark `??`. However, this only looks through the installed packages for help pages that match your search. If you can't find what you are looking for, you can use the [rdocumentation.org](http://www.rdocumentation.org) website, which searches the help files across all available packages.
Also, a generic Google or internet search "R <task\>" will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.
**I am stuck... I get an error message that I don't understand**
Start by googling the error message. However, this doesn't always work very well, because package developers often rely on the error catching provided by R. You end up with general error messages that might not be very helpful for diagnosing a problem (e.g. "subscript out of bounds"). If the message is very generic, you might also include the name of the function or package you're using in your query.
However, you should check Stack Overflow. Search using the `[r]` tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: [http://stackoverflow.com/questions/tagged/r](http://stackoverflow.com/questions/tagged/r)
**Asking for help**
The key to receiving help from someone is for them to rapidly grasp your problem. You should make it as easy as possible to pinpoint where the issue might be.
Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.
If possible, try to reduce what doesn't work to a simple *reproducible example*. If you can reproduce the problem using a very small data frame instead of your 50,000 rows and 10,000 columns one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question. For instance instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one. For more information on how to write a reproducible example see [this article by Hadley Wickham](http://adv-r.had.co.nz/Reproducibility.html).
To share an object with someone else, if it's relatively small, you can use the function `dput()`. It will output R code that can be used to recreate the exact same object as the one in memory:
``` {r}
dput(head(iris)) # iris is an example data frame that comes with R and head() is a function that returns the first part of the data frame
```
If the object is larger, provide the raw file (i.e., your CSV file) along with your script up to the point of the error (after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file with `saveRDS()`.
The content of this file is not human readable and cannot be posted directly on Stack Overflow; instead, it can be emailed to someone, who can read it with the `readRDS()` command (here it is assumed that the downloaded file is in a `Downloads` folder in the user's home directory): `some_data <- readRDS(file="~/Downloads/iris.rds")`
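As a sketch of the round trip (using a temporary file here rather than a real `Downloads` folder):

```r
path <- tempfile(fileext = ".rds")  # a temporary file, just for this example
saveRDS(head(iris), file = path)    # write the object to disk
some_data <- readRDS(file = path)   # read it back
identical(some_data, head(iris))    # TRUE: the round trip preserves the object exactly
```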
Last, but certainly not least, **always include the output of `sessionInfo()`** as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.
**Where to ask for help?**
* Your friendly colleagues.
* [Stack Overflow](http://stackoverflow.com/questions/tagged/r): if your question hasn't been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on [how to ask a good question](http://stackoverflow.com/help/how-to-ask).
* The [R-help mailing list](https://stat.ethz.ch/mailman/listinfo/r-help): it is read by a lot of people (including most of the R core team), a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast but don't expect that it will come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
* If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package that can be accessed using `packageDescription("name-of-package")`. You may also want to try to email the author of the package directly, or open an issue on the code repository (e.g., GitHub).
* There are also some topic-specific mailing lists (GIS, phylogenetics, etc...), the complete list is [here](http://www.r-project.org/mail.html).
**More resources**
* The [Posting Guide](http://www.r-project.org/posting-guide.html) for the R mailing lists.
* [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) useful guidelines
* [This blog post by Jon Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) has quite comprehensive advice on how to ask programming questions.
* The [reprex](https://cran.rstudio.com/web/packages/reprex/) package is very helpful to create reproducible examples when asking for help. The [rOpenSci community call "How to ask questions so they get answered"](https://ropensci.org/blog/blog/2017/02/17/comm-call-v13), [Github link](https://github.com/ropensci/commcalls/issues/14) and [video recording](https://vimeo.com/208749032) includes a presentation of the reprex package and of its philosophy.
---
# Day 3: More R Basics
Today we will talk about subsetting and extracting data.
To get started, open up your project for this workshop. If the quantitative genetics dataset is not already in your environment, go ahead and load it using `read.csv()`.
### Let's talk R
What questions do you currently have?
* How do you know what package to use to perform a certain function?
* How do you open or view a functions source code?
* How do you find new packages for R?
### Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
```{r, results='show', purl=FALSE}
animals <- c("quogga", "platypus", "wombat", "kookaburra")
animals[2]
animals[c(3, 2)]
```
We can also repeat the indices to create an object with more elements than the original one:
```{r, results='show', purl=FALSE}
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
```
R indices start at 1, because that's what human beings typically do. Other programming languages (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.
### Conditional subsetting
Another common way of subsetting is by using a logical vector. `TRUE` will select the element with the same index, while `FALSE` will not:
```{r, results='show', purl=FALSE}
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
```
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
```{r, results='show', purl=FALSE}
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
```
You can combine multiple tests using `&` (both conditions are true, AND) or `|` (at least one of the conditions is true, OR):
```{r, results='show', purl=FALSE}
weight_g[weight_g < 30 | weight_g > 50]
weight_g[weight_g >= 30 & weight_g == 21] # returns numeric(0): no value can be both at least 30 and equal to 21
```
Here, `<` stands for "less than", `>` for "greater than", `>=` for "greater than or equal to", and `==` for "equal to". The double equal sign `==` is a test for numerical equality between the left and right hand sides, and should not be confused with the single `=` sign, which performs variable assignment (similar to `<-`).
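A few quick examples of these comparison operators on our weights:

```r
weight_g <- c(21, 34, 39, 54, 55)
weight_g == 55  # FALSE FALSE FALSE FALSE TRUE
weight_g >= 39  # FALSE FALSE TRUE TRUE TRUE
weight_g != 21  # FALSE TRUE TRUE TRUE TRUE
```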
A common task is to search for certain values in a vector. One could use the "or" operator `|` to test for equality to multiple values, but this quickly becomes tedious. The `%in%` operator tests, for each element of a vector, whether it is found in a second vector:
```{r, results='show', purl=FALSE}
animals <- c("quogga", "platypus", "wombat", "kookaburra")
animals[animals == "kookaburra" | animals == "quogga"] # returns both kookaburra and quogga
animals %in% c("quogga", "wombat", "kookaburra", "merganser", "zebra")
animals[animals %in% c("quogga", "wombat", "kookaburra", "merganser", "zebra")]
```
### Subsetting data frames
Our quantitative genetics data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the "coordinates" we want: row numbers come first, followed by column numbers. Note, however, that different ways of specifying these coordinates lead to results with different classes.
```{r, purl=FALSE}
QG[1, 1] # first element in the first column of the data frame (as a vector)
QG[1, 6] # first element in the 6th column (as a vector)
QG[, 1] # first column in the data frame (as a vector)
QG[1] # first column in the data frame (as a data.frame)
QG[1:3, 7] # first three elements in the 7th column (as a vector)
QG[3, ] # the 3rd element for all columns (as a data.frame)
head_QG <- QG[1:6, ] # equivalent to head(QG)
```
`:` is an operator that creates numeric vectors of integers in increasing or decreasing order. Try running `1:10` and `10:1` to see the output.
You can also exclude certain parts of a data frame using the "`-`" sign:
```{r, purl=FALSE}
QG[,-1] # The whole data frame, except the first column
QG[-c(1:100),] # Excludes the first 100 rows
```
As well as using numeric values to subset a `data.frame` (or `matrix`), columns can be called by name, using one of the four following notations:
```{r, eval = FALSE, purl=FALSE}
QG["fitness"] # Result is a data.frame
QG[, "fitness"] # Result is a vector
QG[["fitness"]] # Result is a vector
QG$fitness # Result is a vector
```
For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.
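To see the difference in resulting classes, here is the same comparison on the built-in `iris` data frame (the classes don't depend on the particular dataset):

```r
class(iris["Sepal.Length"])    # "data.frame": single brackets keep the data frame class
class(iris[, "Sepal.Length"])  # "numeric": the column extracted as a vector
class(iris[["Sepal.Length"]])  # "numeric"
class(iris$Sepal.Length)       # "numeric"
```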
> ### Challenge
>
> 1. Create a new `data.frame` (`QG_200`) containing only the observations from the first 200 rows of the `QG` dataset.
>
> 2. Notice how `nrow()` gave you the number of rows in a `data.frame`?
>
> * Use that number to pull out just that last row in the QG data frame.
> * Compare that with what you see as the last row using `tail()` to make sure it's meeting expectations.
> * Pull out that last row using `nrow()` instead of the row number.
> * Create a new data frame object (`QG_last`) from that last row.
>
> 3. Use `nrow()` to extract the row that is in the middle of the data frame. Store the content of this row in an object named `QG_middle`.
>
> 4. Combine `nrow()` with the `-` notation to reproduce the behavior of `head(QG)`, keeping just the first through 6th rows of the QG dataset.
### Homework: Practice subsetting
Using the quantitative genetics data set:
* Can you extract only the rows containing observations for RIL family 10?
* Make a new data frame with only the RIL family, treatment, and fitness.
* From this new data frame, extract only the rows where fitness is greater than mean fitness across all observations.
---
# Day 4: Factors, Missing Data, and If Statements
### Factors
When we did `str(QG)` we saw that most of the columns consist of integers or numbers; however, the column `treatment` is of a special class called a `factor`.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as *levels*. By default, R always sorts *levels* in alphabetical order. For instance, if you have a factor with 2 levels:
```{r, purl=TRUE}
sex <- factor(c("male", "female", "female", "male"))
```
R will assign `1` to the level `"female"` and `2` to the level `"male"` (because `f` comes before `m` in the alphabet, even though the first element in this vector is `"male"`). You can check this by using the function `levels()`, and check the number of levels using `nlevels()`:
```{r, purl=FALSE}
levels(sex)
nlevels(sex)
```
Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., "low", "medium", "high"), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the `sex` vector would be:
```{r, results=TRUE, purl=FALSE}
sex # current order
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
```
In R's memory, these factors are represented by integers (here 1 and 2), but are more informative than integers because factors are self-describing: `"female"`, `"male"` is more descriptive than `1`, `2`. Which one is "male"? You wouldn't be able to tell just from the integer data. Factors, on the other hand, have this information built in, which is particularly helpful when there are many levels.
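You can peek at the underlying integers with `as.integer()`, and recover the labels by indexing into `levels()`:

```r
sex <- factor(c("male", "female", "female", "male"))
as.integer(sex)               # 2 1 1 2: "female" is level 1, "male" is level 2
levels(sex)[as.integer(sex)]  # "male" "female" "female" "male": the labels, recovered
```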
### Converting factors
You may have noticed that in our quantitative genetics data frame, several columns that contain qualitative values (RIL, plant_num, and block) are stored as numbers. This is because those columns contain only digits, so R automatically assumed they were numeric on input. To change their data type to factor, you can use `as.factor()`.
```{r, purl=FALSE}
QG$RIL <- as.factor(QG$RIL)
class(QG$RIL)
```
If you need to convert a factor to a character vector, you use `as.character(x)`.
```{r, purl=FALSE}
as.character(sex)
```
Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the `levels()` function. Compare:
```{r, purl=TRUE}
f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f) # wrong! and there is no warning...
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f] # The recommended way.
```
Notice that in the `levels()` approach, three important steps occur:
* We obtain all the factor levels using `levels(f)`
* We convert these levels to numeric values using `as.numeric(levels(f))`
* We then access these numeric values using the underlying integers of the vector `f` inside the square brackets
### Renaming factors
When your data is stored as a factor, you can use the `plot()` function to get a quick glance at the number of observations represented by each factor level. Let's look at the number of observations in each treatment:
```{r, purl=TRUE}
## bar plot of the number of observations in each level of treatment:
plot(QG$treatment)
```
Looks like a balanced design! Let's try renaming the level names.
```{r, results=TRUE, purl=FALSE}
treatment <- QG$treatment # so if we mess up, we don't change the working data frame
head(treatment)
levels(treatment)
levels(treatment) <- c("control","stress")
levels(treatment)
head(treatment)
```
> ### Challenge
>
> * Make `QG$plot` a factor, and rename the levels to "A", "B", "C", "D", "E", "F".
>
> * If you have time, try plotting flowering time by experimental plot.
### Using `stringsAsFactors=FALSE`
By default, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (converted) into the `factor` data type. Depending on what you want to do with the data, you may want to keep these columns as `character`. To do so, `read.csv()` and `read.table()` have an argument called `stringsAsFactors` which can be set to `FALSE`.
In many cases, it's preferable to set `stringsAsFactors = FALSE` when importing your data, and converting as a factor only the columns that require this data type.
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).
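A minimal sketch with a data frame built by hand (the same `stringsAsFactors` argument works in `read.csv()` and `read.table()`):

```r
df <- data.frame(animal = c("mink", "wombat"), stringsAsFactors = FALSE)
class(df$animal)                # "character": the text column is left as-is
df$animal <- factor(df$animal)  # convert only the columns that should be factors
class(df$animal)                # "factor"
```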
### Missing data
As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented as `NA`.
When doing operations on numbers, most functions will return `NA` if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument `na.rm=TRUE` to calculate the result while ignoring the missing values.
```{r, purl=FALSE}
mean(QG$biomass)
max(QG$biomass)
mean(QG$biomass, na.rm = TRUE)
max(QG$biomass, na.rm = TRUE)
```
If your data include missing values, become familiar with the functions `is.na()`, `na.omit()`, and `complete.cases()`. Let's try using them now.
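Here is how they behave on a small toy vector:

```r
heights <- c(2, 4, 4, NA, 6)
is.na(heights)                # FALSE FALSE FALSE TRUE FALSE
heights[!is.na(heights)]      # 2 4 4 6: keep only the non-missing values
sum(complete.cases(heights))  # 4: the number of complete (non-NA) entries
```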
> ### Challenge
>
> 1. In one line, extract the biomass column from the quantitative genetics data frame and use the function `na.omit()` to create a new vector with the NAs removed.
>
> 2. Can you do the same thing with `is.na()`? Hint: consider using the `!` operator.
>
> 3. Use the function `complete.cases()` to create a new quantitative genetics dataset that contains only rows where all traits were measured. What are the dimensions of this new data frame?
>
### If statements
Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.
There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are these constructs:
```
if (condition is true) {
  perform action
}
```

```
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}
```
Say, for example, that we want R to print a message if a variable x has a particular value:
```{r, purl=FALSE}
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
x
```
Note you may not get the same output as your neighbour because you are sampling different random numbers from the same distribution. Let’s set the value of x, and then print more information:
```{r, purl=FALSE}
x <- 7
if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}
## [1] "x is greater than 5"
```
Important: when R evaluates the condition inside `if()` statements, it is looking for a logical element, i.e., TRUE or FALSE. This can cause some headaches for beginners. For example:
```{r, purl=FALSE}
x <- 4 == 3
if (x) {
"4 equals 3"
}
```
As we can see, the message was not printed because `x` is `FALSE`:
```{r, purl=FALSE}
x <- 4 == 3
x
## [1] FALSE
```
> ### Challenge
> Use an `if()` statement to print a suitable message reporting whether there are any observations from RIL 1 in the quantitative genetics dataset. Now do the same for RIL 10. What happens? Why?
Did anyone get a warning message like this?
> Warning in if (QG$RIL == 10) {: the condition has length > 1 and
only the first element will be used
If your condition evaluates to a vector with more than one logical element, the function `if()` will still run, but will only evaluate the condition in the first element. Here you need to make sure your condition is of length 1.
**Tip: `any()` and `all()`**
The `any()` function will return TRUE if at least one TRUE value is found within a vector, otherwise it will return FALSE. This can be used in a similar way to the %in% operator. The function `all()`, as the name suggests, will only return TRUE if all values in the vector are TRUE.
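For example:

```r
v <- c(1, 5, 12)
any(v > 10)  # TRUE: at least one element exceeds 10
all(v > 10)  # FALSE: not every element does
if (any(v > 10)) {
  print("at least one value is greater than 10")  # the condition is now length 1
}
```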
---
# Extra: For loops
If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a `for()` loop will do the job. This is the most flexible of looping operations, so is also the hardest to use correctly. Avoid using `for()` loops unless the order of iteration is important: i.e. the calculation at each iteration depends on the results of previous iterations.
The basic structure of a `for()` loop is:
```
for (iterator in set of values) {
  do a thing
}
```
For example:
```{r, purl=FALSE}
for(i in 1:10){
print(i)
}
```
The `1:10` part creates a vector on the fly; you can iterate over any other vector as well. You can use any name for your iterator: `i` and `j` are common, but `line` or `step` will work just as well.
We can use a `for()` loop nested within another `for()` loop to iterate over two things at once.
```{r, purl=FALSE}
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
print(paste(i,j))
}
}
```
Rather than printing the results, we could write the loop output to a new object.
```{r, purl=FALSE}
output_vector <- c()
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
temp_output <- paste(i, j)
output_vector <- c(output_vector, temp_output)
}
}
output_vector
```
This approach can be useful, but ‘growing your results’ (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.
**Tip: don’t grow your results**
One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It's much better to define an empty results object of the appropriate dimensions beforehand. So if you know the end result will be stored in a matrix like above, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.
A better way is to define your (empty) output object before filling in the values. For this example, it looks more involved, but is still more efficient.
```{r, purl=FALSE}
output_matrix <- matrix(nrow=5, ncol=5)
j_vector <- c('a', 'b', 'c', 'd', 'e')
for(i in 1:5){
for(j in 1:5){
temp_j_value <- j_vector[j]
temp_output <- paste(i, temp_j_value)
output_matrix[i, j] <- temp_output
}
}
output_vector2 <- as.vector(output_matrix)
output_vector2
```
**Tip: While loops**
Sometimes you will find yourself needing to repeat an operation until a certain condition is met. You can do this with a `while()` loop.
```
while (this condition is true) {
  do a thing
}
```
As an example, here’s a while loop that generates random numbers from a uniform distribution (the `runif()` function) between 0 and 1 until it gets one that’s less than 0.1.
```{r, purl=FALSE}
z <- 1
while(z > 0.1){
z <- runif(1)
print(z)
}
```
`while()` loops will not always be appropriate. You have to be particularly careful that you don’t end up in an infinite loop because your condition is never met.
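One way to guard against an infinite loop is to add a counter and cap the number of iterations (a sketch; the cap of 1000 is arbitrary):

```r
z <- 1
tries <- 0
while (z > 0.1 && tries < 1000) {  # stop after 1000 attempts no matter what
  z <- runif(1)
  tries <- tries + 1
}
z  # almost certainly below 0.1, but the loop is guaranteed to stop either way
```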
> ### Challenge
> 1. Compare the objects output_vector and output_vector2. Are they the same? If not, why not? How would you change the last block of code to make output_vector2 the same as output_vector?
>
> 2. Write a script that loops through the quantitative genetics data by RIL and prints out whether the days to flower is smaller or larger than the mean days to flower across all RILs.
>
> 3. Modify the script from #2 to loop over each plot. This time print out whether the days to flower is smaller than 5, between 5 and 10, or greater than 10.
>
> 4. Write a script that loops over each RIL in the dataset, tests whether the treatment is ‘dry’, and graphs fitness against days to flower as a line graph if the mean biomass is higher than 100.
We made it through R basics! Let's start on a [new notes page for dplyr and ggplot2.](https://hackmd.io/BwNg7AjMCsYMYFoIjgUwQFhMATAgnKvomDmNDjhXAIYDMARkA===)
---
Much of this material is modified from [the Data Carpentry lessons in Ecology](http://www.datacarpentry.org/lessons/), which is released under an open license. This page was written by [Brook Moyers](http://www.brookmoyers.com) (with collaboration from learners).