# Data management workshop (BSURE)

---

Three things to remember:

* You're all adults, this is your workshop
* But also, this is a safe space for ignorance and diversity
* Anyone can learn this stuff, and will be great at it if they try

## July 25: Spreadsheets

## 1. Good Data Entry Practices

### Theory/General Background

Alternatives: Excel, Google Sheets, Gnumeric, LibreOffice, OpenOffice

How have we used spreadsheets (mostly Excel, some Google Docs):

- graphing
- data management/accumulation
- pivot tables
- statistical tests (formulas or add-ons)

We aren't going to focus on spreadsheets here because:

- it's hard to track actions/"procedure": lots of steps, possible approaches
- they give direct access to raw data, so it's possible to inadvertently change or lose data
- they can be fairly time-intensive and memorization-heavy
- they *are* good for data entry and direct organization

### Learning Objectives

* Describe best practices for data entry and formatting in spreadsheets.
* Apply best practices to arrange variables and observations in a spreadsheet.

Issues with spreadsheets:

- inconsistent organization (e.g. by species, plot)
- don't put letters in numeric data cells: it breaks calculations, formulas, and graphs
- computers don't have "context"! They can't understand things the way people can
- inconsistent date formats (and a missing year in some cases)
- variable locations differ between data sets, so plotting options have to be redefined for each one
- empty cells are harder to interpret, and different programs handle them differently
- keep units in headers/titles, rather than mixing them with the data
- don't combine independent variables in one column

* Think about how the data is going to end up: are you the only one using it? Will other people have to read the data? What program(s) will you want to analyse the data with?
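Although we won't meet R until next week, these guidelines can be made concrete with a minimal sketch of the recommended layout as a small R data frame. The column names and values below are invented for illustration:

```r
# Hypothetical survey data following the guidelines above:
# variables as columns, observations as rows, units kept in the
# header (weight_g), and NA (not 0 or a blank) for missing data.
surveys <- data.frame(
  plot_id  = c(1, 1, 2),
  species  = c("DM", "DO", "DM"),
  weight_g = c(40, 52, NA),
  year     = c(2013, 2013, 2014)
)
surveys
```

Each row is one observation; the missing weight is recorded as `NA`, which analysis software knows how to handle (e.g. `mean(surveys$weight_g, na.rm = TRUE)`), whereas a 0 would silently distort the calculation.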
*General notes*

- It's a mistake to approach a spreadsheet like a lab notebook: visual cues and verbal descriptions won't give the computer information; you need to respect differences between data types (point, integer, text, etc.)
- Anticipate potential uses of the data; try to set up a good framework to minimize the work you'll need to do down the road
- Directory: a set of folders, so e.g. a folder for raw data, a folder for reformatted data, a folder for statistical analysis, one for reports, etc.

**Recommended way to organize a spreadsheet**

* variables as columns
* observations as rows
* keep separate variables in different columns
* for missing data, use e.g. "NA", "na", "-", "*", etc. rather than a blank space
* avoid using 0 for 'no data', as it has a numeric value

> **Note:** the best layouts/formats (as well as software and
> interfaces) for **data entry** and **data analysis** might be
> different. It is important to take this into account, and ideally
> automate the conversion from one to another.

### Raw data

When you're working with spreadsheets, during data clean up or analyses, it's very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses, or to figure out what you did when Reviewer #3 asks for a different analysis, **you must**:

- **create a new file or tab with your cleaned or analyzed data.** Do not modify the original dataset, or you will never know where you started!
- **keep track of the steps you took in your clean up or analysis.** You should track these steps as you would any step in an experiment. You can do this in another text file, or a good option is to create a new tab in your spreadsheet with your notes. This way the notes and data stay together.

## Exercise

We're going to take that messy version of the survey data and clean it up.

- If you don't already have it, download the data by clicking [here](https://ndownloader.figshare.com/files/2252083) to get it from FigShare.
- Open up the data in a spreadsheet program. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way. Now you're the person in charge of this project and you want to be able to start doing statistics with the data.
- With the person next to you, work on the messy data so that a computer will be able to understand it. Clean up the 2013 and 2014 tabs, and put them all together in one spreadsheet.

> **Important** Do not forget our first piece of advice, to
> **create a new file (or tab)** for the cleaned data, **never
> modify the original (raw) data**.
>
> **Also**, did you keep track of what you did? How?

After you go through this exercise, we'll discuss as a group what you think was wrong with this data and how you fixed it.

## 2. Dates as Data

### Learning Objectives

* Describe how dates are stored and formatted in spreadsheets.
* Describe the advantages of alternative date formatting in spreadsheets.
* Demonstrate best practices for entering dates in spreadsheets.

>* In Excel, the visual output of date data doesn't match its internal storage
>* Dates are stored as an integer: the number of days since an arbitrary "start" date
>>* The start date depends on the program; it might be e.g. Jan 1 1904 or Jan 1 1900; Google Sheets uses 12/30/1899
>>* The issue is that different programs will report the same stored integer as different dates, depending on their "start" value
>>* Also, entering just a month/day will lead the computer to assume the year is the current year
>>* However, since the date is stored as an integer, you can carry out math operations with date cells in Excel (e.g. `= __ + 40`)
>* Alternative approaches:
>>* Separate out e.g. month/day/year into different columns
>>* Enter the date as a consistent integer pattern (e.g. "20160912" for 09/12/16)
>>>* Easy for you to read, and there are ways for programs to interpret this
>>>* **Excel**: automated method: convert to "*yyyymmdd*" as a custom cell format, and then copy/paste as number only; you could write a script if you know how to do this
>* Time can be added as a variable in either format

## 3. Exporting data from spreadsheets

### Learning Objectives

* Store spreadsheet data in universal file formats.
* Export data from a spreadsheet to a .csv file.

>* In general, Excel and the like are best for data entry and proximal organization
>* Leaving data in Excel long-term makes it vulnerable to updates, backwards compatibility issues, etc.:
>> - Because it is a **proprietary format**, it is possible that in the future the technology to open the file won't exist (or will become sufficiently rare to make opening it inconvenient, if not impossible).
>> - **Other spreadsheet software** may not be able to open files saved in a proprietary Excel format.
>> - **Different versions of Excel** may handle data differently, leading to inconsistencies.
>>* It can also conflict with data management requirements of journals, etc.
>* Qualities of an ideal storage medium:
>>* **Universal**: accessible to many/all programs
>>* **Static**: not liable to updates, obsolescence, etc.
>>* **Open**: minimal limitations on accessibility (e.g. software licenses, operating systems)
>* Text (ASCII data) is the most universal, static, and open way to handle information
>>* columns indicated by tabs/commas, observations separated by rows
>>* Column titles (a.k.a.
'headers') are in the top row
>>>* Avoid any unnecessary spaces or tabs in raw data (including in headers); they can confuse reading of the data (by human or computer)
>>>>* Keep only related observations on a sheet: rows not sharing the same format or level will still be interpreted as related data
>>>* The .csv (or .tsv) format can only handle one Excel 'tab' or sheet at a time, so avoid making 'sheets' in Excel (or an equivalent) significant (e.g. a tab for each experimental site)
>>> **General rule:** if the observations are related and will be analyzed together, put them in the same data file.
>>> Question: how to handle literal commas in data?
>>>>* As a general rule, avoid extraneous commas, but you can use single or double quotes to separate these out (they will be treated as a 'text string'); Excel is usually intelligent enough to catch these automatically, but you shouldn't count on it

## For next week:

1. Read Hadley Wickham, *Tidy Data*, Journal of Statistical Software, Vol. 59, Issue 10, Sep 2014. [http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10).
2. Download and install R and RStudio [with these instructions.](http://www.datacarpentry.org/R-ecology-lesson/#setup_instructions)

---

## Aug 2: Introduction to R

## What is R and why is it useful? Why use RStudio?

> ### Learning Objectives
>
> * Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.
> * Organize files and directories for a set of analyses as an R Project, and understand the purpose of the working directory.
> * Use the built-in RStudio help interface to search for more information on R functions.
> * Demonstrate how to provide sufficient information for troubleshooting with the R user community.

- The term "R" is used to refer to both the programming language and the software that interprets the scripts written using it.
- RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software.
To function correctly, RStudio needs R, and therefore both need to be installed on your computer.

### Why learn R?

#### R does not involve lots of pointing and clicking, and that's a good thing

The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that's a good thing! So, if you want to redo your analysis because you collected more data, you don't have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes. Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.

#### R code is great for reproducibility

Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

#### R is interdisciplinary and extensible

With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
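The "just run your script again" point above can be sketched with a toy example. The function name and all numbers here are made up for illustration: because the analysis lives in a script as a function, rerunning it on an updated dataset regenerates the result with no pointing and clicking.

```r
# A tiny "analysis" kept as a script: a function that summarizes weights.
summarize_weights <- function(weights) {
  mean(weights, na.rm = TRUE)  # na.rm = TRUE skips missing values
}

weights_2013 <- c(40, 52, 48)
summarize_weights(weights_2013)

# More data arrives later: nothing to re-click, just rerun the script.
weights_2014 <- c(weights_2013, 55, NA)
summarize_weights(weights_2014)  # 48.75
```

The same idea scales up: if the figures and tables in a manuscript are generated by scripts, correcting the dataset and rerunning updates everything consistently.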
#### R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won't make much difference to you. R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

#### R produces high-quality graphics

The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.

#### R has a large community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/).

#### Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

### Knowing your way around RStudio

Let's start by learning about [RStudio](https://www.rstudio.com/), which is an Integrated Development Environment (IDE) for working with R. We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio is divided into 4 "Panes": the **Source** for your scripts and documents (top-left, in the default layout), the R **Console** (bottom-left), your **Environment/History** (top-right), and your **Files/Plots/Packages/Help/Viewer** (bottom-right). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

### Getting set up

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the **working directory**. All of the scripts within this folder can then use *relative paths* to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.

RStudio provides a helpful set of tools to do this through its "Projects" interface, which not only creates a working directory for you but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Below, we will go through the steps for creating an "R Project" for this tutorial.

* Start RStudio (the presentation of RStudio, below, should happen here)
* Under the `File` menu, click on `New project`; choose `Existing directory` and select the folder already containing your project, or `New directory` and create one.
* Click on `Create project`

#### Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for **scripts**, **data**, and **documents**.
- **`data/`** Use this folder to store your raw data and any intermediate datasets you may create for the needs of a particular analysis. For the sake of transparency and [provenance](https://en.wikipedia.org/wiki/Provenance), you should *always* keep a copy of your raw data accessible, and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept separate from a `data/processed/tree.survey.csv` file generated by the `scripts/01.preprocess.tree_survey.R` script.
- **`documents/`** This would be a place to keep outlines, drafts, and other text.
- **`scripts/`** This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).

You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory. For this workshop, we will need a `data/` folder to store our raw data, and we will later create a `data_output/` folder when we learn how to export data as CSV files.

### Interacting with R

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or *code*, instructions in R because it is a common language that both the computer and we can understand. We call the instructions *commands*, and we tell the computer to follow the instructions by *executing* (also called *running*) those commands.

There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer.
It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press `Enter` to execute those commands, but they will be forgotten when you close the session.

Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.

RStudio allows you to execute commands directly from the script editor by using the <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd> shortcut (on Macs, <kbd>`Cmd`</kbd> + <kbd>`Return`</kbd> will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd>. You can find other keyboard shortcuts in this [RStudio cheatsheet about the RStudio IDE](https://github.com/rstudio/cheatsheets/blob/master/source/pdfs/rstudio-IDE-cheatsheet.pdf).

At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the <kbd>`Ctrl`</kbd> + <kbd>`1`</kbd> and <kbd>`Ctrl`</kbd> + <kbd>`2`</kbd> shortcuts, which allow you to jump between the script and the console panes.

If R is ready to accept commands, the R console shows a `>` prompt. If it receives a command (by typing, copy-pasting, or sending from the script editor using <kbd>`Ctrl`</kbd> + <kbd>`Enter`</kbd>), R will try to execute it, and when ready, will show the results and come back with a new `>` prompt to wait for new commands. If R is still waiting for you to enter more text because the command isn't complete yet, the console will show a `+` prompt.
It means that you haven't finished entering a complete command. This is because you have not 'closed' a parenthesis or quotation, i.e. you don't have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you had finished typing your command, click inside the console window and press `Esc`; this will cancel the incomplete command and return you to the `>` prompt.

### Seeking help

#### Use the built-in RStudio help interface to search for more information on R functions

One of the fastest ways to get help is to use the RStudio help interface. By default, this panel can be found at the lower right of RStudio. As seen in the screenshot, by typing the word "Mean", RStudio also gives a number of suggestions that you might be interested in. The description is then shown in the display window.

#### I know the name of the function I want to use, but I'm not sure how to use it

If you need help with a specific function, let's say `barplot()`, you can type `?barplot`. If you just need to remind yourself of the names of the arguments, you can use `args(lm)`.

#### I want to use a function that does X; there must be a function for it but I don't know which one...

If you are looking for a function to do a particular task, you can use the `help.search()` function, which is called by the double question mark `??`. However, this only looks through the installed packages for help pages with a match to your search request. If you can't find what you are looking for, you can use the [rdocumentation.org](http://www.rdocumentation.org) website, which searches through the help files across all packages available. Finally, a generic Google or internet search "R <task\>" will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.

#### I am stuck... I get an error message that I don't understand

Start by googling the error message. However, this doesn't always work very well, because often package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful for diagnosing a problem (e.g. "subscript out of bounds"). If the message is very generic, you might also include the name of the function or package you're using in your query.

You should also check Stack Overflow. Search using the `[r]` tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: [http://stackoverflow.com/questions/tagged/r](http://stackoverflow.com/questions/tagged/r)

The [Introduction to R](http://cran.r-project.org/doc/manuals/R-intro.pdf) can be dense for people with little programming experience, but it is a good place to understand the underpinnings of the R language. The [R FAQ](http://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical, but it is full of useful information.

#### Asking for help

The key to receiving help from someone is for them to rapidly grasp your problem. You should make it as easy as possible to pinpoint where the issue might be.

Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

If possible, try to reduce what doesn't work to a simple *reproducible example*. If you can reproduce the problem using a very small data frame instead of your 50,000-row and 10,000-column one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.
For instance, instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one. For more information on how to write a reproducible example see [this article by Hadley Wickham](http://adv-r.had.co.nz/Reproducibility.html).

To share an object with someone else, if it's relatively small, you can use the function `dput()`. It will output R code that can be used to recreate the exact same object as the one in memory:

```{r, eval=FALSE, purl=FALSE}
dput(head(iris)) # iris is an example data frame that comes with R; head() returns the first part of the data frame
```

If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file with `saveRDS()`. The content of this file is, however, not human readable and cannot be posted directly on Stack Overflow. Instead, it can be sent to someone by email, who can read it with the `readRDS()` command (here it is assumed that the downloaded file is in a `Downloads` folder in the user's home directory):

```{r, eval=FALSE, purl=FALSE}
some_data <- readRDS(file="~/Downloads/iris.rds")
```

Last, but certainly not least, **always include the output of `sessionInfo()`**, as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful in understanding your problem.

#### Where to ask for help?

* Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
* [Stack Overflow](http://stackoverflow.com/questions/tagged/r): if your question hasn't been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on [how to ask a good question](http://stackoverflow.com/help/how-to-ask).
* The [R-help mailing list](https://stat.ethz.ch/mailman/listinfo/r-help): it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect that it will come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
* If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package, which can be accessed using `packageDescription("name-of-package")`. You may also want to try to email the author of the package directly, or open an issue on the code repository (e.g., GitHub).
* There are also some topic-specific mailing lists (GIS, phylogenetics, etc.); the complete list is [here](http://www.r-project.org/mail.html).

#### More resources

* The [Posting Guide](http://www.r-project.org/posting-guide.html) for the R mailing lists.
* [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html): useful guidelines.
* [This blog post by Jon Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) has quite comprehensive advice on how to ask programming questions.
* The [reprex](https://cran.rstudio.com/web/packages/reprex/) package is very helpful to create reproducible examples when asking for help.
The [rOpenSci community call "How to ask questions so they get answered"](https://ropensci.org/blog/blog/2017/02/17/comm-call-v13) ([GitHub link](https://github.com/ropensci/commcalls/issues/14) and [video recording](https://vimeo.com/208749032)) includes a presentation of the reprex package and of its philosophy.

## R Basics

> ### Learning Objectives
>
> * Define the following terms as they relate to R: object, assign, call, function, arguments, options.
> * Create objects and assign values to them.
> * Use comments to inform script.
> * Do simple arithmetic operations in R using values and objects.
> * Call functions and use arguments to change their default options.
> * Inspect the content of vectors and manipulate their content.
> * Subset and extract values from vectors.
> * Correctly define and handle missing values in vectors.

You can get output from R simply by typing math in the console. However, to do useful and interesting things, we need to assign _values_ to _objects_. To create an object, we need to give it a name followed by the assignment operator `<-`, and the value we want to give it.

`<-` is the assignment operator. It assigns values on the right to objects on the left. So, after executing `x <- 3`, the value of `x` is `3`. The arrow can be read as 3 **goes into** `x`. For historical reasons, you can also use `=` for assignments, but not in every context. Because of the [slight](http://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) [differences](https://web.archive.org/web/20130610005305/https://stat.ethz.ch/pipermail/r-help/2009-March/191462.html) in syntax, it is good practice to always use `<-` for assignments. In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> at the same time as the <kbd>-</kbd> key) will write ` <- ` in a single keystroke.

Objects can be given any name such as `x`, `current_temperature`, or `subject_id`. You want your object names to be explicit and not too long.
They cannot start with a number (`2x` is not valid, but `x2` is). R is case sensitive (e.g., `weight_kg` is different from `Weight_kg`). There are some names that cannot be used because they are the names of fundamental functions in R (e.g., `if`, `else`, `for`; see [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a complete list). In general, even if it's allowed, it's best not to use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`). If in doubt, check the help to see if the name is already in use. It's also best to avoid dots (`.`) within a variable name, as in `my.dataset`. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and in other programming languages, it's best to avoid them. It is also recommended to use nouns for variable names, and verbs for function names.

It's important to be consistent in the styling of your code (where you put spaces, how you name variables, etc.). Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides are [Google's](https://google.github.io/styleguide/Rguide.xml), [Jean Fan's](http://jef.works/R-style-guide/) and the [tidyverse's](http://style.tidyverse.org/). The tidyverse's is very comprehensive and may seem overwhelming at first. You can install the [**`lintr`**](https://github.com/jimhester/lintr) package to automatically check for issues in the styling of your code.

When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

```{r, purl=FALSE}
weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of `weight_kg`
weight_kg          # and so does typing the name of the object
```

Now that R has `weight_kg` in memory, we can do arithmetic with it.
For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

```{r, purl=FALSE}
2.2 * weight_kg
```

We can also change a variable's value by assigning it a new one:

```{r, purl=FALSE}
weight_kg <- 57.5
2.2 * weight_kg
```

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a new variable, `weight_lb`:

```{r, purl=FALSE}
weight_lb <- 2.2 * weight_kg
```

and then change `weight_kg` to 100.

```{r, purl=FALSE}
weight_kg <- 100
```

What do you think is the current content of the object `weight_lb`? 126.5 or 220?

#### Comments

The comment character in R is `#`; anything to the right of a `#` in a script will be ignored by R. It is useful for leaving notes and explanations in your scripts. RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>.

> ### Challenge
>
> What are the values after each statement in the following?
>
> ```{r, purl=FALSE}
> mass <- 47.5            # mass?
> age  <- 122             # age?
> mass <- mass * 2.0      # mass?
> age  <- age - 20        # age?
> mass_index <- mass/age  # mass_index?
> ```

>> Successive assignments relative to existing values are possible (*i.e.* an object can be redefined in terms of its current value)
>> R evaluates the right-hand side first, then assigns the result to the object (with `<-` or `=`)

#### Functions and their arguments

Functions are "canned scripts" that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R *packages* (more on that later). A function usually gets one or more inputs called *arguments*.
Functions often (but not always) return a *value*. A typical example would be the function `sqrt()`. The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function ('running it') is called *calling* the function. An example of a function call is: ```{r, eval=FALSE, purl=FALSE} b <- sqrt(a) ``` Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function calculates the square root, and returns the value which is then assigned to variable `b`. This function is very simple, because it takes just one argument. The return 'value' of a function need not be numerical (like that of `sqrt()`), and it also does not need to be a single item: it can be a set of things, or even a dataset. We'll see that when we read data files into R. Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a *default* value: these are called *options*. Options are typically used to alter the way the function operates, such as whether it ignores 'bad values', or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default. Let's try a function that can take multiple arguments: `round()`. ```{r, results='show', purl=FALSE} round(3.14159) ``` Here, we've called `round()` with just one argument, `3.14159`, and it has returned the value `3`. That's because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the `round` function. We can use `args(round)` or look at the help for this function using `?round`. 
```{r, results='show', purl=FALSE}
args(round)
```

```{r, eval=FALSE, purl=FALSE}
?round
```

We see that if we want a different number of digits, we can type `digits=2` or however many we want.

```{r, results='show', purl=FALSE}
round(3.14159, digits = 2)
```

If you provide the arguments in the exact same order as they are defined, you don't have to name them:

```{r, results='show', purl=FALSE}
round(3.14159, 2)
```

And if you do name the arguments, you can switch their order:

```{r, results='show', purl=FALSE}
round(digits = 2, x = 3.14159)
```

It's good practice to put the non-optional arguments (like the number you're rounding) first in your function call, and to specify the names of all optional arguments. If you don't, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you're doing.

#### Objects vs. variables

What are known as `objects` in `R` are known as `variables` in many other programming languages. Depending on the context, `object` and `variable` can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

### Vectors and data types

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the `c()` function. For example, we can create a vector of animal weights and assign it to a new object `weight_g`:

```{r, purl=FALSE}
weight_g <- c(50, 60, 65, 82)
weight_g
```

A vector can also contain characters:

```{r, purl=FALSE}
animals <- c("mouse", "rat", "dog")
animals
```

The quotes around "mouse", "rat", etc. are essential here. Without the quotes R will assume there are objects called `mouse`, `rat` and `dog`. As these objects don't exist in R's memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. `length()` tells you how many elements are in a particular vector:

```{r, purl=FALSE}
length(weight_g)
length(animals)
```

An important feature of a vector is that all of the elements are the same type of data. The function `class()` indicates the class (the type of element) of an object:

```{r, purl=FALSE}
class(weight_g)
class(animals)
```

The function `str()` provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

```{r, purl=FALSE}
str(weight_g)
str(animals)
```

You can use the `c()` function to add other elements to your vector:

```{r, purl=FALSE}
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
```

In the first line, we take the original vector `weight_g`, add the value `90` to the end of it, and save the result back into `weight_g`. Then we add the value `30` to the beginning, again saving the result back into `weight_g`. We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.

We just saw 2 of the 6 main **atomic vector** types (or **data types**) that R uses: `"character"` and `"numeric"`. These are the basic building blocks that all R objects are built from. The other 4 are:

* `"logical"` for `TRUE` and `FALSE` (the boolean data type)
* `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R that it's an integer)
* `"complex"` to represent complex numbers with real and imaginary parts (e.g., `1 + 4i`) and that's all we're going to say about them
* `"raw"` that we won't discuss further

Vectors are one of the many **data structures** that R uses. Other important ones are lists (`list`), matrices (`matrix`), data frames (`data.frame`), factors (`factor`) and arrays (`array`).
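Each atomic type above can be confirmed directly with `class()`. A minimal sketch, using only base R:

```{r, purl=FALSE}
class(TRUE)    # "logical"
class(2L)      # "integer" (the L suffix marks an integer)
class(2)       # "numeric" (plain numbers are stored as doubles)
class(1 + 4i)  # "complex"
class("mouse") # "character"
```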
> ### Challenge > > * We’ve seen that atomic vectors can be of type character, numeric, integer, and logical. But what happens if we try to mix these types in a single vector? > > * What will happen in each of these examples? (hint: use `class()` to check the data type of your objects): > > ```r > num_char <- c(1, 2, 3, 'a') > num_logical <- c(1, 2, 3, TRUE) > char_logical <- c('a', 'b', 'c', TRUE) > tricky <- c(1, 2, 3, '4') > ``` > > * Why do you think it happens? > > * You've probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class _coercion_. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced? ### Subsetting vectors If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance: ```{r, results='show', purl=FALSE} animals <- c("mouse", "rat", "dog", "cat") animals[2] animals[c(3, 2)] ``` We can also repeat the indices to create an object with more elements than the original one: ```{r, results='show', purl=FALSE} more_animals <- animals[c(1, 2, 3, 2, 1, 4)] more_animals ``` R indices start at 1, because that's what human beings typically do. Other programming languages (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. #### Conditional subsetting Another common way of subsetting is by using a logical vector. `TRUE` will select the element with the same index, while `FALSE` will not: ```{r, results='show', purl=FALSE} weight_g <- c(21, 34, 39, 54, 55) weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] ``` Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. 
For instance, if you wanted to select only the values above 50:

```{r, results='show', purl=FALSE}
weight_g > 50    # will return logicals with TRUE for the indices that meet the condition
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
```

You can combine multiple tests using `&` (both conditions are true, AND) or `|` (at least one of the conditions is true, OR):

```{r, results='show', purl=FALSE}
weight_g[weight_g < 30 | weight_g > 50]
weight_g[weight_g >= 30 & weight_g == 21]
```

Here, `<` stands for "less than", `>` for "greater than", `>=` for "greater than or equal to", and `==` for "equal to". The double equal sign `==` is a test for numerical equality between the left and right hand sides, and should not be confused with the single `=` sign, which performs variable assignment (similar to `<-`).

A common task is to search for certain strings in a vector. One could use the "or" operator `|` to test for equality to multiple values, but this can quickly become tedious. The function `%in%` tests, for each element of the vector on its left, whether it is found in the vector on its right:

```{r, results='show', purl=FALSE}
animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
animals %in% c("rat", "cat", "dog", "duck", "goat")
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
```

> ### Challenge
>
> * Can you figure out why `"four" > "five"` returns `TRUE`?

<!--
```{r, purl=FALSE}
## Answers
## * When using ">" or "<" on strings, R compares their alphabetical order. Here
## "four" comes after "five", and therefore is "greater than" it.
```
-->

### Missing data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as `NA`. When doing operations on numbers, most functions will return `NA` if the data you are working with include missing values.
This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument `na.rm=TRUE` to calculate the result while ignoring the missing values. ```{r, purl=FALSE} heights <- c(2, 4, 4, NA, 6) mean(heights) max(heights) mean(heights, na.rm = TRUE) max(heights, na.rm = TRUE) ``` If your data include missing values, you may want to become familiar with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See below for examples. ```{r, purl=FALSE} ## Extract those elements which are not missing values. heights[!is.na(heights)] ## Returns the object with incomplete cases removed. ## The returned object is atomic. na.omit(heights) ## Extract those elements which are complete cases. heights[complete.cases(heights)] ``` > ### Challenge > > 1. Using this vector of length measurements, create a new vector with the NAs removed. > > ```r > lengths <- c(10,24,NA,18,NA,20) > ``` > > 2. Use the function `median()` to calculate the median of the `lengths` vector. Now that we have learned how to write scripts, and the basics of R's data structures, we are ready to start working with the Portal dataset we have been using in the other lessons, and learn about data frames. ## Starting with data ### Presentation of the Survey Data We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. 
Each row holds information for a single animal, and the columns represent: | Column | Description | |------------------|------------------------------------| | record\_id | Unique id for the observation | | month | month of observation | | day | day of observation | | year | year of observation | | plot\_id | ID of a particular plot | | species\_id | 2-letter code | | sex | sex of animal ("M", "F") | | hindfoot\_length | length of the hindfoot in mm | | weight | weight of the animal in grams | | genus | genus of animal | | species | species of animal | | taxa | e.g. Rodent, Reptile, Bird, Rabbit | | plot\_type | type of plot | We are going to use the R function `download.file()` to download the CSV file that contains the survey data from figshare, and we will use `read.csv()` to load into memory the content of the CSV file as an object of class `data.frame`. To download the data into the `data/` subdirectory (or whatever directory you would like to put the data in), run the following (replacing `data/` with the name of the directory you choose, or nothing if you wish to download it into your working directory): ```{r, eval=FALSE, purl=TRUE} download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv") ``` You are now ready to load the data (again, if you have a different name for your data directory, or don't have one, change `data/` to the correct name or delete): ```{r, eval=TRUE, purl=FALSE} surveys <- read.csv('data/portal_data_joined.csv') ``` This statement doesn't produce any output because, as you might recall, assignments don't display anything. If we want to check that our data has been loaded, we can print the variable's value: `surveys`. Wow... that was a lot of output. At least it means the data loaded properly. An easier check is to look at the top (the first 6 lines) of this data frame using the function `head()`. ### For next week: Load your data (or the data we just downloaded) into R's memory as an object. 1. 
What is the class and structure of the object containing this data?

2. Use the function `head()` to look at the first few lines of the data, and compare against the table above (or your knowledge of your own data). Do the data look as you expect them to?

## Aug 9: Working with dataframes

## Starting with data, continued

> ### Learning Objectives
>
> * Describe what a data frame is.
> * Load external data from a .csv file into a data frame in R.
> * Summarize the contents of a data frame in R.
> * Manipulate categorical data in R.
> * Change how character strings are handled in a data frame.
> * Format dates in R.

If you haven't already, load the data (remember that the file path and/or name might be different for you):

```{r, eval=TRUE, purl=FALSE}
surveys <- read.csv('data/portal_data_joined.csv')
```

### What are data frames?

Data frames are the _de facto_ data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by the functions `read.csv()` or `read.table()`; in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors). We can see this when inspecting the <b>str</b>ucture of a data frame with the function `str()`.

### Inspecting `data.frame` Objects

We already saw how the functions `head()` and `str()` can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let's try them out!
* Size:
    * `dim(surveys)` - returns a vector with the number of rows in the first element, and the number of columns as the second element (the **dim**ensions of the object)
    * `nrow(surveys)` - returns the number of rows
    * `ncol(surveys)` - returns the number of columns
* Content:
    * `head(surveys)` - shows the first 6 rows
    * `tail(surveys)` - shows the last 6 rows
* Names:
    * `names(surveys)` - returns the column names (synonym of `colnames()` for `data.frame` objects)
    * `rownames(surveys)` - returns the row names
* Summary:
    * `str(surveys)` - structure of the object and information about the class, length and content of each column
    * `summary(surveys)` - summary statistics for each column

Note: most of these functions are "generic"; they can be used on other types of objects besides `data.frame`.

> ### Challenge
>
> Based on the output of `str(surveys)`, can you answer the following questions?
>
> * What is the class of the object `surveys`?
> * How many rows and how many columns are in this object?
> * How many species have been recorded during these surveys?

<!---
```{r, echo=FALSE, purl=FALSE}
## Answers
## * class: data frame
## * how many rows: 34786, how many columns: 13
## * how many species: 48
```
--->

### Indexing and subsetting data frames

Our survey data frame has rows and columns (it has 2 dimensions); if we want to extract some specific data from it, we need to specify the "coordinates" we want. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
```{r, purl=FALSE} surveys[1, 1] # first element in the first column of the data frame (as a vector) surveys[1, 6] # first element in the 6th column (as a vector) surveys[, 1] # first column in the data frame (as a vector) surveys[1] # first column in the data frame (as a data.frame) surveys[1:3, 7] # first three elements in the 7th column (as a vector) surveys[3, ] # the 3rd element for all columns (as a data.frame) head_surveys <- surveys[1:6, ] # equivalent to head(surveys) ``` `:` is a special function that creates numeric vectors of integers in increasing or decreasing order, test `1:10` and `10:1` for instance. You can also exclude certain parts of a data frame using the "`-`" sign: ```{r, purl=FALSE} surveys[,-1] # The whole data frame, except the first column surveys[-c(7:34786),] # Equivalent to head(surveys) ``` As well as using numeric values to subset a `data.frame` (or `matrix`), columns can be called by name, using one of the four following notations: ```{r, eval = FALSE, purl=FALSE} surveys["species_id"] # Result is a data.frame surveys[, "species_id"] # Result is a vector surveys[["species_id"]] # Result is a vector surveys$species_id # Result is a vector ``` For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name. > ### Challenge > > 1. Create a `data.frame` (`surveys_200`) containing only the observations from row 200 of the `surveys` dataset. > > 2. Notice how `nrow()` gave you the number of rows in a `data.frame`? > > * Use that number to pull out just that last row in the data frame. > * Compare that with what you see as the last row using `tail()` to make sure it's meeting expectations. > * Pull out that last row using `nrow()` instead of the row number. > * Create a new data frame object (`surveys_last`) from that last row. > > 3. 
Use `nrow()` to extract the row that is in the middle of the data frame. Store the content of this row in an object named `surveys_middle`.
>
> 4. Combine `nrow()` with the `-` notation above to reproduce the behavior of `head(surveys)`, keeping just the first through 6th rows of the surveys dataset.

<!---
```{r, purl=FALSE}
## Answers
surveys_200 <- surveys[200, ]
surveys_last <- surveys[nrow(surveys), ]
surveys_middle <- surveys[nrow(surveys)/2, ]
surveys_head <- surveys[-c(7:nrow(surveys)),]
```
--->

### Factors

When we did `str(surveys)` we saw that several of the columns consist of integers; however, the columns `genus`, `species`, `sex`, `plot_type`, ... are of a special class called a `factor`. Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.

Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as *levels*. By default, R always sorts *levels* in alphabetical order. For instance, if you have a factor with 2 levels:

```{r, purl=TRUE}
sex <- factor(c("male", "female", "female", "male"))
```

R will assign `1` to the level `"female"` and `2` to the level `"male"` (because `f` comes before `m`, even though the first element in this vector is `"male"`). You can check this by using the function `levels()`, and check the number of levels using `nlevels()`:

```{r, purl=FALSE}
levels(sex)
nlevels(sex)
```

Sometimes, the order of the factors does not matter; other times you might want to specify the order because it is meaningful (e.g., "low", "medium", "high"), it improves your visualization, or it is required by a particular type of analysis.
Here, one way to reorder our levels in the `sex` vector would be:

```{r, results=TRUE, purl=FALSE}
sex # current order
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
```

In R's memory, these factors are represented by integers (here 1 and 2), but they are more informative than plain integers because factors are self-describing: `"female"`, `"male"` is more descriptive than `1`, `2`. Which one is "male"? You wouldn't be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset).

#### Converting factors

If you need to convert a factor to a character vector, you use `as.character(x)`.

```{r, purl=FALSE}
as.character(sex)
```

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then to numbers. Another method is to use the `levels()` function. Compare:

```{r, purl=TRUE}
f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f)               # wrong! and there is no warning...
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f]    # The recommended way.
```

Notice that in the `levels()` approach, three important steps occur:

* We obtain all the factor levels using `levels(f)`
* We convert these levels to numeric values using `as.numeric(levels(f))`
* We then access these numeric values using the underlying integers of the vector `f` inside the square brackets

#### Renaming factors

When your data is stored as a factor, you can use the `plot()` function to get a quick glance at the number of observations represented by each factor level.
Let's look at the number of males and females captured over the course of the experiment: ```{r, purl=TRUE} ## bar plot of the number of females and males captured during the experiment: plot(surveys$sex) ``` In addition to males and females, there are about 1700 individuals for which the sex information hasn't been recorded. Additionally, for these individuals, there is no label to indicate that the information is missing. Let's rename this label to something more meaningful. Before doing that, we're going to pull out the data on sex and work with that data, so we're not modifying the working copy of the data frame: ```{r, results=TRUE, purl=FALSE} sex <- surveys$sex head(sex) levels(sex) levels(sex)[1] <- "missing" levels(sex) head(sex) ``` > ### Challenge > > * Rename "F" and "M" to "female" and "male" respectively. > * Now that we have renamed the factor level to "missing", can you recreate the barplot such that "missing" is last (after "male")? > <!--- ```{r correct-order, purl=FALSE} ## Answers levels(sex)[2:3] <- c("female", "male") sex <- factor(sex, levels = c("female", "male", "missing")) plot(sex) ``` ---> #### Using `stringsAsFactors=FALSE` By default, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the `factor` data type. Depending on what you want to do with the data, you may want to keep these columns as `character`. To do so, `read.csv()` and `read.table()` have an argument called `stringsAsFactors` which can be set to `FALSE`. In many cases, it's preferable to set `stringsAsFactors = FALSE` when importing your data, and converting as a factor only the columns that require this data type. 
Compare the output of `str(surveys)` when setting `stringsAsFactors = TRUE` (default) and `stringsAsFactors = FALSE` (remembering to change the file path as needed):

```{r, eval=FALSE, purl=FALSE}
## Compare the difference between when the data are being read as
## `factor`, and when they are being read as `character`.
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = TRUE)
str(surveys)
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = FALSE)
str(surveys)
## Convert the column "plot_type" into a factor
surveys$plot_type <- factor(surveys$plot_type)
```

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers, for instance).

## Homework

> ### Challenge
>
> 1. We have seen how data frames are created when using `read.csv()`, but they can also be created by hand with the `data.frame()` function. There are a few mistakes in this hand-crafted `data.frame`; can you spot and fix them? Don't hesitate to experiment!
>
> ```{r, eval=FALSE, purl=FALSE}
> animal_data <- data.frame(animal=c(dog, cat, sea cucumber, sea urchin),
>                           feel=c("furry", "squishy", "spiny"),
>                           weight=c(45, 8 1.1, 0.8))
> ```
>
> 2. Can you predict the class for each of the columns in the following example? Check your guesses using `str(country_climate)`:
>      * Are they what you expected? Why? Why not?
>      * What would have been different if we had added `stringsAsFactors = FALSE` to this call?
>      * What would you need to change to ensure that each column had the accurate data type?
>
> ```{r, eval=FALSE, purl=FALSE}
> country_climate <- data.frame(
>        country=c("Canada", "Panama", "South Africa", "Australia"),
>        climate=c("cold", "hot", "temperate", "hot/temperate"),
>        temperature=c(10, 30, 18, "15"),
>        northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
>        has_kangaroo=c(FALSE, FALSE, FALSE, 1)
>        )
> ```

<!--- Answers

```{r, eval=FALSE, echo=FALSE, purl=FALSE}
## Answers
## * missing quotations around the names of the animals
## * missing one entry in the "feel" column (probably for one of the furry animals)
## * missing one comma in the weight column

## Answers
## * `country`, `climate`, `temperature`, and `northern_hemisphere` are
##   factors; `has_kangaroo` is numeric.
## * using `stringsAsFactors=FALSE` would have made them character instead of
##   factors
## * removing the quotes in temperature, northern_hemisphere, and replacing 1
##   by TRUE in the `has_kangaroo` column would probably be what was originally
##   intended.
```

--->

### Formatting Dates

As we've already discussed, dates and times can be difficult to represent correctly when communicating with computers (and other people, too!). We've recommended some ways to store dates that can help with this, including some standard formats (e.g. YYYYMMDD, or storing each element of the date in a separate variable). Here we're going to briefly introduce another way to deal with dates AND times in R using the POSIX format. For more information and other approaches, check out [this helpful guide](https://www.stat.berkeley.edu/~s133/dates.html), from which this section borrows heavily.

POSIX stands for "portable operating system interface" and is used by many operating systems, including UNIX systems. Dates stored in the POSIX format are date/time values and allow modification of time zones. POSIX date classes store times to the nearest second, which can be useful if you have data at that scale.
There are two POSIX date/time classes, which differ in the way that the values are stored internally. The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list with elements for second, minute, hour, day, month, and year, among others. Unless you need the list nature of the POSIXlt class, the POSIXct class is the usual choice for storing dates in R. The ggplot2 plotting package that we will use next week uses the POSIXct class.

The `surveys` dataset has a separate column for day, month, and year, and each contains integer values, as we can confirm with `str()`:

```{r, eval=FALSE, purl=FALSE}
str(surveys)
```

We first need to make a character vector for our dates in the default input format for POSIX dates: the year, followed by the month and day, separated by slashes or dashes. For date/time values, the date may be followed by white space (e.g. space or tab) and a time in the form hour:minutes:seconds or hour:minutes, which then may be followed by white space and the time zone. Here are some examples of valid POSIX inputs:

    1915/6/16
    2005-06-24 11:25
    1990/2/17 12:20:05
    2012-7-31 12:20:05 MST

```{r, purl=FALSE}
## Create a date character vector
## Using the '$' operator to add it to the surveys data frame
## Using the function paste(), which pastes values together in a character string
## The 'sep' argument indicates the character to use to separate each component
surveys$date <- paste(surveys$year, surveys$month, surveys$day, sep="-")
head(surveys$date)
class(surveys$date)
str(surveys) # notice the new 'date' column, with 'chr' as the class
```

The new variable `date` is character class. To convert it to POSIX format, you'll need to modify it using the `as.POSIXct()` function.
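As a standalone preview of the conversion (a minimal sketch on a single hand-made date string, not the surveys data), including pulling a component back out with `format()`:

```{r, purl=FALSE}
## Convert one character date to POSIXct
d <- as.POSIXct("1977-07-16", tz = "UTC", format = "%Y-%m-%d")

class(d)        # "POSIXct" "POSIXt"
format(d, "%Y") # pull the year back out: "1977"
```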
```{r, purl=FALSE}
## Format date as POSIXct
## The 'tz' argument allows you to specify the timezone (more important if you actually have time data)
## "UTC" is GMT and "" indicates the current timezone
## The 'format' argument tells the function what order the date components are in
surveys$date <- as.POSIXct(surveys$date, tz="UTC", format="%Y-%m-%d")
class(surveys$date)
```

Great! That sets us up to start working with date values.

## Aug 16: Data manipulation with dplyr and tidyr

> ### Learning Objectives
>
> * Understand what an R package is and how to install one.
>
> * Understand the purpose of the **`dplyr`** and **`tidyr`** packages.
>
> * Select certain columns in a data frame with the **`dplyr`** function `select`.
>
> * Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
>
> * Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
>
> * Add new columns to a data frame that are functions of existing columns with `mutate`.
>
> * Understand the split-apply-combine concept for data analysis.
>
> * Use `summarize`, `group_by`, and `tally` to split a data frame into groups of observations, apply summary statistics to each group, and then combine the results.
>
> * Understand the concept of a wide and a long table format and for which purpose those formats are useful.
>
> * Understand what key-value pairs are.
>
> * Reshape a data frame from long to wide format and back with the `spread` and `gather` commands from the **`tidyr`** package.

------------

## Packages

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter **`dplyr`**. **`dplyr`** is a package for making tabular data manipulation easier. It pairs nicely with **`tidyr`**, which enables you to swiftly convert between different data formats for plotting and analysis.
Packages in R are basically sets of additional functions that let you do more stuff. The functions we've been using so far, like `str()` or `data.frame()`, come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine; after that, you load it in each R session in which you need it. You can install packages within RStudio using the 'Packages' tab in the lower right quadrant, or you can install them directly using the function `install.packages()`:

```{r, eval = FALSE, purl = FALSE}
install.packages("tidyverse") ## install the tidyverse packages
```

The **`tidyverse`** package is an "umbrella-package" that installs several packages useful for data analysis which work well together, such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, etc. To load the package type:

```{r, message = FALSE, purl = FALSE}
library("tidyverse") ## load the tidyverse packages, incl. dplyr
library("dplyr")     ## if you just wanted to load one package
```

By default, you need to load most packages each time you start a new R session. It is useful to load the packages you will need at the top of your script (e.g. add the `library()` calls there), so that if you need to re-run your analyses you don't need to remember to load the packages first.

There are many packages available for R, with new ones being developed every day. Anyone (including you!) can make a package, as long as they follow some [simple guidelines.](http://r-pkgs.had.co.nz/) If you think that there might be a package with functions you could use, there probably is. Try googling 'R package *thing I need to do*' and see what comes up.

## dplyr for data manipulation

#### What are **`dplyr`** and **`tidyr`**?

The package **`dplyr`** provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++).
The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is nontrivial, and **`tidyr`** gives you tools for this and more sophisticated data manipulation.

To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want to check out this [handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/blob/master/source/pdfs/data-import-cheatsheet.pdf).

### Selecting columns and filtering rows

We're going to learn some of the most common **`dplyr`** functions: `select()`, `filter()`, `mutate()`, `group_by()`, and `summarize()`. To select columns of a data frame, use `select()`. The first argument to this function is the data frame (`surveys`), and the subsequent arguments are the columns to keep.

```{r, results = 'hide', purl = FALSE}
select(surveys, plot_id, species_id, weight)
```

To choose rows based on specific criteria, use `filter()`:

```{r, purl = FALSE}
filter(surveys, year == 1995)
```

### Pipes

But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested, as things are evaluated from the inside out. The last option, pipes, is a fairly recent addition to R.
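For comparison, here is what the first two approaches look like (a sketch using the same `surveys` columns as the pipe example that follows; `surveys2` is just a hypothetical name for the temporary object):

```{r, eval = FALSE, purl = FALSE}
## Intermediate steps: create a temporary object for each stage
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)

## Nested functions: the same operation, read from the inside out
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
```

Both produce the same result; the difference is only in readability and workspace clutter.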
Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like `%>%` and are made available via the `magrittr` package, installed automatically with **`dplyr`**. If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac.

```{r, purl = FALSE}
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
```

In the above, we use the pipe to send the `surveys` dataset first through `filter()` to keep rows where `weight` is less than 5, then through `select()` to keep only the `species_id`, `sex`, and `weight` columns. Since `%>%` takes the object on its left and passes it as the first argument to the function on its right, we don't need to explicitly include it as an argument to the `filter()` and `select()` functions anymore.

If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:

```{r, purl = FALSE}
surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

surveys_sml
```

Note that the final data frame is the leftmost part of this expression.

> ### Challenge
>
> Using pipes, subset the `surveys` data to include individuals collected before 1995 and retain only the columns `year`, `sex`, and `weight`.

<!---
```{r, eval=FALSE, purl=FALSE}
## Answer
surveys %>%
  filter(year < 1995) %>%
  select(year, sex, weight)
```
--->

### Mutate

Frequently you'll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we'll use `mutate()`.
To create a new column of weight in kg:

```{r, purl = FALSE}
surveys %>%
  mutate(weight_kg = weight / 1000)
```

You can also create a second new column based on the first new column within the same call of `mutate()`:

```{r, purl = FALSE}
surveys %>%
  mutate(weight_kg = weight / 1000,
         weight_kg2 = weight_kg * 2)
```

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the `head()` of the data. (Pipes work with non-**`dplyr`** functions, too, as long as the **`dplyr`** or `magrittr` package is loaded).

```{r, purl = FALSE}
surveys %>%
  mutate(weight_kg = weight / 1000) %>%
  head
```

Note that we don't include parentheses at the end of our call to `head()` above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. `head` or `head()`).

The first few rows of the output are full of `NA`s, so if we wanted to remove those we could insert a `filter()` in the chain:

```{r, purl = FALSE}
surveys %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head
```

`is.na()` is a function that determines whether something is an `NA`. The `!` symbol negates the result, so we're asking for everything that *is not* an `NA`.

> ### Challenge
>
> Create a new data frame from the `surveys` data that meets the following criteria: contains only the `species_id` column and a new column called `hindfoot_half` containing values that are half the `hindfoot_length` values. In this `hindfoot_half` column, there are no `NA`s and all values are less than 30.
>
> **Hint**: think about how the commands should be ordered to produce this data frame!
<!---
```{r, eval=FALSE, purl=FALSE}
## Answer
surveys_hindfoot_half <- surveys %>%
  filter(!is.na(hindfoot_length)) %>%
  mutate(hindfoot_half = hindfoot_length / 2) %>%
  filter(hindfoot_half < 30) %>%
  select(species_id, hindfoot_half)
```
--->

### Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results. **`dplyr`** makes this very easy through the use of the `group_by()` function.

#### The `summarize()` function

`group_by()` is often used together with `summarize()`, which collapses each group into a single-row summary of that group. `group_by()` takes as arguments the column names that contain the **categorical** variables for which you want to calculate the summary statistics. So to view the mean `weight` by sex:

```{r, purl = FALSE}
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```

You may also have noticed that the output from these calls doesn't run off the screen anymore. That's because **`dplyr`** has changed our `data.frame` object to an object of class `tbl_df`, also known as a "tibble". A tibble's data structure is very similar to a data frame. For our purposes the only differences are that (1) in addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen, and (2) columns of class `character` are never converted into factors.

You can also group by multiple columns:

```{r, purl = FALSE}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```

When grouping both by `sex` and `species_id`, the first rows are for individuals that escaped before their sex could be determined and they could be weighed. You may notice that the last column does not contain `NA` but `NaN` (which refers to "Not a Number").
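The `NaN` comes from asking for the mean of an empty set of values: when every weight in a group is `NA`, `na.rm = TRUE` removes them all and there is nothing left to average. A self-contained illustration:

```{r, purl = FALSE}
## Removing all the NAs leaves a zero-length vector, and the mean of nothing is NaN
mean(c(NA, NA), na.rm = TRUE)
```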
To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed, we can omit `na.rm = TRUE` when computing the mean:

```{r, purl = FALSE}
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))
```

Here, again, the output from these calls doesn't run off the screen anymore. Recall that **`dplyr`** has changed our object from `data.frame` to `tbl_df`. If you want to display more data, you can use the `print()` function at the end of your chain with the argument `n` specifying the number of rows to display:

```{r, purl = FALSE}
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  print(n = 15)
```

Once the data are grouped, you can also compute multiple summary statistics at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:

```{r, purl = FALSE}
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight))
```

### Tallying

When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, **`dplyr`** provides `tally()`. For example, if we wanted to group by sex and find the number of rows of data for each sex, we would do:

```{r, purl = FALSE}
surveys %>%
  group_by(sex) %>%
  tally
```

Here, `tally()` is the action applied to the groups created by `group_by()` and counts the total number of records for each category.

> ### Challenge
>
> 1. How many individuals were caught in each `plot_type` surveyed?
>
> 2. Use `group_by()` and `summarize()` to find the mean, min, and max hindfoot length for each species (using `species_id`).
>
> 3. What was the heaviest animal measured in each year?
Return the columns `year`, `genus`, `species`, and `weight`.
>
> 4. You saw above how to count the number of individuals of each `sex` using a combination of `group_by()` and `tally()`. How could you get the same result using `group_by()` and `summarize()`? Hint: see `?n`.

<!---
```{r, echo=FALSE, purl=FALSE}
## Answer 1
surveys %>%
  group_by(plot_type) %>%
  tally

## Answer 2
surveys %>%
  filter(!is.na(hindfoot_length)) %>%
  group_by(species_id) %>%
  summarize(
    mean_hindfoot_length = mean(hindfoot_length),
    min_hindfoot_length = min(hindfoot_length),
    max_hindfoot_length = max(hindfoot_length)
  )

## Answer 3
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(year) %>%
  filter(weight == max(weight)) %>%
  select(year, genus, species, weight) %>%
  arrange(year)

## Answer 4
surveys %>%
  group_by(sex) %>%
  summarize(n = n())
```
--->

## Exporting data

Now that you have learned how to use **`dplyr`** to extract information from or summarize your raw data, you may want to export these new datasets. Similar to the `read.csv()` function used for reading CSV files into R, there is a `write.csv()` function that generates CSV files from data frames.

Before using `write.csv()`, we are going to create a new folder, `data_output`, in our working directory that will store this generated dataset. We don't want to write generated datasets in the same directory as our raw data. It's good practice to keep them separate. The `data` folder should only contain the raw, unaltered data, and should be left alone to make sure we don't delete or modify it. In contrast, our script will generate the contents of the `data_output` directory, so even if the files it contains are deleted, we can always re-generate them.

To try this out, we can prepare a version of the dataset that doesn't include any missing data. Let's start by removing observations for which the `species_id` is missing. In this dataset, the missing species are represented by an empty string and not an `NA`.
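As an aside, you can also create the `data_output` folder from within R itself, so the script works even on a fresh copy of the project (a small base-R sketch; `showWarnings = FALSE` keeps it quiet if the folder already exists):

```{r, purl=FALSE}
dir.create("data_output", showWarnings = FALSE)
```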
Let's also remove observations for which `weight` and the `hindfoot_length` are missing. This dataset should also only contain observations of animals for which the sex has been determined:

```{r, purl=FALSE}
surveys_complete <- surveys %>%
  filter(species_id != "",         # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         sex != "")                # remove missing sex
```

We might be interested in plotting how species abundances have changed through time, in which case we could remove observations for rare species (i.e., those that have been observed fewer than 50 times). We can do this in two steps: first we are going to create a dataset that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species:

```{r, purl=FALSE}
## Extract the most common species_id
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally %>%
  filter(n >= 50)

## Only keep the most common species
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
```

```{r, eval=FALSE, purl=TRUE, echo=FALSE}
### Create the dataset for exporting:
## Start by removing observations for which the `species_id`, `weight`,
## `hindfoot_length`, or `sex` data are missing:
surveys_complete <- surveys %>%
  filter(species_id != "",         # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         sex != "")                # remove missing sex

## Now remove rare species in two steps.
## First, make a list of species which
## appear at least 50 times in our dataset:
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally %>%
  filter(n >= 50) %>%
  select(species_id)

## Second, keep only those species:
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
```

Now that the dataset is ready, we can save it as a CSV file in our `data_output` folder. By default, `write.csv()` includes a column with row names (in our case the names are just the row numbers), so we need to add `row.names = FALSE` so they are not included:

```{r, purl=FALSE, eval=FALSE}
write.csv(surveys_complete, file = "data_output/surveys_complete.csv",
          row.names = FALSE)
```

## Homework

Take your data (or the surveys data) and create a filtered, subset, grouped by, or otherwise manipulated data frame. Now export it using `write.csv()`.

## Extra credit (not covered in class)

### Reshaping with gather and spread

**`dplyr`** is one part of a larger **`tidyverse`** that enables you to work with data in tidy data formats. **`tidyr`** enables a wide range of manipulations of the structure of the data itself. For example, the survey data presented here is almost in what we call a **long** format - every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots. For example, what is the relationship between mean weights of different genera across the entire data set?

To answer that question, we'd want each plot to have a single row, with all of the measurements in a single plot having their own column. This is called a **wide** data format. For the `surveys` data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we might begin to have some relationships we'd want to examine.
Let's see this in action. First, using **`dplyr`**, let's create a data frame with the mean body weight of each genus by plot.

```{r, purl=FALSE}
surveys_gw <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(genus, plot_id) %>%
  summarize(mean_weight = mean(weight))

head(surveys_gw)
```

#### Long to Wide with `spread`

Now, to make this long data wide, we use `spread` from `tidyr` to spread out the different taxa into columns. `spread` takes three arguments: the data, the *key* column (the column with the identifying information), and the *values* column (the one with the numbers). We'll use a pipe so we can ignore the data argument.

```{r, purl=FALSE}
surveys_gw_wide <- surveys_gw %>%
  spread(genus, mean_weight)

head(surveys_gw_wide)
```

Notice that some genera have `NA` values. That's because some of those genera don't have any record in that plot. Sometimes it is fine to leave those as `NA`. Sometimes we want to fill them in as zeros, in which case we would add the argument `fill = 0`.

```{r, purl=FALSE}
surveys_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  head
```

We can now do things like plot the weight of *Baiomys* against *Chaetodipus* or examine their correlation.

```{r, purl=FALSE}
surveys_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  cor(use = "pairwise.complete")
```

#### Wide to long with `gather`

What if we had the opposite problem, and wanted to go from a wide to a long format? For that, we use `gather` to sweep up a set of columns into one key-value pair. We give it the arguments of a new key and value column name, and then we specify which columns we either want or do not want gathered up. So, to go backwards from `surveys_gw_wide`, and exclude `plot_id` from the gathering, we would do the following:

```{r, purl=FALSE}
surveys_gw_long <- surveys_gw_wide %>%
  gather(genus, mean_weight, -plot_id)

head(surveys_gw_long)
```

Note that now the `NA` genera are included in the long format.
Going from wide to long to wide can be a useful way to balance out a dataset so every replicate has the same composition.

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it's easier to specify what to gather than what to leave alone. And if the columns are in a row, we don't even need to list them all out - just use the `:` operator!

```{r, purl=FALSE}
surveys_gw_wide %>%
  gather(genus, mean_weight, Baiomys:Spermophilus) %>%
  head
```

> ### Challenge
>
> 1. Make a wide data frame with `year` as columns, `plot_id` as rows, where the values are the number of genera per plot. You will need to summarize before reshaping, and use the function `n_distinct` to get the number of unique genera. It's a powerful function! See `?n_distinct` for more.
>
> 2. Now take that data frame, and make it long again, so each row is a unique `plot_id` `year` combination.
>
> 3. The `surveys` data set is not truly wide or long because there are two columns of measurement - `hindfoot_length` and `weight`. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let's walk through a common solution for this type of problem. First, use `gather` to create a truly long dataset where we have a key column called `measurement` and a `value` column that takes on the value of either `hindfoot_length` or `weight`. Hint: You'll need to specify which columns are being gathered.
>
> 4. With this new truly long data set, calculate the average of each `measurement` in each `year` for each different `plot_type`. Then `spread` them into a wide data set with a column for `hindfoot_length` and `weight`. Hint: Remember, you only need to specify the key and value columns for `spread`.
<!---
```{r, echo=FALSE, purl=FALSE}
## Answer 1
rich_time <- surveys %>%
  group_by(plot_id, year) %>%
  summarize(n_genera = n_distinct(genus)) %>%
  spread(year, n_genera)

head(rich_time)

## Answer 2
rich_time %>%
  gather(year, n_genera, -plot_id)

## Answer 3
surveys_long <- surveys %>%
  gather(measurement, value, hindfoot_length, weight)

## Answer 4
surveys_long %>%
  group_by(year, measurement, plot_type) %>%
  summarize(mean_value = mean(value, na.rm = TRUE)) %>%
  spread(measurement, mean_value)
```
--->

## Aug 17: Data visualization

------------

> ### Learning Objectives
>
> * Produce scatter plots, boxplots, and time series plots using ggplot.
> * Set universal plot settings.
> * Modify the aesthetics of an existing ggplot plot (including axis labels and color).
> * Build complex and customized plots from data in a data frame.

--------------

We start by loading the required packages. **`ggplot2`** is included in the **`tidyverse`** package.

```{r load-package, message=FALSE, purl=FALSE}
library(tidyverse)    # alternatively, you can just load the ggplot2 package
library(ggplot2)
```

If not still in the workspace, load the data we saved in the previous lesson.

```{r load-data, eval=FALSE, purl=FALSE}
surveys_complete <- read.csv('data_output/surveys_complete.csv')
```

## Plotting with **`ggplot2`**

**`ggplot2`** is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties, so we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal amounts of adjustment and tweaking.

ggplot likes data in the 'long' format: i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot.
ggplot graphics are built step by step by adding new elements.

To build a ggplot we need to:

- bind the plot to a specific data frame using the `data` argument

```{r, eval=FALSE, purl=FALSE}
ggplot(data = surveys_complete)
```

- define aesthetics (`aes`), by selecting the variables to be plotted and the variables that define the presentation such as plotting size, shape, color, etc.

```{r, eval=FALSE, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
```

- add `geoms` -- graphical representations of the data in the plot (points, lines, bars). To add a geom to the plot use the `+` operator

```{r first-ggplot, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point()
```

The `+` in the **`ggplot2`** package is particularly useful because it allows you to modify existing `ggplot` objects. This means you can easily set up plot "templates" and conveniently explore different types of plots, so the above plot can also be generated with code like this:

```{r, first-ggplot-with-plus, eval=FALSE, purl=FALSE}
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))

# Draw the plot
surveys_plot + geom_point()
```

```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Create a ggplot and draw it.
surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
surveys_plot + geom_point()
```

Notes:

- Anything you put in the `ggplot()` function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis you set up in `aes()`.
- You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the `ggplot()` function.
- The `+` sign used to add layers must be placed at the end of each line containing a layer.
If, instead, the `+` sign is added at the beginning of the line containing the new layer, **`ggplot2`** will not add the new layer and will return an error message.

```{r, ggplot-with-plus-position, eval=FALSE, purl=FALSE}
# this is the correct syntax for adding layers
surveys_plot +
  geom_point()

# this will not add the new layer and will return an error message
surveys_plot
  + geom_point()
```

## Building your plots iteratively

Building plots with ggplot is typically an iterative process. We start by defining the dataset we'll use, lay out the axes, and choose a geom:

```{r create-ggplot-object, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point()
```

Then, we start modifying this plot to extract more information from it. For instance, we can add transparency (`alpha`) to avoid overplotting:

```{r adding-transparency, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.1)
```

We can also add a color for all the points:

```{r adding-colors, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.1, color = "blue")
```

Or color each species in the plot differently:

```{r color-by-species, purl=FALSE}
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.1, aes(color = species_id))
```

> ### Challenge
>
> Use what you just learned to create a scatter plot of `weight` over `species_id` with the plot types showing in different colors. Is this a good way to show this type of data?
## Boxplot

We can use boxplots to visualize the distribution of weight within each species:

```{r boxplot, purl=FALSE}
ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
  geom_boxplot()
```

By adding points to the boxplot, we can have a better idea of the number of measurements and of their distribution:

```{r boxplot-with-points, purl=FALSE}
ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(alpha = 0.3, color = "tomato")
```

Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it's not hidden?

> ### Challenges
>
> Boxplots are useful summaries, but hide the *shape* of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn.
>
> - Replace the box plot with a violin plot; see `geom_violin()`.
>
> In many types of data, it is important to consider the *scale* of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making these modifications:
>
> - Represent weight on the log10 scale; see `scale_y_log10()`.
>
> So far, we've looked at the distribution of weight within species. Try making a new plot to explore the distribution of another variable within each species.
>
> - Create a boxplot for `hindfoot_length`. Overlay the boxplot layer on a jitter layer to show actual measurements.
>
> - Add color to the data points on your boxplot according to the plot from which the sample was taken (`plot_id`).
>
> Hint: Check the class for `plot_id`. Consider changing the class of `plot_id` from integer to factor.
Why does this change how R makes the graph?

## Plotting time series data

Let's calculate the number of counts per year for each species. First we need to group the data and count records within each group:

```{r, purl=FALSE}
yearly_counts <- surveys_complete %>%
  group_by(year, species_id) %>%
  tally
```

Timelapse data can be visualized as a line plot with years on the x axis and counts on the y axis:

```{r first-time-series, purl=FALSE}
ggplot(data = yearly_counts, aes(x = year, y = n)) +
  geom_line()
```

Unfortunately, this does not work because we plotted data for all the species together. We need to tell ggplot to draw a line for each species by modifying the aesthetic function to include `group = species_id`:

```{r time-series-by-species, purl=FALSE}
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id)) +
  geom_line()
```

We will be able to distinguish species in the plot if we add colors (using `color` also automatically groups the data):

```{r time-series-with-colors, purl=FALSE}
ggplot(data = yearly_counts, aes(x = year, y = n, color = species_id)) +
  geom_line()
```

## Faceting

ggplot has a special technique called *faceting* that allows the user to split one plot into multiple plots based on a factor included in the dataset. We will use it to make a time series plot for each species:

```{r first-facet, purl=FALSE}
ggplot(data = yearly_counts, aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(~ species_id)
```

Now we would like to split the line in each plot by the sex of each individual measured.
To do that we need to make counts in the data frame grouped by `year`, `species_id`, and `sex`:

```{r, purl=FALSE}
yearly_sex_counts <- surveys_complete %>%
  group_by(year, species_id, sex) %>%
  tally
```

We can now make the faceted plot by splitting further by sex using `color` (within a single plot):

```{r facet-by-species-and-sex, purl=FALSE}
ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(~ species_id)
```

Usually plots with a white background look more readable when printed. We can set the background to white using the function `theme_bw()`. Additionally, you can remove the grid:

```{r facet-by-species-and-sex-white-bg, purl=FALSE}
ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(~ species_id) +
  theme_bw() +
  theme(panel.grid = element_blank())
```

## **`ggplot2`** themes

In addition to `theme_bw()`, **`ggplot2`** comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at <http://docs.ggplot2.org/current/ggtheme.html>. `theme_minimal()` and `theme_light()` are popular, and `theme_void()` can be useful as a starting point to create a new hand-crafted theme.

The [ggthemes](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html) package provides a wide variety of options (including an Excel 2003 theme). The [**`ggplot2`** extensions website](https://www.ggplot2-exts.org) provides a list of packages that extend the capabilities of **`ggplot2`**, including additional themes.

> ### Challenge
> Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
<!-- Answer
```{r average-weight-time-series, purl=FALSE}
yearly_weight <- surveys_complete %>%
  group_by(year, species_id) %>%
  summarize(avg_weight = mean(weight))

ggplot(data = yearly_weight, aes(x = year, y = avg_weight)) +
  geom_line() +
  facet_wrap(~ species_id) +
  theme_bw()
```
-->

The `facet_wrap` function extracts plots into an arbitrary number of dimensions to allow them to cleanly fit on one page. On the other hand, `facet_grid` allows you to explicitly specify how you want your plots to be arranged via formula notation (`rows ~ columns`; a `.` can be used as a placeholder that indicates only one row or column). Let's modify the previous plot to compare how the weights of males and females have changed through time:

```{r average-weight-time-facet-sex-rows, purl=FALSE}
# One column, facet by rows
yearly_sex_weight <- surveys_complete %>%
  group_by(year, sex, species_id) %>%
  summarize(avg_weight = mean(weight))

ggplot(data = yearly_sex_weight, aes(x = year, y = avg_weight, color = species_id)) +
  geom_line() +
  facet_grid(sex ~ .)
```

```{r average-weight-time-facet-sex-columns, purl=FALSE}
# One row, facet by column
ggplot(data = yearly_sex_weight, aes(x = year, y = avg_weight, color = species_id)) +
  geom_line() +
  facet_grid(. ~ sex)
```

> ### Challenge
> With all of this information in hand, please take a few minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio [**`ggplot2`** cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for inspiration.
> Here are some ideas:
> * See if you can change the thickness of the lines.
> * Can you find a way to change the name of the legend? What about its labels?
> * Try using a different color palette (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).

After creating your plot, you can save it to a file in your favorite format.
You can easily change the dimensions (and resolution) of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`):

```{r ggsave-example, eval=FALSE, purl=FALSE}
my_plot <- ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(~ species_id) +
  labs(title = 'Observed species in time',
       x = 'Year of observation',
       y = 'Number of species') +
  theme_bw() +
  theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = .5, vjust = .5),
        axis.text.y = element_text(colour = "grey20", size = 12),
        text = element_text(size = 16))

ggsave("name_of_file.png", my_plot, width = 15, height = 10)
```

Note: The parameters `width` and `height` also determine the font size in the saved plot.

## Go forth and make more beautiful plots from well-formatted data files!