# Data management workshop (BSURE)

---

Three things to remember:

* You're all adults; this is your workshop.
* But also, this is a safe space for ignorance and diversity.
* Anyone can learn this stuff, and will be great at it if they try.

## July 25: Spreadsheets

## 1. Good Data Entry Practices

### Theory/General Background

Alternatives: Excel, Google Sheets, Gnumeric, LibreOffice, OpenOffice

How we have used spreadsheets (mostly Excel, some Google Docs):

- graphing
- data management/accumulation
- pivot tables
- statistical tests (formulas or add-ons)

We aren't going to focus on spreadsheets here because:

- it's hard to track actions/"procedure": lots of steps, many possible approaches
- you have direct access to the raw data, so it's possible to inadvertently change or lose data
- they can be fairly time-intensive and memorization-heavy
- they *are* good for data entry and direct organization

### Learning Objectives

* Describe best practices for data entry and formatting in spreadsheets.
* Apply best practices to arrange variables and observations in a spreadsheet.

Issues with spreadsheets:

- inconsistent organization (e.g. by species, plot)
- letters in data cells mess up calculations, formulas, and graphs
- computers don't have "context"! They can't understand things the way people can
- different date formats (also a missing year in some cases)
- variable locations differ, so plotting options must be redefined for each data set
- empty spaces are harder to interpret, and different programs handle them differently
- keep units in headers/titles, rather than mixing them with data
- don't combine independent variables

* Think about how the data is going to end up being used: are you the only one using it? Will other people have to read the data? What program(s) will you want to analyze the data with?
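The consequences of blank cells and ad-hoc missing-value codes show up as soon as the data leave the spreadsheet. As a preview of the R session later in this workshop, here is a minimal sketch (the tiny inline dataset is invented for illustration):

```r
# Hypothetical two-column dataset with a blank cell, an "NA", and an
# ad-hoc "-" all standing in for missing data.
raw <- "plot,weight\n1,10\n2,\n3,NA\n4,-\n"

# With near-default settings, the "-" forces the whole weight column to text
# (stringsAsFactors = FALSE just keeps it as plain character strings):
bad <- read.csv(text = raw, stringsAsFactors = FALSE)
class(bad$weight)   # "character" -- no more arithmetic on weights

# Telling R explicitly which codes mean 'missing' keeps the column numeric:
good <- read.csv(text = raw, na.strings = c("", "NA", "-"))
class(good$weight)  # integer, with NA wherever data are missing
```

This is why a single, consistent missing-value code (and no stray text in numeric columns) saves work downstream.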
*General notes*

- It's a mistake to approach a spreadsheet like a lab notebook: things like visual cues and verbal descriptions won't give the computer information. You need to respect differences between data types: point, integer, text, etc.
- Anticipate potential uses of the data; try to set up a good framework to minimize the work you'll need to do down the road.
- Use a directory (a set of folders), e.g. a folder for raw data, a folder for reformatted data, a folder for statistical analysis, one for reports, etc.

**Recommended way to organize a spreadsheet**

* variables as columns
* observations as rows
* keep separate variables in different columns
* for missing data, use e.g. "NA", "na", "-", "*", etc. rather than a blank space
* avoid using 0 for 'no data', as it has a numeric value

> **Note:** the best layouts/formats (as well as software and
> interfaces) for **data entry** and **data analysis** might be
> different. It is important to take this into account, and ideally
> automate the conversion from one to another.

### Raw data

When you're working with spreadsheets, during data clean-up or analyses, it's very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses, or figure out what you did when Reviewer #3 asks for a different analysis, **you must**:

- **create a new file or tab with your cleaned or analyzed data.** Do not modify the original dataset, or you will never know where you started!
- **keep track of the steps you took in your clean-up or analysis.** You should track these steps as you would any step in an experiment. You can do this in another text file, or a good option is to create a new tab in your spreadsheet with your notes. This way the notes and data stay together.

## Exercise

We're going to take that messy version of the survey data and clean it up.

- If you don't already have it, download the data by clicking [here](https://ndownloader.figshare.com/files/2252083) to get it from FigShare.
- Open up the data in a spreadsheet program. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way. Now you're the person in charge of this project, and you want to be able to start doing statistics with the data.
- With the person next to you, work on the messy data so that a computer will be able to understand it. Clean up the 2013 and 2014 tabs, and put them all together in one spreadsheet.

> **Important** Do not forget our first piece of advice:
> **create a new file (or tab)** for the cleaned data, and **never
> modify the original (raw) data**.
>
> **Also**, did you keep track of what you did? How?

After you go through this exercise, we'll discuss as a group what you think was wrong with this data and how you fixed it.

## 2. Dates as Data

### Learning Objectives

* Describe how dates are stored and formatted in spreadsheets.
* Describe the advantages of alternative date formatting in spreadsheets.
* Demonstrate best practices for entering dates in spreadsheets.

>* In Excel, the visual display of date data doesn't match its internal storage.
>* Dates are stored as an integer: the number of days since an arbitrary "start" date.
>>* The start date depends on the program; it might be e.g. Jan 1, 1904 or Jan 1, 1900; Google Sheets uses 12/30/1899.
>>* The issue is that different programs will report the same stored integer as different dates, depending on their "start" value.
>>* Also, entering just a month/day will lead the computer to assume the year is the current year.
>>* However, since the date is stored as an integer, you can carry out math operations with date cells in Excel (e.g. = __ + 40).
>* Alternative approaches:
>>* Separate out e.g. month/day/year into different columns.
>>* Enter the date as a consistent integer pattern (e.g.
"20160912" for 09/12/16)
>>>* This is easy for you to read, and there are ways for programs to interpret it.
>>>* **Excel**: an automated method: apply "*yyyymmdd*" as a custom cell format, then copy/paste as numbers only; you could also write a script if you know how to do this.
>* Time can be added as a variable in either format.

## 3. Exporting data from spreadsheets

### Learning Objectives

* Store spreadsheet data in universal file formats.
* Export data from a spreadsheet to a .csv file.

>* In general, Excel and the like are best for data entry and proximal organization.
>* Leaving data in Excel long-term makes it vulnerable to updates, backwards-compatibility issues, etc.:
>> - Because it is a **proprietary format**, it is possible that in the future the technology to open the file won't exist (or will become sufficiently rare as to make it inconvenient, if not impossible, to open the file).
>> - **Other spreadsheet software** may not be able to open files saved in a proprietary Excel format.
>> - **Different versions of Excel** may handle data differently, leading to inconsistencies.
>>* It can also conflict with data management requirements for journals, etc.
>* Qualities of an ideal storage medium:
>>* **Universal**: accessible to many/all programs
>>* **Static**: not liable to updates, obsolescence, etc.
>>* **Open**: minimal limitations on accessibility (e.g. software licenses, operating systems)
>* Text (ASCII data) is the most universal, static, and open way to handle information.
>>* Columns are indicated by tabs/commas; observations are separated by rows.
>>* Column titles (a.k.a.
'headers') are in the top row.
>>>* Avoid any unnecessary spaces or tabs in raw data (including in headers); they can confuse reading (human or computer) of the data.
>>>>* Keep only related observations on a sheet; rows not sharing the same format or level will still be interpreted as related data.
>>>* The .csv (or .tsv) format can only handle one Excel 'tab' or sheet at a time, so avoid making 'sheets' in Excel (or equivalent) significant (e.g. a tab for each experimental site).
>>> **General rule:** if the observations are related and will be analyzed together, put them in the same data file.
>>> Question: how to handle literal commas in data?
>>>>* As a general rule, avoid extraneous commas, but you can use single or double quotes to separate them out (they will be treated as a 'text string'); Excel is usually intelligent enough to catch these automatically, but you shouldn't count on it.

## For next week:

1. Read Hadley Wickham, *Tidy Data*, Journal of Statistical Software, Vol. 59, Issue 10, Sep 2014. [http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10).
2. Download and install R and RStudio [with these instructions.](http://www.datacarpentry.org/R-ecology-lesson/#setup_instructions)

---

## Aug 2: Introduction to R

## What is R and why is it useful? Why use RStudio?

> ### Learning Objectives
>
> * Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.
> * Organize files and directories for a set of analyses as an R Project, and understand the purpose of the working directory.
> * Use the built-in RStudio help interface to search for more information on R functions.
> * Demonstrate how to provide sufficient information for troubleshooting with the R user community.

- The term "R" is used to refer to both the programming language and the software that interprets the scripts written in it.
- RStudio is currently a very popular way not only to write your R scripts but also to interact with the R software.
To function correctly, RStudio needs R, and therefore both need to be installed on your computer.

### Why learn R?

#### R does not involve lots of pointing and clicking, and that's a good thing

The learning curve might be steeper than with other software, but with R the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that's a good thing! So, if you want to redo your analysis because you collected more data, you don't have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

Working with scripts makes the steps of your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes. Working with scripts also forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.

#### R code is great for reproducibility

Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

#### R is interdisciplinary and extensible

With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
#### R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won't make much difference to you.

R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.

R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

#### R produces high-quality graphics

The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.

#### R has a large community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/).

#### Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

### Knowing your way around RStudio

Let's start by learning about [RStudio](https://www.rstudio.com/), which is an Integrated Development Environment (IDE) for working with R. We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio is divided into 4 "panes": the **Source** for your scripts and documents (top-left, in the default layout), the R **Console** (bottom-left), your **Environment/History** (top-right), and your **Files/Plots/Packages/Help/Viewer** (bottom-right). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

### Getting set up

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the **working directory**. All of the scripts within this folder can then use *relative paths* to files, which indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.

RStudio provides a helpful set of tools to do this through its "Projects" interface, which not only creates a working directory for you but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Below, we will go through the steps for creating an "R Project" for this tutorial.

* Start RStudio (the presentation of RStudio, below, should happen here).
* Under the File menu, click on New project; choose Existing directory and select the folder already containing your project, or New directory and create one.
* Click on Create project.

#### Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for **scripts**, **data**, and **documents**.

- **data/** Use this folder to store your raw data and intermediate datasets you may create for the needs of a particular analysis.
For the sake of transparency and [provenance](https://en.wikipedia.org/wiki/Provenance), you should *always* keep a copy of your raw data accessible, and do as much of your data clean-up and preprocessing programmatically (i.e., with scripts rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files data/raw/tree_survey.plot1.txt and ...plot2.txt kept separate from a data/processed/tree.survey.csv file generated by the scripts/01.preprocess.tree_survey.R script.
- **documents/** This would be a place to keep outlines, drafts, and other text.
- **scripts/** This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).

You may want additional directories or subdirectories depending on your project's needs, but these should form the backbone of your working directory. For this workshop, we will need a data/ folder to store our raw data, and we will later create a data_output/ folder when we learn how to export data as CSV files.

### Interacting with R

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or *code*, instructions in R because it is a common language that both the computer and we can understand. We call the instructions *commands*, and we tell the computer to follow the instructions by *executing* (also called *running*) those commands.

There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom-left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed.
You can type commands directly into the console and press Enter to execute them, but they will be forgotten when you close the session. Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.

RStudio allows you to execute commands directly from the script editor by using the <kbd>Ctrl</kbd> + <kbd>Enter</kbd> shortcut (on Macs, <kbd>Cmd</kbd> + <kbd>Return</kbd> will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>. You can find other keyboard shortcuts in this [RStudio cheatsheet about the RStudio IDE](https://github.com/rstudio/cheatsheets/blob/master/source/pdfs/rstudio-IDE-cheatsheet.pdf).

At some point in your analysis you may want to check the content of a variable or the structure of an object without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the <kbd>Ctrl</kbd> + <kbd>1</kbd> and <kbd>Ctrl</kbd> + <kbd>2</kbd> shortcuts, which allow you to jump between the script and the console panes.

If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting, or sending it from the script editor using <kbd>Ctrl</kbd> + <kbd>Enter</kbd>), R will try to execute it and, when ready, will show the results and come back with a new > prompt to wait for new commands.

If R is still waiting for you to enter more data because the command isn't complete yet, the console will show a + prompt. It means that you haven't finished entering a complete command. This is because you have not 'closed' a parenthesis or quotation, i.e.
you don't have the same number of left parentheses as right parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you had finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.

### Seeking help

#### Use the built-in RStudio help interface to search for more information on R functions

One of the fastest ways to get help is to use the RStudio help interface. By default, this panel can be found in the lower right-hand corner of RStudio. As seen in the screenshot, by typing the word "Mean", RStudio also tries to give a number of suggestions that you might be interested in. The description is then shown in the display window.

#### I know the name of the function I want to use, but I'm not sure how to use it

If you need help with a specific function, let's say barplot(), you can type:

?barplot

If you just need to remind yourself of the names of the arguments, you can use:

args(lm)

#### I want to use a function that does X; there must be a function for it but I don't know which one...

If you are looking for a function to do a particular task, you can use the help.search() function, which is called by the double question mark ??. However, this only looks through the installed packages for help pages with a match to your search request. If you can't find what you are looking for, you can use the [rdocumentation.org](http://www.rdocumentation.org) website, which searches through the help files across all packages available.

Finally, a generic Google or internet search "R <task\>" will often either send you to the appropriate package documentation or to a helpful forum where someone else has already asked your question.

#### I am stuck... I get an error message that I don't understand

Start by googling the error message. However, this doesn't always work very well, because often package developers rely on the error catching provided by R.
You end up with general error messages that might not be very helpful for diagnosing a problem (e.g. "subscript out of bounds"). If the message is very generic, you might also include the name of the function or package you're using in your query.

You should also check Stack Overflow. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: [http://stackoverflow.com/questions/tagged/r](http://stackoverflow.com/questions/tagged/r)

The [Introduction to R](http://cran.r-project.org/doc/manuals/R-intro.pdf) can be dense for people with little programming experience, but it is a good place to understand the underpinnings of the R language. The [R FAQ](http://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical, but it is full of useful information.

#### Asking for help

The key to receiving help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.

Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

If possible, try to reduce what doesn't work to a simple *reproducible example*. If you can reproduce the problem using a very small data frame instead of your 50,000-row and 10,000-column one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so that even people who are not in your field can understand the question. For instance, instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one.
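The advice above can be made concrete with a minimal sketch of such a generic dataset (the column names below are invented for illustration, not from any real survey):

```r
# A small, generic data frame (3 columns, 5 rows) that stands in for a
# large real dataset when asking for help.
toy <- data.frame(
  site   = c("A", "A", "B", "B", "C"),
  year   = c(2013, 2013, 2014, 2014, 2014),
  weight = c(10.2, NA, 8.7, 9.9, 11.3)
)

# Reproduce the problem on the toy data, e.g. a calculation tripped up
# by the missing value:
mean(toy$weight)               # NA, because of the missing value
mean(toy$weight, na.rm = TRUE) # 10.025 once NAs are removed
```

A helper can copy-paste this, run it in seconds, and see exactly what you see, which is the whole point of a reproducible example.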
For more information on how to write a reproducible example, see [this article by Hadley Wickham](http://adv-r.had.co.nz/Reproducibility.html).

To share an object with someone else, if it's relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:

dput(head(iris)) # iris is an example data frame that comes with R, and head() is a function that returns the first part of the data frame

If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file with saveRDS(). The content of this file is not human-readable, however, and cannot be posted directly on Stack Overflow. Instead, it can be emailed to someone who can read it with the readRDS() command (here it is assumed that the downloaded file is in a Downloads folder in the user's home directory):

some_data <- readRDS(file = "~/Downloads/iris.rds")

Last, but certainly not least, **always include the output of sessionInfo()**, as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful for understanding your problem.

#### Where to ask for help?

* Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
* [Stack Overflow](http://stackoverflow.com/questions/tagged/r): if your question hasn't been answered before and is well crafted, chances are you will get an answer in less than 5 minutes. Remember to follow their guidelines on [how to ask a good question](http://stackoverflow.com/help/how-to-ask).
* The [R-help mailing list](https://stat.ethz.ch/mailman/listinfo/r-help): it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect that it will come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
* If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package, which can be accessed using packageDescription("name-of-package"). You may also want to try emailing the author of the package directly, or open an issue on the code repository (e.g., GitHub).
* There are also some topic-specific mailing lists (GIS, phylogenetics, etc.); the complete list is [here](http://www.r-project.org/mail.html).

#### More resources

* The [Posting Guide](http://www.r-project.org/posting-guide.html) for the R mailing lists.
* [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) has useful guidelines.
* [This blog post by Jon Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) has quite comprehensive advice on how to ask programming questions.
* The [reprex](https://cran.rstudio.com/web/packages/reprex/) package is very helpful for creating reproducible examples when asking for help. The [rOpenSci community call "How to ask questions so they get answered"](https://ropensci.org/blog/blog/2017/02/17/comm-call-v13) ([GitHub link](https://github.com/ropensci/commcalls/issues/14) and [video recording](https://vimeo.com/208749032)) includes a presentation of the reprex package and its philosophy.
## R Basics

> ### Learning Objectives
>
> * Define the following terms as they relate to R: object, assign, call, function, arguments, options.
> * Create objects and assign values to them.
> * Use comments to inform scripts.
> * Do simple arithmetic operations in R using values and objects.
> * Call functions and use arguments to change their default options.
> * Inspect the content of vectors and manipulate their content.
> * Subset and extract values from vectors.
> * Correctly define and handle missing values in vectors.

You can get output from R simply by typing math in the console (for example, 3 + 5 or 12 / 7). However, to do useful and interesting things, we need to assign _values_ to _objects_. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it.

<- is the assignment operator. It assigns values on the right to objects on the left. So, after executing x <- 3, the value of x is 3. The arrow can be read as 3 **goes into** x. For historical reasons, you can also use = for assignments, but not in every context. Because of the [slight](http://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) [differences](https://web.archive.org/web/20130610005305/https://stat.ethz.ch/pipermail/r-help/2009-March/191462.html) in syntax, it is good practice to always use <- for assignments. In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> at the same time as the <kbd>-</kbd> key) will write <- in a single keystroke.

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid, but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg).
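These naming rules can be seen in action (the object names below are arbitrary examples):

```r
weight_kg <- 55   # valid: letters, digits, and underscores are fine
x2 <- 10          # valid: a digit is allowed, just not as the first character
# 2x <- 10        # invalid: names cannot start with a number (syntax error)
Weight_kg <- 100  # a *different* object from weight_kg: R is case sensitive
weight_kg         # 55
Weight_kg         # 100
```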
There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a complete list). In general, even if it's allowed, it's best not to use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use. It's also best to avoid dots (.) within a variable name, as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and in other programming languages, it's best to avoid them. It is also recommended to use nouns for variable names and verbs for function names.

It's important to be consistent in the styling of your code (where you put spaces, how you name variables, etc.). Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides are [Google's](https://google.github.io/styleguide/Rguide.xml), [Jean Fan's](http://jef.works/R-style-guide/), and the [tidyverse's](http://style.tidyverse.org/). The tidyverse's is very comprehensive and may seem overwhelming at first. You can install the [**lintr**](https://github.com/jimhester/lintr) package to automatically check for issues in the styling of your code.

When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

```{r, purl=FALSE}
weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of weight_kg
weight_kg          # and so does typing the name of the object
```

Now that R has weight_kg in memory, we can do arithmetic with it.
For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

```{r, purl=FALSE}
2.2 * weight_kg
```

We can also change a variable's value by assigning it a new one:

```{r, purl=FALSE}
weight_kg <- 57.5
2.2 * weight_kg
```

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a new variable, weight_lb:

```{r, purl=FALSE}
weight_lb <- 2.2 * weight_kg
```

and then change weight_kg to 100.

```{r, purl=FALSE}
weight_kg <- 100
```

What do you think is the current content of the object weight_lb? 126.5 or 220?

#### Comments

The comment character in R is #; anything to the right of a # in a script will be ignored by R. It is useful for leaving notes and explanations in your scripts. RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> on your keyboard. If you only want to comment out one line, you can put the cursor at any location on that line (i.e. no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>.

> ### Challenge
>
> What are the values after each statement in the following?
>
> ```{r, purl=FALSE}
> mass <- 47.5            # mass?
> age  <- 122             # age?
> mass <- mass * 2.0      # mass?
> age  <- age - 20        # age?
> mass_index <- mass/age  # mass_index?
> ```

>> Successive assignments relative to existing values are possible (*i.e.* a recursive definition): for <- and =, R evaluates the right-hand side first, then assigns the result to the object.

#### Functions and their arguments

Functions are "canned scripts" that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R *packages* (more on that later). A function usually gets one or more inputs called *arguments*. Functions often (but not always) return a *value*.
A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function ('running it') is called *calling* the function. An example of a function call is: {r, eval=FALSE, purl=FALSE} b <- sqrt(a)  Here, the value of a is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to variable b. This function is very simple, because it takes just one argument. The return 'value' of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We'll see that when we read data files into R. Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a *default* value: these are called *options*. Options are typically used to alter the way the function operates, such as whether it ignores 'bad values', or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default. Let's try a function that can take multiple arguments: round(). {r, results='show', purl=FALSE} round(3.14159)  Here, we've called round() with just one argument, 3.14159, and it has returned the value 3. That's because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) or look at the help for this function using ?round. {r, results='show', purl=FALSE} args(round)  {r, eval=FALSE, purl=FALSE} ?round  We see that if we want a different number of digits, we can type digits=2 or however many we want. 
{r, results='show', purl=FALSE} round(3.14159, digits = 2)  If you provide the arguments in the exact same order as they are defined, you don't have to name them: {r, results='show', purl=FALSE} round(3.14159, 2)  And if you do name the arguments, you can switch their order: {r, results='show', purl=FALSE} round(digits = 2, x = 3.14159)  It's good practice to put the non-optional arguments (like the number you're rounding) first in your function call, and to specify the names of all optional arguments. If you don't, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you're doing. #### Objects vs. variables What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects ### Vectors and data types A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g: {r, purl=FALSE} weight_g <- c(50, 60, 65, 82) weight_g  A vector can also contain characters: {r, purl=FALSE} animals <- c("mouse", "rat", "dog") animals  The quotes around "mouse", "rat", etc. are essential here. Without the quotes, R will assume there are objects called mouse, rat and dog. As these objects don't exist in R's memory, there will be an error message. There are many functions that allow you to inspect the content of a vector.
length() tells you how many elements are in a particular vector: {r, purl=FALSE} length(weight_g) length(animals)  An important feature of a vector is that all of the elements are the same type of data. The function class() indicates the class (the type of element) of an object: {r, purl=FALSE} class(weight_g) class(animals)  The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects: {r, purl=FALSE} str(weight_g) str(animals)  You can use the c() function to add other elements to your vector: {r, purl=FALSE} weight_g <- c(weight_g, 90) # add to the end of the vector weight_g <- c(30, weight_g) # add to the beginning of the vector weight_g  In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g. We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful for adding results that we are collecting or calculating. We just saw 2 of the 6 main **atomic vector** types (or **data types**) that R uses: "character" and "numeric". These are the basic building blocks that all R objects are built from. The other 4 are: * "logical" for TRUE and FALSE (the boolean data type) * "integer" for integer numbers (e.g., 2L, the L indicates to R that it's an integer) * "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that's all we're going to say about them * "raw" that we won't discuss further Vectors are one of the many **data structures** that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array). > ### Challenge > > * We’ve seen that atomic vectors can be of type character, numeric, integer, and logical. But what happens if we try to mix these types in a single vector?
> > * What will happen in each of these examples? (hint: use class() to check the data type of your objects): > > r > num_char <- c(1, 2, 3, 'a') > num_logical <- c(1, 2, 3, TRUE) > char_logical <- c('a', 'b', 'c', TRUE) > tricky <- c(1, 2, 3, '4') >  > > * Why do you think it happens? > > * You've probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class _coercion_. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced? ### Subsetting vectors If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance: {r, results='show', purl=FALSE} animals <- c("mouse", "rat", "dog", "cat") animals[2] animals[c(3, 2)]  We can also repeat the indices to create an object with more elements than the original one: {r, results='show', purl=FALSE} more_animals <- animals[c(1, 2, 3, 2, 1, 4)] more_animals  R indices start at 1, because that's what human beings typically do. Other programming languages (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. #### Conditional subsetting Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not: {r, results='show', purl=FALSE} weight_g <- c(21, 34, 39, 54, 55) weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]  Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. 
For instance, if you wanted to select only the values above 50: {r, results='show', purl=FALSE} weight_g > 50 # will return logicals with TRUE for the indices that meet the condition ## so we can use this to select only the values above 50 weight_g[weight_g > 50]  You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR): {r, results='show', purl=FALSE} weight_g[weight_g < 30 | weight_g > 50] weight_g[weight_g >= 30 & weight_g == 21]  Here, < stands for "less than", > for "greater than", >= for "greater than or equal to", and == for "equal to". The double equal sign == is a test for numerical equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-). A common task is to search for certain strings in a vector. One could use the "or" operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test if any of the elements of a search vector are found: {r, results='show', purl=FALSE} animals <- c("mouse", "rat", "dog", "cat") animals[animals == "cat" | animals == "rat"] # returns both rat and cat animals %in% c("rat", "cat", "dog", "duck", "goat") animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]  > ### Challenge > > * Can you figure out why "four" > "five" returns TRUE? <!-- {r, purl=FALSE} ## Answers ## * When using ">" or "<" on strings, R compares their alphabetical order. Here ## "four" comes after "five", and therefore is "greater than" it.  --> ### Missing data As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA. When doing operations on numbers, most functions will return NA if the data you are working with include missing values. 
This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm=TRUE to calculate the result while ignoring the missing values. {r, purl=FALSE} heights <- c(2, 4, 4, NA, 6) mean(heights) max(heights) mean(heights, na.rm = TRUE) max(heights, na.rm = TRUE)  If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples. {r, purl=FALSE} ## Extract those elements which are not missing values. heights[!is.na(heights)] ## Returns the object with incomplete cases removed. ## The returned object is atomic. na.omit(heights) ## Extract those elements which are complete cases. heights[complete.cases(heights)]  > ### Challenge > > 1. Using this vector of length measurements, create a new vector with the NAs removed. > > r > lengths <- c(10,24,NA,18,NA,20) >  > > 2. Use the function median() to calculate the median of the lengths vector. Now that we have learned how to write scripts, and the basics of R's data structures, we are ready to start working with the Portal dataset we have been using in the other lessons, and learn about data frames. ## Starting with data ### Presentation of the Survey Data We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent: | Column | Description | |------------------|------------------------------------| | record\_id | Unique id for the observation | | month | month of observation | | day | day of observation | | year | year of observation | | plot\_id | ID of a particular plot | | species\_id | 2-letter code | | sex | sex of animal ("M", "F") | | hindfoot\_length | length of the hindfoot in mm | | weight | weight of the animal in grams | | genus | genus of animal | | species | species of animal | | taxa | e.g. 
Rodent, Reptile, Bird, Rabbit | | plot\_type | type of plot | We are going to use the R function download.file() to download the CSV file that contains the survey data from figshare, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame. To download the data into the data/ subdirectory (or whatever directory you would like to put the data in), run the following (replacing data/ with the name of the directory you choose, or nothing if you wish to download it into your working directory): {r, eval=FALSE, purl=TRUE} download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv")  You are now ready to load the data (again, if you have a different name for your data directory, or don't have one, change data/ to the correct name or delete): {r, eval=TRUE, purl=FALSE} surveys <- read.csv('data/portal_data_joined.csv')  This statement doesn't produce any output because, as you might recall, assignments don't display anything. If we want to check that our data has been loaded, we can print the variable's value: surveys. Wow... that was a lot of output. At least it means the data loaded properly. An easier check is to look at the top (the first 6 lines) of this data frame using the function head(). ### For next week: Load your data (or the data we just downloaded) into R's memory as an object. 1. What is the class and structure of the object containing this data? 2. Use the function head() to look at the first few lines of the data, and compare against the table above (or your knowledge of your own data). Do the data look as you expect them to? ## Aug 9: Working with dataframes ## Starting with data, continued > ### Learning Objectives > > * Describe what a data frame is. > * Load external data from a .csv file into a data frame in R. > * Summarize the contents of a data frame in R. > * Manipulate categorical data in R. > * Change how character strings are handled in a data frame. 
> * Format dates in R If you haven't already, load the data (remember that the file path and/or name might be different for you): {r, eval=TRUE, purl=FALSE} surveys <- read.csv('data/portal_data_joined.csv')  ### What are data frames? Data frames are the _de facto_ data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web). A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors). We can see this when inspecting the **str**ucture of a data frame with the function str(). ### Inspecting data.frame Objects We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let's try them out! * Size: * dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the **dim**ensions of the object) * nrow(surveys) - returns the number of rows * ncol(surveys) - returns the number of columns * Content: * head(surveys) - shows the first 6 rows * tail(surveys) - shows the last 6 rows * Names: * names(surveys) - returns the column names (synonym of colnames() for data.frame objects) * rownames(surveys) - returns the row names * Summary: * str(surveys) - structure of the object and information about the class, length and content of each column * summary(surveys) - summary statistics for each column Note: most of these functions are "generic", they can be used on other types of objects besides data.frame.
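As a quick sketch of a few of these inspectors in action, here is a tiny hand-made data frame standing in for surveys (the values and column names are just illustrative, so this runs without the download):

```r
# A tiny, hypothetical stand-in for the surveys data frame
df <- data.frame(species_id = c("NL", "DM", "DM"),
                 weight     = c(40, 48, 52))

dim(df)    # rows then columns: 3 2
nrow(df)   # 3
ncol(df)   # 2
names(df)  # "species_id" "weight"
```

The same calls work unchanged on the full surveys object once it is loaded.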
> ### Challenge > > Based on the output of str(surveys), can you answer the following questions? > > * What is the class of the object surveys? > * How many rows and how many columns are in this object? > * How many species have been recorded during these surveys? <!--- {r, echo=FALSE, purl=FALSE} ## Answers ## * class: data frame ## * how many rows: 34786, how many columns: 13 ## * how many species: 48  ---> ### Indexing and subsetting data frames Our survey data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the "coordinates" we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes. {r, purl=FALSE} surveys[1, 1] # first element in the first column of the data frame (as a vector) surveys[1, 6] # first element in the 6th column (as a vector) surveys[, 1] # first column in the data frame (as a vector) surveys[1] # first column in the data frame (as a data.frame) surveys[1:3, 7] # first three elements in the 7th column (as a vector) surveys[3, ] # the 3rd element for all columns (as a data.frame) head_surveys <- surveys[1:6, ] # equivalent to head(surveys)  : is a special function that creates numeric vectors of integers in increasing or decreasing order; test 1:10 and 10:1, for instance.
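As a small sketch of the : operator and how it feeds into subsetting (using R's built-in letters vector rather than the surveys data, so it runs anywhere):

```r
1:5           # 1 2 3 4 5 (increasing)
5:1           # 5 4 3 2 1 (decreasing)

# The resulting integer vector can be used directly as indices:
letters[1:3]  # "a" "b" "c" (letters is built into R)
```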
You can also exclude certain parts of a data frame using the "-" sign: {r, purl=FALSE} surveys[,-1] # The whole data frame, except the first column surveys[-c(7:34786),] # Equivalent to head(surveys)  As well as using numeric values to subset a data.frame (or matrix), columns can be called by name, using one of the four following notations: {r, eval = FALSE, purl=FALSE} surveys["species_id"] # Result is a data.frame surveys[, "species_id"] # Result is a vector surveys[["species_id"]] # Result is a vector surveys$species_id # Result is a vector  For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name. > ### Challenge > > 1. Create a data.frame (surveys_200) containing only the observations from row 200 of the surveys dataset. > > 2. Notice how nrow() gave you the number of rows in a data.frame? > > * Use that number to pull out just that last row in the data frame. > * Compare that with what you see as the last row using tail() to make sure it's meeting expectations. > * Pull out that last row using nrow() instead of the row number. > * Create a new data frame object (surveys_last) from that last row. > > 3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle. > > 4. Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset. <!--- {r, purl=FALSE} ## Answers surveys_200 <- surveys[200, ] surveys_last <- surveys[nrow(surveys), ] surveys_middle <- surveys[nrow(surveys)/2, ] surveys_head <- surveys[-c(7:nrow(surveys)),]  ---> ### Factors When we did str(surveys) we saw that several of the columns consist of integers, however, the columns genus, species, sex, plot_type, ... are of a special class called a factor.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting. Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Once created, factors can only contain a pre-defined set of values, known as *levels*. By default, R always sorts *levels* in alphabetical order. For instance, if you have a factor with 2 levels: {r, purl=TRUE} sex <- factor(c("male", "female", "female", "male"))  R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels(): {r, purl=FALSE} levels(sex) nlevels(sex)  Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., "low", "medium", "high"), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be: {r, results=TRUE, purl=FALSE} sex # current order sex <- factor(sex, levels = c("male", "female")) sex # after re-ordering  In R's memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "female", "male" is more descriptive than 1, 2. Which one is "male"? You wouldn't be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset). #### Converting factors If you need to convert a factor to a character vector, you use as.character(x). 
{r, purl=FALSE} as.character(sex)  Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare: {r, purl=TRUE} f <- factor(c(1990, 1983, 1977, 1998, 1990)) as.numeric(f) # wrong! and there is no warning... as.numeric(as.character(f)) # works... as.numeric(levels(f))[f] # The recommended way.  Notice that in the levels() approach, three important steps occur: * We obtain all the factor levels using levels(f) * We convert these levels to numeric values using as.numeric(levels(f)) * We then access these numeric values using the underlying integers of the vector f inside the square brackets #### Renaming factors When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let's look at the number of males and females captured over the course of the experiment: {r, purl=TRUE} ## bar plot of the number of females and males captured during the experiment: plot(surveys$sex)  In addition to males and females, there are about 1700 individuals for which the sex information hasn't been recorded. Additionally, for these individuals, there is no label to indicate that the information is missing. Let's rename this label to something more meaningful. Before doing that, we're going to pull out the data on sex and work with that data, so we're not modifying the working copy of the data frame: {r, results=TRUE, purl=FALSE} sex <- surveys$sex head(sex) levels(sex) levels(sex)[1] <- "missing" levels(sex) head(sex)  > ### Challenge > > * Rename "F" and "M" to "female" and "male" respectively. > * Now that we have renamed the factor level to "missing", can you recreate the barplot such that "missing" is last (after "male")? 
> <!--- {r correct-order, purl=FALSE} ## Answers levels(sex)[2:3] <- c("female", "male") sex <- factor(sex, levels = c("female", "male", "missing")) plot(sex)  ---> #### Using stringsAsFactors=FALSE By default, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE. In many cases, it's preferable to set stringsAsFactors = FALSE when importing your data, and to convert to factors only the columns that require this data type. Compare the output of str(surveys) when setting stringsAsFactors = TRUE (default) and stringsAsFactors = FALSE (remembering to change the file path as needed): {r, eval=FALSE, purl=FALSE} ## Compare the difference between when the data are being read as ## factor, and when they are being read as character. surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = TRUE) str(surveys) surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = FALSE) str(surveys) ## Convert the column "plot_type" into a factor surveys$plot_type <- factor(surveys$plot_type)  The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance). ## Homework > ### Challenge > > 1. We have seen how data frames are created when using the read.csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame, can you spot and fix them? Don't hesitate to experiment!
> > {r, eval=FALSE, purl=FALSE} > animal_data <- data.frame(animal=c(dog, cat, sea cucumber, sea urchin), > feel=c("furry", "squishy", "spiny"), > weight=c(45, 8 1.1, 0.8)) >  > > 2. Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate): > * Are they what you expected? Why? Why not? > * What would have been different if we had added stringsAsFactors = FALSE to this call? > * What would you need to change to ensure that each column had the accurate data type? > > {r, eval=FALSE, purl=FALSE} > country_climate <- data.frame( > country=c("Canada", "Panama", "South Africa", "Australia"), > climate=c("cold", "hot", "temperate", "hot/temperate"), > temperature=c(10, 30, 18, "15"), > northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"), > has_kangaroo=c(FALSE, FALSE, FALSE, 1) > ) >  <!--- Answers > > {r, eval=FALSE, echo=FALSE, purl=FALSE} > ## Answers > ## * missing quotations around the names of the animals > ## * missing one entry in the "feel" column (probably for one of the furry animals) > ## * missing one comma in the weight column > > ## Answers > ## * country, climate, temperature, and northern_hemisphere are > ## factors; has_kangaroo is numeric. > ## * using stringsAsFactors=FALSE would have made them character instead of > ## factors > ## * removing the quotes in temperature, northern_hemisphere, and replacing 1 > ## by TRUE in the has_kangaroo column would probably be what was originally > ## intended. >  > > --> > ### Formatting Dates As we've already discussed, dates and times can be difficult to represent correctly when communicating with computers (and other people, too!). We've recommended some ways to store dates that can help with this, including some standard formats (e.g. YYYYMMDD, or storing each element of date in a separate variable). Here we're going to briefly introduce another way to deal with dates AND times in R using POSIX format.
For more information and other approaches, check out [this helpful guide](https://www.stat.berkeley.edu/~s133/dates.html), from which this section borrows heavily. POSIX stands for "portable operating system interface" and is used by many operating systems, including UNIX systems. Dates stored in the POSIX format are date/time values and allow modification of time zones. POSIX date classes store times to the nearest second, which can be useful if you have data at that scale. There are two POSIX date/time classes, which differ in the way that the values are stored internally. The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list with elements for second, minute, hour, day, month, and year, among others. Unless you need the list nature of the POSIXlt class, the POSIXct class is the usual choice for storing dates in R. The ggplot2 plotting package that we will use next week uses the POSIXct class. The 'surveys' dataset has a separate column for day, month, and year, and each contains integer values, as we can confirm with str(): {r, eval=FALSE, purl=FALSE} str(surveys)  We first need to make a character vector for our dates in the default input format for POSIX dates: the year, followed by the month and day, separated by slashes or dashes. For date/time values, the date may be followed by white space (e.g. space or tab) and a time in the form hour:minutes:seconds or hour:minutes, which then may be followed by white space and the time zone. 
Here are some examples of valid POSIX inputs: 1915/6/16 2005-06-24 11:25 1990/2/17 12:20:05 2012-7-31 12:20:05 MST {r, purl=FALSE} ## Create a date character vector ## Using '$' to add it to the surveys data frame ## Using the function paste which pastes values together in a character string ## The 'sep' argument indicates the character to use to separate each component surveys$date <- paste(surveys$year, surveys$month, surveys$day, sep="-") head(surveys$date) class(surveys$date) str(surveys) # notice the new 'date' column, with 'chr' as the class  The new variable date is character class. To convert it to POSIX format, you'll need to modify it using the as.POSIXct() function. {r, purl=FALSE} ## Format date as POSIX ## The 'tz' argument allows you to specify the timezone (more important if you actually have time data) ## "UTC" is GMT and "" indicates the current timezone ## The 'format' argument tells the function what order the date components are in surveys$date <- as.POSIXct(surveys$date, tz="UTC", format="%Y-%m-%d") class(surveys$date)  Great! That sets us up to start working with date values. ## Aug 16: Data manipulation with dplyr and tidyr > ### Learning Objectives > > * Understand what an R package is and how to install them > * Understand the purpose of the **dplyr** and **tidyr** packages. > > * Select certain columns in a data frame with the **dplyr** function select. > > * Select certain rows in a data frame according to filtering conditions with the **dplyr** function filter . > > * Link the output of one **dplyr** function to the input of another function with the 'pipe' operator %>%. > > * Add new columns to a data frame that are functions of existing columns with mutate. > > * Understand the split-apply-combine concept for data analysis. > > * Use summarize, group_by, and tally to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
> > * Understand the concept of a wide and a long table format and for which purpose those formats are useful. > > * Understand what key-value pairs are. > > * Reshape a data frame from long to wide format and back with the spread and gather commands from the **tidyr** package. ------------ ## Packages Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter **dplyr**. **dplyr** is a package for making tabular data manipulation easier. It pairs nicely with **tidyr**, which enables you to swiftly convert between different data formats for plotting and analysis. Packages in R are basically sets of additional functions that let you do more stuff. The functions we've been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time, you need to install it on your machine, and then load it in each subsequent R session in which you need it. You can install packages within RStudio using the 'Packages' tab in the lower right quadrant, or you can install them directly using the function install.packages(): {r, message = FALSE, purl = FALSE} install.packages("tidyverse") ## install the tidyverse packages  The **tidyverse** package is an "umbrella-package" that installs several packages useful for data analysis that work well together, such as **tidyr**, **dplyr**, **ggplot2**, etc. To load the package type: {r, message = FALSE, purl = FALSE} library("tidyverse") ## load the tidyverse packages, incl. dplyr library("dplyr") #if you just wanted to load one package  By default, you need to load most packages each time you start a new R session. It is useful to load the packages you will need at the top of your script (e.g. add the library() functions), so that if you need to re-run your analyses you don't need to remember to load the packages first.
There are many packages available for R, with new ones being developed every day. Anyone (including you!) can make a package, as long as they follow some [simple guidelines](http://r-pkgs.had.co.nz/). If you think that there might be a package with functions you could use, there probably is. Try googling "R package *thing I need to do*" and see what comes up. ## dplyr for data manipulation #### What are **dplyr** and **tidyr**? The package **dplyr** provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). The package **tidyr** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is nontrivial, and **tidyr** gives you tools for this and more sophisticated data manipulation. To learn more about **dplyr** and **tidyr** after the workshop, you may want to check out this [handy data transformation with **dplyr** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf) and this [one about **tidyr**](https://github.com/rstudio/cheatsheets/blob/master/source/pdfs/data-import-cheatsheet.pdf). ### Selecting columns and filtering rows We're going to learn some of the most common **dplyr** functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys), and the subsequent arguments are the columns to keep.
{r, results = 'hide', purl = FALSE} select(surveys, plot_id, species_id, weight)  To choose rows based on specific criteria, use filter(): {r, purl = FALSE} filter(surveys, year == 1995)  ### Pipes But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes. With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with **dplyr**. If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac. {r, purl = FALSE} surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)  In the above, we use the pipe to send the surveys dataset first through filter() to keep rows where weight is less than 5, then through select() to keep only the species_id, sex, and weight columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don't need to explicitly include it as an argument to the filter() and select() functions anymore.
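To make the contrast between the three approaches concrete, here is a sketch of the same filter-then-select done each way. The data frame surveys_toy is a small made-up stand-in for the surveys data (so the example is self-contained); it assumes **dplyr** is installed:

```r
library(dplyr)  # provides filter(), select(), and %>%

## A tiny made-up stand-in for the surveys data
surveys_toy <- data.frame(
  species_id = c("DM", "PF", "DO"),
  sex        = c("M", "F", "M"),
  weight     = c(4, 6, 3)
)

## 1. Intermediate steps: create a temporary object, then use it
surveys_small <- filter(surveys_toy, weight < 5)
result_steps  <- select(surveys_small, species_id, sex, weight)

## 2. Nested functions: read from the inside out
result_nested <- select(filter(surveys_toy, weight < 5),
                        species_id, sex, weight)

## 3. Pipes: read top to bottom
result_pipe <- surveys_toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

identical(result_steps, result_pipe)  # TRUE: all three are equivalent
```

All three produce the same two-row data frame; the pipe version is usually the easiest to read once the chain gets long.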
If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name: {r, purl = FALSE} surveys_sml <- surveys %>% filter(weight < 5) %>% select(species_id, sex, weight) surveys_sml  Note that the final data frame is the leftmost part of this expression. > ### Challenge > > Using pipes, subset the survey data to include individuals collected before 1995 and retain only the columns year, sex, and weight. <!--- {r, eval=FALSE, purl=FALSE} ## Answer surveys %>% filter(year < 1995) %>% select(year, sex, weight)  ---> ### Mutate Frequently you'll want to create new columns based on the values in existing columns, for example to do unit conversions, or find the ratio of values in two columns. For this we'll use mutate(). To create a new column of weight in kg: {r, purl = FALSE} surveys %>% mutate(weight_kg = weight / 1000)  You can also create a second new column based on the first new column within the same call of mutate(): {r, purl = FALSE} surveys %>% mutate(weight_kg = weight / 1000, weight_kg2 = weight_kg * 2)  If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. (Pipes work with non-**dplyr** functions, too, as long as the **dplyr** or magrittr package is loaded). {r, purl = FALSE} surveys %>% mutate(weight_kg = weight / 1000) %>% head  Note that we don't include parentheses at the end of our call to head() above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. head or head()). The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain: {r, purl = FALSE} surveys %>% filter(!is.na(weight)) %>% mutate(weight_kg = weight / 1000) %>% head  is.na() is a function that determines whether something is an NA. The ! symbol negates the result, so we're asking for everything that *is not* an NA. 
> ### Challenge > > Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_half containing values that are half the hindfoot_length values. In this hindfoot_half column, there are no NAs and all values are less than 30. > > **Hint**: think about how the commands should be ordered to produce this data frame! <!--- {r, eval=FALSE, purl=FALSE} ## Answer surveys_hindfoot_half <- surveys %>% filter(!is.na(hindfoot_length)) %>% mutate(hindfoot_half = hindfoot_length / 2) %>% filter(hindfoot_half < 30) %>% select(species_id, hindfoot_half)  ---> ### Split-apply-combine data analysis and the summarize() function Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results. **dplyr** makes this very easy through the use of the group_by() function. #### The summarize() function group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the **categorical** variables for which you want to calculate the summary statistics. So to view the mean weight by sex: {r, purl = FALSE} surveys %>% group_by(sex) %>% summarize(mean_weight = mean(weight, na.rm = TRUE))  You may also have noticed that the output from these calls doesn't run off the screen anymore. That's because **dplyr** has changed our data.frame object to an object of class tbl_df, also known as a "tibble". Tibble's data structure is very similar to a data frame. For our purposes the only differences are that, (1) in addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen, (2) columns of class character are never converted into factors. 
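If you ever want plain data frame behavior back, you can convert a tibble with as.data.frame(). Here is a sketch using a small made-up data frame (toy is invented for illustration; it assumes **dplyr** is installed):

```r
library(dplyr)  # group_by()/summarize() return tibbles

toy <- data.frame(sex = c("M", "F", "M"), weight = c(4, 6, 3))

means <- toy %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight))

inherits(means, "tbl_df")         # TRUE: summarize() returned a tibble
means_df <- as.data.frame(means)  # convert back to a plain data.frame
class(means_df)
```

This can be handy when passing results to older functions that expect a classic data.frame.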
You can also group by multiple columns: {r, purl = FALSE} surveys %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight, na.rm = TRUE))  When grouping both by sex and species_id, the first rows are for individuals that escaped before their sex could be determined and weighed. You may notice that the last column does not contain NA but NaN (which refers to "Not a Number"). To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed, we can omit na.rm = TRUE when computing the mean: {r, purl = FALSE} surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight))  Here, again, the output from these calls doesn't run off the screen anymore. Recall that **dplyr** has changed our object from data.frame to tbl_df. If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display: {r, purl = FALSE} surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight)) %>% print(n = 15)  Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex: {r, purl = FALSE} surveys %>% filter(!is.na(weight)) %>% group_by(sex, species_id) %>% summarize(mean_weight = mean(weight), min_weight = min(weight))  ### Tallying When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, **dplyr** provides tally().
For example, if we wanted to group by sex and find the number of rows of data for each sex, we would do: {r, purl = FALSE} surveys %>% group_by(sex) %>% tally  Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category. > ### Challenge > > 1. How many individuals were caught in each plot_type surveyed? > > 2. Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id). > > 3. What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight. > > 4. You saw above how to count the number of individuals of each sex using a combination of group_by() and tally(). How could you get the same result using group_by() and summarize()? Hint: see ?n. <!--- {r, echo=FALSE, purl=FALSE} ## Answer 1 surveys %>% group_by(plot_type) %>% tally ## Answer 2 surveys %>% filter(!is.na(hindfoot_length)) %>% group_by(species_id) %>% summarize( mean_hindfoot_length = mean(hindfoot_length), min_hindfoot_length = min(hindfoot_length), max_hindfoot_length = max(hindfoot_length) ) ## Answer 3 surveys %>% filter(!is.na(weight)) %>% group_by(year) %>% filter(weight == max(weight)) %>% select(year, genus, species, weight) %>% arrange(year) ## Answer 4 surveys %>% group_by(sex) %>% summarize(n = n())  ---> ## Exporting data Now that you have learned how to use **dplyr** to extract information from or summarize your raw data, you may want to export these new datasets. Similar to the read.csv() function used for reading CSV files into R, there is a write.csv() function that generates CSV files from data frames. Before using write.csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don't want to write generated datasets in the same directory as our raw data. It's good practice to keep them separate. 
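One way to create that folder from within R, so the script is self-contained (a sketch using base R's dir.exists() and dir.create(); no extra packages needed):

```r
## Create data_output/ if it does not already exist, so that
## write.csv() has somewhere to put the generated file.
if (!dir.exists("data_output")) {
  dir.create("data_output")
}
```

You could also create the folder by hand in your file manager; the R version just makes the script reproducible on a fresh machine.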
The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don't delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them. To try this out, we can prepare a version of the dataset that doesn't include any missing data. Let's start by removing observations for which the species_id is missing. In this dataset, the missing species are represented by an empty string and not an NA. Let's also remove observations for which weight and the hindfoot_length are missing. This dataset should also only contain observations of animals for which the sex has been determined: {r, purl=FALSE} surveys_complete <- surveys %>% filter(species_id != "", # remove missing species_id !is.na(weight), # remove missing weight !is.na(hindfoot_length), # remove missing hindfoot_length sex != "") # remove missing sex  We might be interested in plotting how species abundances have changed through time, in which case we could remove observations for rare species (i.e., that have been observed fewer than 50 times).
We can do this in two steps: first we are going to create a dataset that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species: {r, purl=FALSE} ## Extract the most common species_id species_counts <- surveys_complete %>% group_by(species_id) %>% tally %>% filter(n >= 50) ## Only keep the most common species surveys_complete <- surveys_complete %>% filter(species_id %in% species_counts$species_id)  {r, eval=FALSE, purl=TRUE, echo=FALSE} ### Create the dataset for exporting: ## Start by removing observations for which the species_id, weight, ## hindfoot_length, or sex data are missing: surveys_complete <- surveys %>% filter(species_id != "", # remove missing species_id !is.na(weight), # remove missing weight !is.na(hindfoot_length), # remove missing hindfoot_length sex != "") # remove missing sex ## Now remove rare species in two steps. First, make a list of species which ## appear at least 50 times in our dataset: species_counts <- surveys_complete %>% group_by(species_id) %>% tally %>% filter(n >= 50) %>% select(species_id) ## Second, keep only those species: surveys_complete <- surveys_complete %>% filter(species_id %in% species_counts$species_id)  Now that the dataset is ready, we can save it as a CSV file in our data_output folder. By default, write.csv() includes a column with row names (in our case the names are just the row numbers), so we need to add row.names = FALSE so they are not included: {r, purl=FALSE, eval=FALSE} write.csv(surveys_complete, file = "data_output/surveys_complete.csv", row.names = FALSE)  ## Homework Take your data (or the surveys data) and create a filtered, subset, grouped by, or otherwise manipulated data frame. Now export it using write.csv(). ## Extra credit (not covered in class) ### Reshaping with gather and spread **dplyr** is one part of a larger **tidyverse** that enables you to work with data in tidy data formats.
**tidyr** enables a wide range of manipulations of the structure of the data itself. For example, the survey data presented here is almost in what we call a **long** format - every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots. For example, what is the relationship between mean weights of different genera across the entire data set? To answer that question, we'd want each plot to have a single row, with all of the measurements in a single plot having their own column. This is called a **wide** data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we might begin to have some relationships we'd want to examine. Let's see this in action. First, using **dplyr**, let's create a data frame with the mean body weight of each genus by plot. {r, purl=FALSE} surveys_gw <- surveys %>% filter(!is.na(weight)) %>% group_by(genus, plot_id) %>% summarize(mean_weight = mean(weight)) head(surveys_gw)  #### Long to Wide with spread Now, to make this long data wide, we use spread from tidyr to spread out the different taxa into columns. spread takes three arguments: the data, the *key* column (the column with identifying information), and the *value* column (the one with the numbers). We'll use a pipe so we can ignore the data argument. {r, purl=FALSE} surveys_gw_wide <- surveys_gw %>% spread(genus, mean_weight) head(surveys_gw_wide)  Notice that some genera have NA values. That's because some of those genera don't have any record in that plot. Sometimes it is fine to leave those as NA. Sometimes we want to fill them as zeros, in which case we would add the argument fill=0.
{r, purl=FALSE} surveys_gw %>% spread(genus, mean_weight, fill = 0) %>% head  We can now do things like plot the weight of *Baiomys* against *Chaetodipus* or examine their correlation. {r, purl=FALSE} surveys_gw %>% spread(genus, mean_weight, fill = 0) %>% cor(use = "pairwise.complete")  #### Wide to long with gather What if we had the opposite problem, and wanted to go from a wide to long format? For that, we use gather to sweep up a set of columns into one key-value pair. We give it the arguments of a new key and value column name, and then we specify which columns we either want or do not want gathered up. So, to go backwards from surveys_gw_wide, and exclude plot_id from the gathering, we would do the following: {r, purl=FALSE} surveys_gw_long <- surveys_gw_wide %>% gather(genus, mean_weight, -plot_id) head(surveys_gw_long)  Note that now the NA genera are included in the long format. Going from wide to long to wide can be a useful way to balance out a dataset so every replicate has the same composition. We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it's easier to specify what to gather than what to leave alone. And if the columns are in a row, we don't even need to list them all out - just use the : operator! {r, purl=FALSE} surveys_gw_wide %>% gather(genus, mean_weight, Baiomys:Spermophilus) %>% head  > ### Challenge > > 1. Make a wide data frame with year as columns, plot_id as rows, and the values are the number of genera per plot. You will need to summarize before reshaping, and use the function n_distinct to get the number of unique types of a genera. It's a powerful function! See ?n_distinct for more. > > 2. Now take that data frame, and make it long again, so each row is a unique plot_id year combination. > > 3. The surveys data set is not truly wide or long because there are two columns of measurement - hindfoot_length and weight. 
This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let's walk through a common solution for this type of problem. First, use gather to create a truly long dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You'll need to specify which columns are being gathered. > > 4. With this new truly long data set, calculate the average of each measurement in each year for each different plot_type. Then spread them into a wide data set with columns for hindfoot_length and weight. Hint: Remember, you only need to specify the key and value columns for spread. <!--- {r, echo=FALSE, purl=FALSE} ## Answer 1 rich_time <- surveys %>% group_by(plot_id, year) %>% summarize(n_genera = n_distinct(genus)) %>% spread(year, n_genera) head(rich_time) ## Answer 2 rich_time %>% gather(year, n_genera, -plot_id) ## Answer 3 surveys_long <- surveys %>% gather(measurement, value, hindfoot_length, weight) ## Answer 4 surveys_long %>% group_by(year, measurement, plot_type) %>% summarize(mean_value = mean(value, na.rm=TRUE)) %>% spread(measurement, mean_value)  ---> ## Aug 17: Data visualization ------------ > ### Learning Objectives > > * Produce scatter plots, boxplots, and time series plots using ggplot. > * Set universal plot settings. > * Modify the aesthetics of an existing ggplot plot (including axis labels and color). > * Build complex and customized plots from data in a data frame. -------------- We start by loading the required packages. **ggplot2** is included in the **tidyverse** package. {r load-package, message=FALSE, purl=FALSE} library(tidyverse) # alternately, you can just load the ggplot2 package library(ggplot2)  If not still in the workspace, load the data we saved in the previous lesson.
{r load-data, eval=FALSE, purl=FALSE} surveys_complete <- read.csv('data_output/surveys_complete.csv')  ## Plotting with **ggplot2** **ggplot2** is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties, so we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking. ggplot likes data in the 'long' format: i.e., a column for every dimension, and a row for every observation. Well structured data will save you lots of time when making figures with ggplot. ggplot graphics are built step by step by adding new elements. To build a ggplot we need to: - bind the plot to a specific data frame using the data argument {r, eval=FALSE, purl=FALSE} ggplot(data = surveys_complete)  - define aesthetics (aes), by selecting the variables to be plotted and the variables to define the presentation such as plotting size, shape, color, etc. {r, eval=FALSE, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))  - add geoms -- graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot, use the + operator {r first-ggplot, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point()  The + in the **ggplot2** package is particularly useful because it allows you to modify existing ggplot objects.
This means you can easily set up plot "templates" and conveniently explore different types of plots, so the above plot can also be generated with code like this: {r, first-ggplot-with-plus, eval=FALSE, purl=FALSE} # Assign plot to a variable surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) # Draw the plot surveys_plot + geom_point()  {r, eval=FALSE, purl=TRUE, echo=FALSE} ## Create a ggplot and draw it. surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) surveys_plot + geom_point()  Notes: - Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x and y axis you set up in aes(). - You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function. - The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is added in the line before the other layer, **ggplot2** will not add the new layer and will return an error message. {r, ggplot-with-plus-position, eval=FALSE, purl=FALSE}
# this is the correct syntax for adding layers
surveys_plot +
  geom_point()

# this will not add the new layer and will return an error message
surveys_plot
+ geom_point()
 ## Building your plots iteratively Building plots with ggplot is typically an iterative process. We start by defining the dataset we'll use, lay out the axes, and choose a geom: {r create-ggplot-object, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point()  Then, we start modifying this plot to extract more information from it.
For instance, we can add transparency (alpha) to avoid overplotting: {r adding-transparency, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point(alpha = 0.1)  We can also add colors for all the points: {r adding-colors, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point(alpha = 0.1, color = "blue")  Or to color each species in the plot differently: {r color-by-species, purl=FALSE} ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point(alpha = 0.1, aes(color=species_id))  > ### Challenge > > Use what you just learned to create a scatter plot of weight over species_id with the plot types showing in different colors. Is this a good way to show this type of data? ## Boxplot We can use boxplots to visualize the distribution of weight within each species: {r boxplot, purl=FALSE} ggplot(data = surveys_complete, aes(x = species_id, y = weight)) + geom_boxplot()  By adding points to the boxplot, we can have a better idea of the number of measurements and of their distribution: {r boxplot-with-points, purl=FALSE} ggplot(data = surveys_complete, aes(x = species_id, y = weight)) + geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.3, color = "tomato")  Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it's not hidden? > ### Challenges > > Boxplots are useful summaries, but hide the *shape* of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn. > > - Replace the box plot with a violin plot; see geom_violin(). > > In many types of data, it is important to consider the *scale* of the observations.
For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making these modifications: > > - Represent weight on the log10 scale; see scale_y_log10(). > > So far, we've looked at the distribution of weight within species. Try making a new plot to explore the distribution of another variable within each species. > > - Create a boxplot for hindfoot_length. Overlay the boxplot layer on a jitter layer to show actual measurements. > > - Add color to the datapoints on your boxplot according to the plot from which the sample was taken (plot_id). > > Hint: Check the class for plot_id. Consider changing the class of plot_id from integer to factor. Why does this change how R makes the graph? ## Plotting time series data Let's calculate the number of counts per year for each species. First we need to group the data and count records within each group: {r, purl=FALSE} yearly_counts <- surveys_complete %>% group_by(year, species_id) %>% tally  Timelapse data can be visualized as a line plot with years on the x axis and counts on the y axis: {r first-time-series, purl=FALSE} ggplot(data = yearly_counts, aes(x = year, y = n)) + geom_line()  Unfortunately, this does not work because we plotted data for all the species together.
We need to tell ggplot to draw a line for each species by modifying the aesthetic function to include group = species_id: {r time-series-by-species, purl=FALSE} ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id)) + geom_line()  We will be able to distinguish species in the plot if we add colors (using color also automatically groups the data): {r time-series-with-colors, purl=FALSE} ggplot(data = yearly_counts, aes(x = year, y = n, color = species_id)) + geom_line()  ## Faceting ggplot has a special technique called *faceting* that allows the user to split one plot into multiple plots based on a factor included in the dataset. We will use it to make a time series plot for each species: {r first-facet, purl=FALSE} ggplot(data = yearly_counts, aes(x = year, y = n)) + geom_line() + facet_wrap(~ species_id)  Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the data frame grouped by year, species_id, and sex: {r, purl=FALSE} yearly_sex_counts <- surveys_complete %>% group_by(year, species_id, sex) %>% tally  We can now make the faceted plot by splitting further by sex using color (within a single plot): {r facet-by-species-and-sex, purl=FALSE} ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) + geom_line() + facet_wrap(~ species_id)  Usually plots with a white background look more readable when printed. We can set the background to white using the function theme_bw(). Additionally, you can remove the grid: {r facet-by-species-and-sex-white-bg, purl=FALSE} ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) + geom_line() + facet_wrap(~ species_id) + theme_bw() + theme(panel.grid = element_blank())  ## **ggplot2** themes In addition to theme_bw(), **ggplot2** comes with several other themes which can be useful to quickly change the look of your visualization.
The complete list of themes is available at <http://docs.ggplot2.org/current/ggtheme.html>. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme. The [ggthemes](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html) package provides a wide variety of options (including an Excel 2003 theme). The [**ggplot2** extensions website](https://www.ggplot2-exts.org) provides a list of packages that extend the capabilities of **ggplot2**, including additional themes. > ### Challenge > Use what you just learned to create a plot that depicts how the average weight of each species changes through the years. <!-- Answer {r average-weight-time-series, purl=FALSE} yearly_weight <- surveys_complete %>% group_by(year, species_id) %>% summarize(avg_weight = mean(weight)) ggplot(data = yearly_weight, aes(x=year, y=avg_weight)) + geom_line() + facet_wrap(~ species_id) + theme_bw()  --> The facet_wrap geometry extracts plots into an arbitrary number of dimensions to allow them to cleanly fit on one page. On the other hand, the facet_grid geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column). Let's modify the previous plot to compare how the weights of males and females have changed through time: {r average-weight-time-facet-sex-rows, purl=FALSE} # One column, facet by rows yearly_sex_weight <- surveys_complete %>% group_by(year, sex, species_id) %>% summarize(avg_weight = mean(weight)) ggplot(data = yearly_sex_weight, aes(x=year, y=avg_weight, color = species_id)) + geom_line() + facet_grid(sex ~ .)  {r average-weight-time-facet-sex-columns, purl=FALSE} # One row, facet by column ggplot(data = yearly_sex_weight, aes(x=year, y=avg_weight, color = species_id)) + geom_line() + facet_grid(.
~ sex)  > ### Challenge > With all of this information in hand, please take a few minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio [**ggplot2** cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for inspiration. > Here are some ideas: > * See if you can change the thickness of the lines. > * Can you find a way to change the name of the legend? What about its labels? > * Try using a different color palette (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/). After creating your plot, you can save it to a file in your favorite format. You can easily change the dimension (and resolution) of your plot by adjusting the appropriate arguments (width, height and dpi): {r ggsave-example, eval=FALSE, purl=FALSE} my_plot <- ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) + geom_line() + facet_wrap(~ species_id) + labs(title = 'Observed species in time', x = 'Year of observation', y = 'Number of species') + theme_bw() + theme(axis.text.x = element_text(colour="grey20", size=12, angle=90, hjust=.5, vjust=.5), axis.text.y = element_text(colour="grey20", size=12), text=element_text(size=16)) ggsave("name_of_file.png", my_plot, width=15, height=10)  Note: The parameters `width` and `height` also determine the font size in the saved plot. ## Go forth and make more beautiful plots from well-formatted data files!