Software Carpentry @ SCSU, May 3-4 2018
June Maselli Mol Bio YU
Betsy Roberts - SCSU Biology (Organizer)
Conor Gilligan - Fordham University
Faruk Senturk - SCSU Biology
Chris Wisniewski - SCSU Biology
Rebecca Hedreen - SCSU Biology/Library
Mary Mattheis - SCSU
Chaitanya Kantak - Yale SOM
Anthony Dunston - SCSU - parked 5A
Rebecca Adams - Yale
Steven Brady - SCSU (Organizer)
Megan Sullivan - Yale FES
Carol Henger - Fordham University
Nicholas Edgington - SCSU Biology (Organizer)
Catherine DeRose - Yale (Helper)
Joshua Dull - Yale (Helper)
Kate Nyhan - Yale (Helper)
Niel Infante - West Virginia University (Leader)
Gabriella DiPreta - SCSU Biology
Day 1: The Bash Shell
Download the example [data set](https://swcarpentry.github.io/shell-novice/data/data-shell.zip), save it to your `Desktop` and unzip it using the tool of your choice.
**Shell challenge #1**
1. I want to use `ls` to list all of the files in all the directories under this one. This is called "recursion". Figure out which flag tells `ls` to list directories recursively.
2. I want to use a shell command to download a file from a webserver. What command should I use?
**Shell challenge #2**
Starting in `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`, give four different commands that change the working directory back to the home directory. *Tip: you can test your commands, then get back to the data directory with* `cd -`
**Shell challenge #3**
1. We're going to need the file `~/Desktop/data-shell/data/amino-acids.txt` for our analysis. Copy it into `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`
2. Make a new directory called `scripts`. Move `goodiff` and `goostats` into it.
3. Check your computer's `Trash Can` or `Recycling` for the `README.txt` file that you removed. Is it there?
4. Make a directory called `backup`, then run the command `cp *.txt backup/`. Look in the `backup` directory. Can you tell what the `*` character means?
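To see what the shell does with `*` before a command runs, here is a small sketch that uses throwaway files in a temporary directory. The file names are made up for illustration and are not part of the data set.

```shell
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical files: two that end in .txt and one that does not.
touch notes.txt results.txt image.dat

# The shell replaces *.txt with the matching names before echo runs,
# so echo simply prints the arguments it was handed.
matched=$(echo *.txt)
echo "$matched"
```

Swapping `echo` for `cp` (with a destination directory at the end) works the same way: `cp` never sees the `*`, only the list of names the shell substituted for it.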
**Shell challenge #4**
1. Create a file named `hydrocarbons.txt` that contains the length of each file ending in `ane.pdb`.
2. Sort the file **so that the longest pdb file (i.e., the one with the most lines) is at the top.** *Hint: how would you learn about more flags to the sort command?*
3. Show the contents of `ammonia.pdb`, and the contents of `methane.pdb`. Now, try typing `cat ammonia.pdb methane.pdb`. Why is `cat` short for "concatenate"?
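As background for the sorting step, the sketch below shows how `sort` treats numbers with and without the `-n` and `-r` flags (both are listed in `man sort`). It uses three made-up numbers rather than the real line counts.

```shell
# Plain sort compares text character by character, so "100" lands before "9".
text_order=$(printf '10\n9\n100\n' | sort)

# -n compares the values as numbers instead.
numeric_order=$(printf '10\n9\n100\n' | sort -n)

# -r reverses the order, so combined with -n the largest number comes first.
largest_first=$(printf '10\n9\n100\n' | sort -n -r)

echo "$largest_first"
```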
**Shell challenge #5**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Are all the files the same length?
2. How many total measurements were taken by the analyzer? (I.e., how many lines are in ALL the files?)
3. The `echo` command takes whatever you give it on the command line and sends it out its standard output. Try it out by typing `echo hello workshop!`
4. Show the contents of your `README` file to remind yourself what is in there. Now run `echo 2017-07-08 Found a short file >> README` and check its contents again. What just happened? What did the `>>` operator do to the output of the `echo` command?
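For comparison, here is a minimal sketch of the difference between `>` and `>>`, run on a scratch file in a temporary directory. The name `notes.txt` is hypothetical and stands in for any file.

```shell
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# '>' creates the file, or replaces its contents if it already exists.
echo "first note" > notes.txt
echo "second note" > notes.txt    # "first note" is gone now

# '>>' adds the new output to the end of the file instead.
echo "third note" >> notes.txt

cat notes.txt
```

Because `>` silently overwrites, `>>` is the safer choice when keeping a running log.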
**Shell challenge #6**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Create a file named `chemical-1.dat` that contains the *first line* of every data file in the directory.
2. Create a file named `chemical-3.dat` that contains the *third line* of every data file in the directory.
3. Remove `chemical-1.dat` and `chemical-3.dat`. With a single command, can you figure out how to make similar files for each of the first 10 lines? *Hint: you can nest loops!*
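One possible shape for a nested loop is sketched below on two made-up three-line files rather than the real data. It assumes `head -n N | tail -n 1` as the way to pick out line N; the file names `sampleA.txt`, `sampleB.txt` and `line-N.dat` are hypothetical.

```shell
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Two hypothetical data files with three lines each.
printf 'a1\na2\na3\n' > sampleA.txt
printf 'b1\nb2\nb3\n' > sampleB.txt

# Outer loop: which line number to extract.
for n in 1 2 3
do
    # Inner loop: visit every data file and print its n-th line.
    for f in sample*.txt
    do
        head -n "$n" "$f" | tail -n 1
    done > "line-$n.dat"    # everything the inner loop prints goes to one file
done

cat line-2.dat
```

Putting the redirect after the inner `done` collects the output of the whole inner loop, so each output file gets one line per input file.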
**Shell challenge #7**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Run the shell script `scripts/goostats` on every data file in the directory. The `goostats` script takes two arguments, an input file and an output file. Write the output files to the `analysis` directory.
2. Create a shell script that takes a list of file names and returns the shortest file.
3. Run `goostats` specifying an input file but no output file. The error it gives you is somewhat ... cryptic. Run `bash -x scripts/goostats NENE02040Z.txt` -- how does the `-x` flag help you debug the script?
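To get a feel for `bash -x` before trying it on `goostats`, the sketch below traces a tiny made-up script. `greet.sh` is hypothetical and only exists for this demonstration.

```shell
# Write a two-line script into a throwaway directory.
demo_dir=$(mktemp -d)
cat > "$demo_dir/greet.sh" <<'EOF'
name=$1
echo "hello $name"
EOF

# bash -x prints each command, prefixed with '+', just before running it.
# The trace goes to standard error, so 2>&1 is needed to capture it.
trace=$(bash -x "$demo_dir/greet.sh" workshop 2>&1)
printf '%s\n' "$trace"
```

The `+` lines show each command after variables have been substituted, which makes it easier to spot an argument that turned out empty.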
Introduction to RStudio
**RStudio challenge #1**
Which of the following are valid R variable names?
**RStudio challenge #2**
What will be the value of each variable after each statement in the following program?
mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
**RStudio challenge #3**
Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?
**RStudio challenge #4**
Install the following packages: ggplot2, dplyr, gapminder
**RStudio challenge #5**
Modify the example so that the figure shows how life expectancy has changed over time:
```ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()```
Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.
**RStudio challenge #6**
In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?
**RStudio challenge #7**
Switch the order of the point and line layers from the previous example. What happened?
**RStudio challenge #8**
Modify the color and size of the points on the point layer in the previous example.
Hint: do not use the aes function.
Modify your solution to the above so that the points are now a different shape and are colored by continent with new trendlines. Hint: The color argument can be used inside the aesthetic.
**RStudio challenge #9**
Create a density plot of GDP per capita, filled by continent.
Transform the x axis to better visualise the data spread.
Add a facet layer to panel the density plots by year.
**RStudio challenge #10**
Write a single command (which can span multiple lines and include pipes) that will produce a dataframe that has the African values for lifeExp, country and year, but not for other continents. How many rows does your dataframe have and why?
**RStudio challenge #11**
Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?
**RStudio challenge #12**
Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(); they have similar syntax to other dplyr functions.
**R challenge THE LAST**
Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that temperature in Celsius.
Hint: To convert from Kelvin to Celsius you subtract 273.15.
Define a function that converts directly from Fahrenheit to Celsius by reusing the two existing functions (Fahrenheit to Kelvin, and Kelvin to Celsius).
Swirl is an interactive way to learn R!
[swirl website](http://swirlstats.com/students.html) - about swirl
[swirl course network](http://swirlstats.com/scn/surname.html) - download courses
[swirl + online intro r course](http://www.simonqueenborough.info/R/basic/index.html) - has lecture notes and lab exercises to go with swirl lessons
[cheatsheets from the RStudio website](https://www.rstudio.com/resources/cheatsheets/)
[for plyr/dplyr/tidyr](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) - because the English version is missing from the RStudio site