SWC @ UMN
===
Hello! My name is Brian! Welcome to SWC!
There are two major "flavors" of Unix OS:
* BSD (Berkeley Software Distribution) - a descendant of the original Unix from Bell Labs, developed at the University of California, Berkeley; macOS is based on BSD Unix
* GNU (recursive acronym "GNU's Not Unix") - a free-software reimplementation of Unix started by Richard Stallman in 1983. Linus Torvalds began developing the Linux kernel in 1991, and most Linux distributions pair that kernel with the GNU tools.
You can specify a range of integers for a bash for loop with
~~~
for x in {1..10}; do
## do something
done
~~~
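
As a minimal sketch of the range loop above in use, the following sums the integers 1 through 10. The variable names (`total`, `n`, `seq_total`) are illustrative. It also shows `seq` as one possible alternative when the upper bound is held in a variable, since bash performs brace expansion before variable expansion.

```shell
#!/usr/bin/env bash
# Sum the integers 1..10 using a brace-expansion range.
total=0
for x in {1..10}; do
  total=$((total + x))
done
echo "$total"        # prints 55

# Brace expansion happens before variable expansion, so {1..$n}
# is not turned into a range. seq is one alternative in that case.
n=3
seq_total=0
for x in $(seq 1 "$n"); do
  seq_total=$((seq_total + x))
done
echo "$seq_total"    # prints 6
```
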
---
ggplot cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
"The Subtleties of Color"
https://earthobservatory.nasa.gov/blogs/elegantfigures/2013/08/05/subtleties-of-color-part-1-of-6/
A Not-So-Short Introduction to LaTeX: https://tobi.oetiker.ch/lshort/lshort.pdf
A short introduction to LaTeX (less hilarious): http://ricardo.ecn.wfu.edu/~cottrell/ecn297/latex_tut.pdf
Git book
https://git-scm.com/book/en/v2
---
Shell lesson
---
**Shell quick-reference** -- http://swcarpentry.github.io/shell-novice/reference/
**Setup**
1. Download the shell lesson data: http://swcarpentry.github.io/shell-novice/data/shell-novice-data.zip
2. Unzip it on your desktop
**Shell challenge #1**
1. I want to use `ls` to list all of the files in all the directories under this one. This is called "recursion". Figure out which flag tells `ls` to list directories recursively.
2. I want to use a shell command to download a file from a webserver. What command should I use?
**Shell challenge #2**
Starting in `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`, give four different commands that change the working directory back to the home directory. *Tip: you can test your commands, then get back to the data directory with* **cd -**
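
The `cd -` tip can be tried safely in a throwaway directory first. This is a small sketch, assuming `mktemp` is available; `start_dir` and `back_dir` are names made up for the illustration.

```shell
# Work in a temporary directory so nothing real is touched.
start_dir=$(mktemp -d)
cd "$start_dir"

cd /                  # move somewhere else
cd - > /dev/null      # jump back to the previous directory
                      # (cd - prints the directory it switches to, hidden here)
back_dir=$PWD
echo "$back_dir"      # same path as $start_dir
```
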
**Shell challenge #3**
1. We're going to need the file `~/Desktop/data-shell/data/amino-acids.txt` for our analysis. Copy it into `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`.
2. Make a new directory called `scripts`. Move `goodiff` and `goostats` into it.
3. Check your computer's `Trash Can` or `Recycling` for the `README.txt` file that you removed. Is it there?
4. Make a directory called `backup`, then run the command `cp *.txt backup/`. Look in the `backup` directory. Can you tell what the `*` character means?
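
One way to explore what the shell does with `*` is to `echo` the pattern before using it in a real command. The sketch below uses a temporary directory and made-up file names (`notes.txt`, `data.txt`, `image.png`), not the lesson data.

```shell
# Set up a scratch directory with a few hypothetical files.
work=$(mktemp -d)
cd "$work"
touch notes.txt data.txt image.png
mkdir backup

# The shell replaces the pattern with matching names before the command runs.
echo *.txt            # prints: data.txt notes.txt

# cp therefore receives the two .txt names, not the literal string "*.txt".
cp *.txt backup/
copied_count=$(ls backup | wc -l | tr -d ' ')
echo "$copied_count"  # prints 2
```
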
**Shell challenge #4**
1. Create a file named `hydrocarbons.txt` that contains the length of each file ending in `ane.pdb`.
2. Sort the file **so that the longest pdb file is at the top.** *Hint: how would you learn about more flags to the sort command?*
3. Show the contents of `ammonia.pdb`, and the contents of `methane.pdb`. Now, try typing `cat ammonia.pdb methane.pdb`. Why is `cat` short for "concatenate"?
**Shell challenge #5**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Are all the files the same length?
2. How many total measurements were taken by the analyzer? (That is, how many lines are in ALL the files?)
3. The `echo` command takes whatever you give it on the command line and sends it out its standard output. Try it out by typing `echo hello workshop!`
4. Show the contents of your `README` file to remind yourself what is in there. Now say `echo 2017-07-08 Found a short file >> README` and check its contents again. What just happened? What did the `>>` operator do to the output of the `echo` command?
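
To compare the two redirection operators side by side, here is a small sketch that writes to a temporary file (the name `log` and the text are illustrative, not part of the lesson data).

```shell
log=$(mktemp)

echo "first line"  >  "$log"   # > creates the file or overwrites it
echo "second line" >> "$log"   # >> appends to the end of the file
echo "third line"  >> "$log"

line_count=$(wc -l < "$log" | tr -d ' ')
echo "$line_count"             # prints 3

echo "start over" > "$log"     # > again: the earlier lines are gone
after_overwrite=$(cat "$log")
echo "$after_overwrite"        # prints: start over
```
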
**Shell challenge #6**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Create a file named `chemical-1.dat` that contains the *first line* of every data file (the files named NE*.txt) in the directory.
2. Create a file named `chemical-3.dat` that contains the *third line* of every data file in the directory.
3. Remove `chemical-1.dat` and `chemical-3.dat`. With a single command, can you figure out how to make similar files for each of the first 10 lines? *Hint: you can nest loops!*
**Shell challenge #7**
1. Return to `~/Desktop/data-shell/north-pacific-gyre/2012-07-03`. Run the shell script `scripts/goostats` on every data file in the directory. The `goostats` script takes two arguments, an input file and an output file. Write the output files to the `analysis` directory.
2. Create a shell script that takes a list of file names and returns the shortest file.
3. Run `goostats` specifying an input file but no output file. The error it gives you is somewhat ... cryptic. Run `bash -x scripts/goostats NENE02040Z.txt` -- how does the `-x` flag help you debug the script?
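
To see what `bash -x` does without touching `goostats`, here is a sketch using a tiny hypothetical script that, like `goostats`, expects an input and an output argument. The script contents and the file name `data.txt` are made up for illustration.

```shell
# Write a small stand-in script to a temporary file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
input=$1
output=$2
echo "reading $input, writing to '$output'"
EOF

# Normal run: only the script's own output appears.
bash "$demo" data.txt

# Traced run: bash prints each command to stderr, prefixed with "+",
# after variables are expanded. The empty $output becomes visible.
trace=$(bash -x "$demo" data.txt 2>&1 >/dev/null)
echo "$trace"
```

In the trace, the line `+ output=` shows that the second argument was never supplied, which is the kind of clue a cryptic error message tends to hide.
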
Reproducible Science with R
----
Location to download Gapminder data https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv
**R intro challenge #1**
1. Which of the following are valid R variable names?
* min_height
* max.height
* _age
* .mass
* MaxLength
* min-length
* 2widths
* celsius2kelvin
2. What will be the value of each variable after each statement in the following program?
* `mass <- 47.5`
* `age <- 122`
* `mass <- mass * 2.3`
* `age <- age - 20`
3. Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?
4. Clean up your working environment by deleting the mass and age variables.
5. Install the following packages: ggplot2, plyr, gapminder
**R seeking help challenge #1**
1. Look at the help for the `c` function. What kind of vector do you expect you will create if you evaluate the following:
* `c(1, 2, 3)`
* `c('d', 'e', 'f')`
* `c(1, 2, 'f')`
2. Look at the help for the `paste` function. You’ll need to use this later. What is the difference between the `sep` and `collapse` arguments?
**R data structures challenge #1**
1. Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built in vector called LETTERS)
2. Is there a factor in our cats data.frame? What is its name? Try using `?read.csv` to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the factor in cats is actually a character vector when loaded in this way.
3. There are several subtly different ways to call variables, observations and elements from data.frames:
* `cats[1]`
* `cats[[1]]`
* `cats$coat`
* `cats["coat"]`
* `cats[1, 1]`
* `cats[, 1]`
* `cats[1, ]`
4. What do you think will be the result of `length(matrix_example)`? Try it. Were you right? Why / why not?
5. Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the `matrix` function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for `matrix`)
**R exploring data frames challenge #1**
1. Let’s imagine that, like dogs, 1 human year is equivalent to 7 cat years. (The Purina company uses a more sophisticated algorithm).
1. Create a vector called `human.age` by multiplying `cats$age` by 7.
2. Convert `human.age` to a factor.
3. Convert `human.age` back to a numeric vector using the `as.numeric()` function. Now divide it by 7 to get back the original ages. Explain what happened.
**R exploring data frames challenge #2**
You can create a new data frame right from within R with the following syntax:
`df <- data.frame(id = c('a', 'b', 'c'),
x = 1:3,
y = c(TRUE, TRUE, FALSE),
stringsAsFactors = FALSE)`
Make a data frame that holds the following information for yourself:
first name
last name
lucky number
Then use `rbind` to add an entry for the people sitting beside you. Finally, use `cbind` to add a column with each person’s answer to the question, “Is it time for coffee break?”
**R exploring data frames challenge #3**
1. It’s good practice to also check the last few lines of your data and some in the middle. How would you do this? Searching for ones specifically in the middle isn’t too hard but we could simply ask for a few lines at random. How would you code this?
2. Go to file -> new file -> R script, and write an R script to load in the gapminder dataset. Put it in the `scripts/` directory and add it to version control. Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).
3. Read the output of `str(gapminder)` again; this time, use what you’ve learned about factors, lists and vectors, as well as the output of functions like `colnames` and `dim`, to explain what everything that `str` prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors.
**R subsetting data challenge #1**
1. Given the following code:
`x <- c(5.4, 6.2, 7.1, 4.8, 7.5)`
`names(x) <- c('a', 'b', 'c', 'd', 'e')`
`print(x)`
Come up with at least 3 different commands that will produce the following output:
```
  b   c   d
6.2 7.1 4.8
```
After you find 3 different commands, compare notes with your neighbour. Did you have different strategies?
2. Given the following code:
```
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
```
Write a subsetting command to return the values in x that are greater than 4 and less than 7.
3. Selecting elements of a vector that match any of a list of components is a very common data analysis task. For example, the gapminder data set contains country and continent variables, but no information between these two scales. Suppose we want to pull out information from southeast Asia: how do we set up an operation to produce a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise?
Suppose you have these data:
```
seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## read in the gapminder data that we downloaded in episode 2
gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE)
## extract the `country` column from a data frame (we'll see this later);
## convert from a factor to a character;
## and get just the non-repeated elements
countries <- unique(as.character(gapminder$country))
```
There’s a wrong way (using only ==), which will give you a warning; a clunky way (using the logical operators == and |); and an elegant way (using %in%). See whether you can come up with all three and explain how they (don’t) work.
4. Given the following code:
```
m <- matrix(1:18, nrow=3, ncol=6)
print(m)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18
```
Which of the following commands will extract the values 11 and 14?
5. Given the following list:
```
xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris))
```
Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.
6. Given a linear model:
```
mod <- aov(pop ~ lifeExp, data=gapminder)
```
Extract the residual degrees of freedom (hint: `attributes()` will help you)
7. Fix each of the following common data frame subsetting errors:
a. Extract observations collected for the year 1957
`gapminder[gapminder$year = 1957,]`
b. Extract all columns except 1 through to 4
`gapminder[,-1:4]`
c. Extract the rows where the life expectancy is longer than 80 years
`gapminder[gapminder$lifeExp > 80]`
d. Extract the first row, and the fourth and fifth columns (lifeExp and gdpPercap).
`gapminder[1, 4, 5]`
e. Advanced: extract rows that contain information for the years 2002 and 2007
`gapminder[gapminder$year == 2002 | 2007,]`
Git
---
**Exercise 1**
1. Make a new file about another planet in the solar system. Add a fact or two, then commit that file to the repository as well.
2. Here in the hack pad, type one attribute of a good log message.
- Specify which changes were made to which document.
- Avoid general language that could apply to any document e.g. "fixed typos".
- Be as concise as possible while providing detail
- be descriptive about scope of changes
- Think about what will be good to know 6 months from now that may seem obvious today
- Be wary of spaces or incorrect spelling
- Be concise.
- If there are multiple files, include file name in description
**Exercise 2**
1. Delete the file about another planet that you created in our last exercise. (Remember, you can do so with the `rm` command.) What happens when you say `git status`? Then, recover the file with git checkout.
2. Go to http://github.com. If you have an account already, log in to it; if not, sign up for an account.
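
The delete-and-recover cycle from item 1 can be rehearsed in a scratch repository first. This is a sketch only: the file name `mars.txt`, its contents, and the throwaway identity passed with `-c` are all assumptions made for the illustration, and it requires `git` to be installed.

```shell
# Create a scratch repository with one committed file.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "Mars is dry and cold" > mars.txt
git add mars.txt
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add notes on Mars"

# Delete the file, then ask git what changed.
rm mars.txt
git status --short          # reports mars.txt as deleted in the working tree

# Restore the file from the last commit.
git checkout -- mars.txt
recovered=$(cat mars.txt)
echo "$recovered"           # prints: Mars is dry and cold
```
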
**Exercise 3**
On your GitHub project page, click "XXX Commits". Explore this interface for a few minutes; click on a commit, or on the buttons to the side of the commits. Think about the information this is giving you, and how you would get the same information from the git command line. Post anything interesting you find in the hack pad.
- The initial interface is similar to 'git log'
- clicking on a file is similar to 'git diff'
**Exercise 4**
1. Switch roles. If you were the Collaborator, take the role of the Owner and give your colleague access to your repo. If you are now the Collaborator, clone the repo, practice changing it, and push your changes back to GitHub.
2. Play with GitHub some more. In particular, have a look at the commit viewer. Add a comment to your colleague's commit. Why might comments like this be useful in a collaborative environment? If you discover anything cool, add it to the Etherpad.
**Exercise 5**
Reverse roles with your partner. The person who resolved the conflict, make a new commit and push it to GitHub. The other partner, make a conflicting commit, then pull from GitHub and resolve the conflict.