# Data Carpentry for Genomics
Duke University, December 18-19, 2019
This is a shared notes document for all in the course to view/edit.
Course website: https://duke-gcb.github.io/2019-12-18-duke/
## Project Organization and Management
Lesson: https://datacarpentry.org/organization-genomics/
- Most important collaborator for you is **your future self**
### Data Tidiness
What kinds of metadata are important?
- time point information
- data provenance
- phenotype, genotype
- treatment condition
- demographics
- *and more!*
Sequencing data is worthless without the metadata! **Ask yourself:** Are there metadata standards for the data I'm going to collect?
**The cardinal rules of using spreadsheet programs for data:**
- Leave the raw data raw - don’t change it!
- Data should be machine-readable
- Variables should be in **columns**
- Observations should be in **rows**
- An empty cell is not the same as "unknown" or "NA" - be careful of how you use spaces
- "SampleName" or "Sample-Name" vs "Sample Name"
- Note that capitalization matters to computers! "SampleName" is **not** the same as "samplename"
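For example, a tidy metadata sheet might look like this (hypothetical sample names and values):

| SampleName | Genotype | Treatment | TimePoint |
| ---------- | -------- | --------- | --------- |
| sample01   | WT       | control   | 0h        |
| sample02   | mutA     | drug      | 24h       |

Each variable has its own column and each sample its own row.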
### Planning for NGS Projects
Can you spot the problems in this [sample submission sheet](https://datacarpentry.org/organization-genomics/files/sample_submission.txt)?
- Inconsistencies
- Missing data
- Ambiguity
**Retrieving sample sequencing data from the facility**
- You will receive documentation (metadata) in addition to the sequencing data itself
- Provide tidy metadata when you submit samples as they will be included in what you get back from the sequencing core
**Storing data**
Storage guidelines:
- Data should be accessible to you, lab members, and the head of the lab
- Store data with redundant backups - at least two different physical locations
- **Leave the raw data raw**
### Examining Data on the NCBI SRA Database
The [Sequence Read Archive (SRA) database](https://www.ncbi.nlm.nih.gov/sra) houses publicly available short read sequencing data.
Use the **SraRunTable** feature to work with information about multiple samples at one time.
Tenaillon dataset: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605
- Click on the Run Number of the first entry in the bottom table on the page (third table down). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
- Go back to the ‘previous page’. At the top of the page and in the Total row you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the ‘RunInfo Table’ button and save the file locally.
**Note**: This is **not** the actual sequencing data - it is only the metadata saved to a tab-delimited file named SraRunTable.txt.
If you only want to download **one or two** sequencing files, it's easiest to go to the [**European Nucleotide Archive**](https://www.ebi.ac.uk/ena)
- In the search bar, type in SRR2589044. Make sure there are no spaces after the accession number, and press search.
- You will see a table with information about the sample. In the table, there is a header “FASTQ files (FTP)”. If you wanted to download the files to your computer, you could click on the links to download the files. Alternatively, right click and copy the URL to save it for later. We don’t need to download these files right now, and because they are large we won’t put them on our computers now.
If you want to download **multiple files** all at once, it is easiest to use the NCBI command-line software package [`sra-toolkit`](https://github.com/ncbi/sra-tools).
*Note: We will not be using this tool in this workshop.*
## Introduction to the Command Line
Shell: What is it? Why do we care?
- Allows reproducible processing of large amounts of data with relative ease.
Please edit this note (clicking the Pencil icon above) and claim a virtual machine from the list below by placing your name next to it:
ec2-3-94-131-81.compute-1.amazonaws.com (Dahlia Rohm)
ec2-54-81-4-35.compute-1.amazonaws.com (Rachel Shahan)
ec2-54-166-133-238.compute-1.amazonaws.com (Sanjay Khandelwal)
ec2-18-234-27-234.compute-1.amazonaws.com (Young J Yun)
ec2-54-88-25-97.compute-1.amazonaws.com (Azita Sadeghpour)
ec2-54-81-79-226.compute-1.amazonaws.com goomber Shelly
ec2-52-91-232-162.compute-1.amazonaws.com (Zhaohui Wang)
ec2-34-207-134-244.compute-1.amazonaws.com (Matthew Hirschey)
ec2-100-24-11-24.compute-1.amazonaws.com (Akshay Bareja)
ec2-34-226-204-158.compute-1.amazonaws.com (Ryan Rickels)
ec2-54-196-108-196.compute-1.amazonaws.com (Criss Guy)
ec2-52-91-110-228.compute-1.amazonaws.com (Julio Barrera)
ec2-3-93-72-49.compute-1.amazonaws.com (Julia Gamache)
ec2-54-162-228-14.compute-1.amazonaws.com (Kevin Friede)
ec2-3-89-70-68.compute-1.amazonaws.com (Dan Leehr)
ec2-174-129-30-52.compute-1.amazonaws.com (Hilmar Lapp)
ec2-3-89-152-175.compute-1.amazonaws.com (Matt Green)
ec2-3-87-250-169.compute-1.amazonaws.com (golshid sanati)
ec2-35-153-136-69.compute-1.amazonaws.com (Zachery Mielko)
ec2-54-204-180-228.compute-1.amazonaws.com (Lingyun)
ec2-34-235-112-16.compute-1.amazonaws.com goomber shelly
ec2-34-227-73-65.compute-1.amazonaws.com (Matt Detter)
ec2-54-161-116-43.compute-1.amazonaws.com (Lijun Ma)
ec2-3-84-215-148.compute-1.amazonaws.com (Zhiguo Sun)
ec2-35-173-183-115.compute-1.amazonaws.com (lihua wang)
ec2-35-153-134-106.compute-1.amazonaws.com (Val G)
ec2-3-82-14-182.compute-1.amazonaws.com (Tyler Allen)
ec2-54-210-185-236.compute-1.amazonaws.com (Rachel Meade)
### Login
```
ssh dcuser@addressname.com  # replace addressname.com with your machine name from above
```
### Introducing the Shell
`pwd` = "print working directory": tells you where you are in your system or what directory you are working in
`ls` = "list": shows all files in your current working directory
`cd <foldername>` = "change directory": moves you to a new directory
### Navigating Files & Directories
`cd ..` will take you to the directory within which your current working directory is stored (one level up)
Many commands accept options, also called flags, that can be used to get more precise output from the command.
`ls -l` will list more information about listed items, such as size, permissions, date of last modification
- When you use this command, directories will be identified by a `d` at the beginning of the permissions line.
`man command_of_interest` = "manual": prints the instruction manual for an individual command (VERY USEFUL)
The **Tab** key can be used to complete the names of files or directories as you type, which helps avoid typing errors.
If tab completion doesn't work after one press, try pressing **Tab twice**. This will list the possible completions of what you have already typed.
`ls -a` will reveal hidden directories, the names of which always begin with a `.`
`ls directory` will list the contents of the directory (even if it is not your current working directory)
Every file and directory has a **full pathname** (also known as absolute pathname), beginning with the root directory and ending with the file name. This always begins with a slash. (e.g. `/home/usr/file`)
However, **relative pathnames** reduce the amount of typing necessary while offering flexibility in file location, especially when the file is referenced in a script. This path is relative to the current working directory. (e.g. `../file`)
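For example, these commands refer to the same file (a minimal sketch assuming the workshop's `~/shell_data` layout):
```bash
ls /home/dcuser/shell_data/untrimmed_fastq/SRR098026.fastq   # absolute path
cd /home/dcuser/shell_data
ls untrimmed_fastq/SRR098026.fastq                           # relative path
```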
### Working with Files & Directories
**Wildcards** (`*`) can be used to reference multiple files at once.
Wildcards can reference any number of characters and can be used anywhere in the file reference.
Given a directory with files `dog`, `domestic_cat`, `wild_cat`:
`ls do*` will give me `dog domestic_cat`
`ls *cat` will give me `domestic_cat wild_cat`
`ls *tic*` will give me `domestic_cat`
`ls w*cat` will give me `wild_cat`
`ls *[gm]*` will give me `dog domestic_cat`, any file that contains "g" **OR** "m"
`echo` will print its arguments as a line of text. This is a great defensive coding tactic: check what a wildcard matches before executing another command on the group of files, such as deleting them.
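For instance, preview what a wildcard matches before running a destructive command on it (a sketch):
```bash
echo *.fastq   # prints the filenames the wildcard expands to, without touching them
# only after confirming the list would you run something like: rm *.fastq
```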
#### Command History
`ctrl+C` will cancel a command
`history` will print an indexed list of all previously executed commands. To re-run one of these commands by number, simply type `!` followed by its index (e.g. `!42`), and the referenced command will execute.
`ctrl+R` will do a reverse search through your command history. Very useful if you don't want to spend time bashing the up arrow key (which will direct you to the commands you have already executed from most to least recent).
`ctrl+L` or `clear` will clear your terminal or bash screen
#### Examining Files
`cat <filename>` will print the contents of a file; bad for large files
`less <filename>` will give you a peek at the file in a separate window; great for reading files
Within `less`, if you type `/` then a search term then **Enter**, it will search the file for your term.
- Repeating `/`+**Enter** will take you to the next occurrence of your search term.
- `man` pages use `less`, so these search tactics can be used there too.
`head <filename>` prints the top of a file of interest.
`tail <filename>` prints the bottom of a file of interest
Adding the `-n #` argument after `head` or `tail` will allow you to specify how many lines you want printed.
For example:
`head -n 4 SRR098026.fastq` will print the top 4 lines of the fastq file, or one read entry.
#### FASTQ Format
Each read entry is 4 lines and is organized as such:
1. Always begins with `@` and then information about the read
2. The actual DNA sequence
3. Always begins with a `+` and sometimes the same info in line 1
4. Has a string of characters which represent the quality scores; must have same number of characters as line 2
On line 2, an `N` denotes an unknown base.
On line 4, quality scores are denoted by an established system shown below, where `!` is the lowest quality (0) and `K` is the highest (42).
```
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
| | | | |
Quality score: 0........10........20........30........40..
```
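As a quick sanity check (a sketch using the shell's `printf` trick for getting a character's ASCII code), a quality character's Phred score is its ASCII value minus 33:
```bash
char="J"
ascii=$(printf '%d' "'$char")   # ASCII code of the character (74 for 'J')
echo $(( ascii - 33 ))          # Phred quality score: 41
```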
#### Creating, Moving, Copying, and Removing
`cp <filename> <copyname>` can be used to copy a file
example:
`cp SRR098026.fastq SRR098026-copy.fastq`
**Make a new directory** using the command `mkdir`
example:
`mkdir backup` produces a directory named "backup"
**Move your copy of the fastq file to your backup folder** using the `mv` command
`mv <filename> <new location name>`
example:
`mv SRR098026-copy.fastq backup/`
**The `mv` command can also be used to rename files**
`mv <filename> <new filename>`
example:
`mv original.txt new.txt`
`mv SRR098026.fastq SRR098026-rename.fastq`
And we can move and rename at the same time.
example:
`mv SRR098026-rename.fastq backup/SRR098026-rename2.fastq`
### File Permissions
It's important to set your permissions so that you don't accidentally overwrite your files.
Type the command `ls -l` to see the permissions for each of your files.
```
-rw-r--r-- 1 dcuser dcuser 43332 Nov 15 23:02 SRR098026-backup.fastq
```
r = read permission
w = write permission
x = execute permission
To change permissions of a file, use the command `chmod`
example:
```
chmod -w SRR098026-copy.fastq
```
removes permission to write to the file.
To give yourself permission to write to the file again:
```
chmod +w SRR098026-copy.fastq
```
Permissions can also be updated for just the owner (`u`), the group (`g`), or others (`o`). For example, to give only the file's owner write permission:
```
chmod u+w SRR098026-copy.fastq
```
### Removing Files
**Be careful when using this command, as it will *permanently* delete the file (there is NO "recycle bin" in terminal!).**
`rm <filename>`
To remove a directory, add the `-r` flag, which stands for "recursive". `rm` will then *recursively* delete the directory and all of the files and directories inside of it:
`rm -r <directory>`
One way to protect yourself from accidentally deleting data is to **remove write permissions**. This also prevents you from accidentally modifying your raw data. If you try to delete or edit the file or directory, it will tell you that permission is denied.
- This is an especially good idea for backup directories!
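For example (a sketch with a hypothetical file name), `rm` asks for confirmation before deleting a write-protected file:
```bash
chmod -w precious_data.fastq
rm precious_data.fastq   # rm prompts before removing the write-protected file; answer n to keep it
```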
**Exercise**
- Make a copy of each of your fastq files: `cp SRR098026.fastq SRR098026-backup.fastq` and `cp SRR097977.fastq SRR097977-backup.fastq`
- Make a new "backup" directory (make sure you've deleted your old one if you haven't already) using the command `mkdir backup`
- Use the wildcard character to move all "copy" files at once `mv SRR*-backup.fastq backup/`
- Remove write permissions to keep your backup data protected `chmod -w backup/*-backup.fastq`
### Searching in files using grep
Grep is a command line tool that lets you search in files and directories easily. For this section, you want to be in the `untrimmed_fastq` directory: `cd ~/shell_data/untrimmed_fastq`
To search in a file using grep, use the format:
`grep <search term> <filename>`
example:
`grep NNNNNNNNNN SRR098026.fastq` to search for all instances of 'NNNNNNNNNN' in SRR098026.fastq.
You can make your grep search more useful by also returning the lines before and after the instances of your search term:
- `-B1` will also return the line **before** your search term
- `-A2` will also return the two lines **after** your search term
example: `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq` will give you output that looks like this:
```
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
### Redirecting output
Using Grep to search our files is useful, but we often will also want to save the results of these searches. Redirection is an easy way to do this.
The command for redirecting output to a file is `>`
example:
```
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
```
This will print the output of our search to a file named "bad_reads.txt". Note that this does **not** alter the original file in any way, it just copies the search query result.
- Another important note is that each time you use `>` to write to the same file name, it will **overwrite** the original file. To **append** more information to an existing file, modify your redirect command by using `>>` instead of a single `>` arrow.
- Example: `grep NNN file.fastq >> output.txt` will *add to* the `output.txt` file instead of overwriting it.
**Note** that in the UNIX command line, the file extension doesn't matter. The file could end in `.txt` or `.fastq`, but the command line won't know the difference\*. Tools like `grep` or `less` or `cat` will all try to use the files in the same way.
\* *This is not true of all command line programs, as some more specialized tools may require specific input formats, but this is true of the tools we will be using for this section of the course.*
The **Word Count** tool is used by entering the command `wc <file.txt>`
example:
```
wc bad_reads.txt
```
This will return the output: `537 1073 23217 bad_reads.txt`. Each number corresponds to the number of lines, words and characters in the file.
Frequently, we will only want to get one of these numbers, such as the number of lines. To get the number of lines, add the `-l` argument:
```
wc -l bad_reads.txt
```
Which returns:
`537 bad_reads.txt`
So we can see that there are 537 lines in the file bad_reads.txt.
Giving your output files a different extension than your data files can be a good idea so you don't accidentally try to query your output files. For example, if you had named your output `bad_reads.fastq` instead of `bad_reads.txt`, this command:
`grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq`
would give you the error message:
`grep: input file ‘bad_reads.fastq’ is also the output`
**Piping** information is done using the `|` (pipe) character. This redirects your output to another program instead of printing it to your screen or to a file.
example:
```
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
```
This command will send all the lines before/after the search term "NNNNNNNNNN" to `less` for viewing.
Another example:
```
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
```
This counts the number of lines returned by your grep search without having to make an intermediate file.
### For Loops
For loops are useful for performing the same operation on multiple different files.
**Variables** make it easy to use for loops. To assign a variable, use the format (note: there must be no spaces around the `=`):
`<variable name>=<value>`
example:
`foo=abc`
To use this variable, add `$` before the variable name.
For example, `echo $foo` will print:
`abc`
to your screen.
**Variables** can be embedded in other information/text, but this has to be done in a specific way - by adding `{` and `}` (also known as "curly brackets") around your variable name so the shell knows where the name ends.
`echo foo is $fooHello`
**won't work!** The shell looks for a variable named `fooHello`, which is unset, so you'll get back `foo is `
`echo foo is ${foo}Hello`
does work:
`foo is abcHello`
To make a for loop using variables, use the format:
`for <variable> in <group to iterate over> ; do <commands> ; done`
To write your for loop over multiple lines, you can also just hit "enter", and the terminal prompt will switch to `>` so you know it's still expecting more input.
```bash
$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done
```
Once you've entered `done` and hit enter, the terminal will know that you're ready to run the loop.
You can also redirect the output to a new file each time you iterate through the loop:
```bash
$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done
```
**NOTE:** You need to use the append (`>>`) redirect instead of `>` inside for loops, as `>` will overwrite the file each time you go through the loop!
**Basename** is a command that removes a uniform part of a name (such as the file extension) from a list of files.
Use the format: `basename <filename> <part to remove>`
`basename SRR097977.fastq .fastq`
will return: `SRR097977`
`basename SRR097977.fastq astq`
will return: `SRR097977.f`
To use basename in a for loop:
```bash
$ for file in *.fastq
> do
> name=$(basename ${file} .fastq)
> echo ${name}
> done
```
Will output just the accession number prefix of all the .fastq files:
```
SRR097977
SRR098026
```
### Writing Scripts
Scripts are used to combine multiple commands into a series that can be run at once. It is essentially a text file with commands in it.
Command line text editors are used to write scripts. We will use `nano`.
`nano <filename>` will create a file and open a text editor window in which you can write and modify code.
`ctrl+O` will save the file. Hit **Enter** to verify the file name.
`ctrl+X` can be used to exit the file. Exiting without saving will result in a prompt to save.
`bash <scriptname>` can be used to run the written script.
To run the script without the `bash` command, you must change the script permissions to be **executable** (denoted by `x`).
To accomplish this:
`chmod +x <scriptname>` will add executing privileges to all users.
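Once a script is executable, it can be run directly by prefixing it with `./` (a sketch with a hypothetical script name):
```bash
chmod +x my_script.sh
./my_script.sh
```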
### Moving & Downloading Data
`wget <webaddress>` stands for "world wide web get". This **downloads** data from a web address.
`curl <webaddress>`, or "see URL", **displays** content from a web address.
Neither tool is guaranteed to be installed, so use `which curl` or `which wget` to check which ones are available in your shell.
`curl -O <webaddress>` will also download data should `wget` be unavailable.
`wget` will download the file of interest to your current working directory.
To move data from a local server to an Amazon instance:
`scp <filename> <dcuser@....com>:/home/dcuser/<pathname>`
To download data from the Amazon instance:
For Windows PSCP:
```
C:\User\your-pc-username\Downloads> pscp.exe dcuser@ec2-54-88-126-85.compute-1.amazonaws.com:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt C:\User\your-pc-username\Downloads
```
For UNIX:
```
scp dcuser@ip.address:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt ~/Downloads
```
### Project Organization
Well-organized projects support collaboration with your future self! Bad labeling now will make looking at data in the future much more challenging and time-consuming.
`mkdir` = "make directory": use to generate directories
`ls -R` lists all contents of a directory and all contents of directories contained within the directory.
#### Organizing Your Files
* Keep original raw data (without write permissions) quarantined in its own directory. Only modify copies of raw data.
* Store results in a separate results folder.
* Use a docs folder to store written analyses.
#### Documenting Your Activity on the Project
Using the `history` command we learned earlier, you can easily store a log of the modifications you have made to your data as a text file for later verification.
For example, if you wanted to append a number of actions you just performed on your file to a running log of actions, you could execute:
`history | tail -n # >> activity_log.txt`
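For example (a sketch; `5` is just an example count):
```bash
history | tail -n 5 >> activity_log.txt   # append your last 5 commands to the running log
```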
## Data Wrangling and Processing
Command line tools are useful to prepare genomic data for analysis. The process of preparing a dataset for experimental analysis is known as **data wrangling**.
We will be processing public access genomic data using a very common workflow that is broadly applicable to many types of sequence analysis.
We will wrangle and process a multi-generational dataset of *E. coli* from Richard Lenski's lab.
The metadata for this experiment includes:
* strain (strain name)
* generation (generation frozen)
* clade (based on parsimony-based tree)
* reference (study the samples were originally seq'd for)
* population (ancestral population group)
* mutator (hypermutability mutant status)
* facility (facility samples were sequenced at)
* run (sequence read archive sample ID)
* read_type (library type of reads)
* read_length (length of reads in sample)
* sequencing_depth (depth of sequencing)
* cit (citrate-using mutant status)
You can use Excel filtering tools to answer basic questions about the dataset. However, after saving a separate copy of the original raw data, any heavy-duty modifications should be completed in command line, due to the better automation & scalability.
### Bioinformatic Workflows
A series of bioinformatic tools used to process a dataset is collectively referred to as a **pipeline**.
Standardized workflows help to keep data analysis consistent throughout a project.
Today, we will use FastQC to assess read quality in the fastq file.
#### Starting with Data
In order to download the files for today, open Terminal and log in to your AMI.
```
ssh dcuser@<yourAMI>
password: data4Carp
```
Use `who` to check that you are the only one using your machine.
To download the files for today:
```bash
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
```
**NOTE**: If you are ever stuck and have no prompt line to type on, use `ctrl+C` to cancel any command that the computer is currently processing.
`cp <filename> <copyname>` will help you to maintain the integrity of your original data. Before the names, you can also include pathnames to specify where you want the file to come from or the copy to go to.
Use `ls -lh` to check whether all of your files downloaded properly.
**NOTE**: With commands like `ls`, flags can be combined. `-l` (long-form list) and `-h` (human-readable file sizes) are separate flags being executed at once by `-lh`.
### Quality Control
#### Details on the FASTQ Format
As mentioned yesterday, the [fastq format](https://en.wikipedia.org/wiki/FASTQ_format) is as follows for each read:
1. Always begins with `@` and then information about the read
2. The actual DNA sequence
3. Always begins with a `+` and sometimes the same info in line 1
4. Has a string of characters which represent the quality scores; must have same number of characters as line 2
On line 2, an `N` denotes an unknown base.
On line 4, quality scores are denoted by an established system shown below, where `!` is the lowest quality (0) and `K` is the highest (42).
```
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
| | | | |
Quality score: 0........10........20........30........40..
```
Quality for each read tends to be higher at the beginning of the read and lower at the end.
Use `head -n 4 SRR2584863_1.fastq` to check that your file is properly formatted.
#### Assessing Quality using FastQC
FastQC gives a boxplot of read quality at each base position, aggregated across all reads, visualizing the quality scores from line 4 of each fastq entry within ranges of good (green), acceptable (yellow), and bad (red).
#### Running FastQC
FastQC accepts wildcards and gzipped files (files that end in `.gz`, which are compressed forms of the full file).
`fastqc *.fastq*` will process all fastq files in the current directory. Nice and easy!
FastQC generates a `.zip` and `.html` file. The html file is a stable webpage containing a summary report for the run, and the zip file is a compressed file containing multiple output subfiles.
Make sure to move your results files away from your raw data for further analysis by making a directory specifically for your results files (`mkdir <directory>`) and moving your files there (`mv <file(s)> <directory>`).
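A sketch of those two steps, assuming the `~/dc_workshop` layout used elsewhere in these notes:
```bash
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
mv *.zip *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
```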
#### Viewing the FastQC Results
`open SRR2584863_1_fastqc.html` would open a window on our computer if we were not working on a VM.
If we transfer this file to our computer, we can open it in our own browser.
To download the directory containing the files, open a local instance of Terminal or GitBash:
```bash
$ mkdir -p ~/Desktop/fastqc_html
$ scp dcuser@<yourAMI>:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html
```
We are effectively telling the computer to grab the directory from the location you specified on your AMI and copy it to the local location you provided (`~/Desktop/fastqc_html/`).
Now, this should open the file locally:
```bash
$ cd ~/Desktop/fastqc_html/
$ open *.html
```
#### Decoding FastQC Outputs
In order to look at the output files, we must unzip them. Because `unzip` only accepts one file at a time, we can unzip all of the `.zip` files at once using a for loop:
```bash
$ for filename in *.zip
> do
> unzip $filename
> done
```
Once the files are unzipped, `ls -F` tells us whether any of the items in the outputs are directories.
In each of the output directories, there is a number of files, including a `summary.txt` file.
The`summary.txt` will tell you which sequences `PASS` or `FAIL`.
#### Documenting Our Work
Combining our `summary.txt` files will make a record of all obtained results. This can be accomplished by concatenating:
`cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt`
`grep` can be used to pull any line that contains `FAIL` and can further be used to remove those lines from the analysis during data tidying.
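For example (a sketch using the combined summary file created above):
```bash
grep FAIL ~/dc_workshop/docs/fastqc_summaries.txt   # list every failed QC check across all samples
```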
Before you remove any data, always make sure the QC is thresholding data the way you want it to.
### Trimming and Filtering
We will use the program **trimmomatic** to filter our reads.
Trimmomatic takes many input parameters so it processes your data the correct way. Using the `\` character allows you to continue constructing your command on a new line in your terminal window, which makes it much easier to read and understand the commands you are giving.
- See the main tutorial website for the list of arguments: https://datacarpentry.org/wrangling-genomics/03-trimming/index.html
An example for running trimmomatic (not the command we will use):
```bash
trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20
```
`PE` specifies that it will be taking paired-end files as input
`-threads 4` to use four computing threads to run (this will speed up our run)
`SRR_1056_1.fastq` the first input file name
`SRR_1056_2.fastq` the second input file name
`SRR_1056_1.trimmed.fastq` the output file for surviving pairs from the `_1` file
`SRR_1056_1un.trimmed.fastq` the output file for orphaned reads from the `_1` file
`SRR_1056_2.trimmed.fastq` the output file for surviving pairs from the `_2` file
`SRR_1056_2un.trimmed.fastq` the output file for orphaned reads from the `_2` file
`ILLUMINACLIP:SRR_adapters.fa` to clip the Illumina adapters from the input file using the adapter sequences listed in `SRR_adapters.fa`
`SLIDINGWINDOW:4:20` to use a sliding window of size 4 that will remove bases if their phred score is below 20
This command is large and complicated. We will use a for loop to do the same thing for each file without having to type the command in each time.
*Note: run `gzip SRR2584863_1.fastq` to compress this file again before running the code below.*
```bash
$ for infile in *_1.fastq.gz
> do
> base=$(basename ${infile} _1.fastq.gz)
> trimmomatic PE ${infile} ${base}_2.fastq.gz \
> ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
> ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
> SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
> done
```
We can see how useful the `basename` functionality is in cases like this!
Now we have our original files and our trimmed files:
```
NexteraPE-PE.fa SRR2584866_1.fastq.gz SRR2589044_1.trim.fastq.gz
SRR2584863_1.fastq.gz SRR2584866_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz
SRR2584863_1.trim.fastq.gz SRR2584866_1un.trim.fastq.gz SRR2589044_2.fastq.gz
SRR2584863_1un.trim.fastq.gz SRR2584866_2.fastq.gz SRR2589044_2.trim.fastq.gz
SRR2584863_2.fastq.gz SRR2584866_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz
SRR2584863_2.trim.fastq.gz SRR2584866_2un.trim.fastq.gz
SRR2584863_2un.trim.fastq.gz SRR2589044_1.fastq.gz
```
We want to move the trimmed files to a new directory so in the future, we know that they were in fact trimmed.
``` bash
$ cd ~/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ mv *.trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls #this will show you the trimmed reads in the new directory
```
## Variant Calling Workflow
Before we can call variants in our sample, we must first align the samples to a reference genome. We will use a common aligner: [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/) using the `bwa` command.
Steps to align to genome:
1. index the reference genome
2. align sequencing reads to reference
We are using the command `curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz` to download the reference.
- note that the `-L` flag is useful here with curl because it allows for redirection (if the website says the data is not exactly where you specified, curl will still download the data from the correct location).
Next, unzip the reference:
`gunzip data/ref_genome/ecoli_rel606.fasta.gz`
We are also downloading a shorter set of trimmed fastq files for this workshop:
``` bash
$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/dc_workshop/data/trimmed_fastq_small
```
`tar xvf` is another command used to extract from compressed files/directories (`x` = extract, `v` = verbose, `f` = read from the named file).
We will make some new directories to keep our data and results organized:
`$ mkdir -p results/sam results/bam results/bcf results/vcf`
### Index the reference
Now we will perform the first step: index the reference genome (note that this only has to be done once).
```bash
bwa index data/ref_genome/ecoli_rel606.fasta
```
This should run quickly (~2 seconds)
### Align reads to reference genome
Example command to run this alignment:
``` bash
bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
```
By default, `bwa` only prints results to the screen. To save the results, we redirect them to an output file such as `output.sam`.
- note that we are redirecting the results from "standard output" (stdout), while the progress is being printed to "standard error" (stderr) on your screen.
- [read more about standard error vs standard out here](https://en.wikipedia.org/wiki/Standard_streams)
We use the following command to align one sample (SRR2584866):
```bash
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
```
**SAM/BAM format**
SAM files (and the compressed binary version of a SAM file, called a BAM file) are the standard way to store aligned sequencing data.
We will convert our SAM file to BAM format using a program called `samtools`.
``` bash
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
```
- The BAM file will be smaller than the original SAM file
To do the variant calling efficiently, we also need to **sort** the file by coordinate.
```bash
$ samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
```
- *SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.*
`samtools` is a program with lots of functionality. You can also use it to see some statistics about your BAM file with the `flagstat` option:
`samtools flagstat results/bam/SRR2584866.aligned.sorted.bam`
Output:
```
351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```
### Perform the Variant Calling
For this workshop, we will use bcftools to make variant calls. https://samtools.github.io/bcftools/howtos/index.html
Step 1: Calculate the read coverage of positions in the genome using the mpileup command in bcftools
```bash
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
```
Step 2: Perform the variant calling using the call command in bcftools
```bash
# Old version in notes: bcftools call --ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
# Updated command to use
bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
```
Step 3: Filter and report the variants using the varFilter command in vcfutils
```bash
vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
```
Use `less -S results/vcf/SRR2584866_final_variants.vcf` to open the final variants file.
### Shell Scripts to Automate a Variant Calling Workflow
Variables are assigned using the `=` operator, with no spaces around it. Ex: `myVarName='hello'`
They can be recalled using the `$` in front of the variable name. Ex: `$myVarName`
Use `touch read_qc.sh` to generate an empty file named read_qc.sh. Then `nano read_qc.sh` to open the file.
Script contents:
```bash
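# exit the script immediately if any command below fails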
set -e
cd ~/dc_workshop/data/untrimmed_fastq/
echo "Running FastQC..."
fastqc *.fastq*
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
echo "Saving FastQC results..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
cd ~/dc_workshop/results/fastqc_untrimmed_reads/
echo "Unzipping..."
for filename in *.zip
do
unzip $filename
done
echo "Saving summary..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
```
Save the file with Ctrl-O and then Ctrl-X to exit.
To run the script use `bash read_qc.sh`.
### Automating the rest of the workflow
First, use `cd ~/dc_workshop/scripts` to go to the scripts folder
Then use
```bash
curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh
```
to download the example script. To access the script use:
```bash
cd ~/dc_workshop/scripts
less run_variant_calling.sh
```
Steps performed in the script:
1. Align the reads to a reference genome and output a sam file with bwa
   - `bwa mem $genome $fq1 $fq2 > $sam`
2. Convert the sam file to a bam file with samtools
   - `samtools view -S -b $sam > $bam`
3. Sort the bam file with samtools
   - `samtools sort -o $sorted_bam $bam`
4. Index the bam file with samtools
   - `samtools index $sorted_bam`
5. Calculate the read coverage with bcftools
   - `bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam`
6. Call SNPs with bcftools
   - `bcftools call --ploidy 1 -m -v -o $variants $raw_bcf`
7. Filter and report the SNP variants in a vcf format with vcfutils
   - `vcfutils.pl varFilter $variants > $final_variants`
To run the script use `bash run_variant_calling.sh`
### Cloud Computing for Genomics
Advantages:
- more resources (computing power) available than on a local machine
- full administrative rights - install anything
- use pre-configured “images” (machine snapshot where operating system and software are already installed)
- everyone in your group can access a server with the same tools and configurations available
- long-running analyses will continue to run on the server even when your computer is turned off
- local operating system doesn't matter (Windows, Linux, MacOS) - once you're on the server, it will be unix-based
Disadvantages:
- difficult to visualize files while they are on the server
- takes time to move data and files to and from the server
- costs money to run
- no local systems administrator to help if there are problems
**Verify your environment**
`whoami` shows your username on the computer you have connected to
`df -h` shows available space on the hard drive
`free -h` shows how much memory is available and in use on your cloud server
`cat /proc/cpuinfo` shows detailed information on how many processors (CPUs) the machine has
`tree -L 1` shows a tree view of the file system 1 level below your current location
**Staying Connected to the Cloud**
We will work through how to use `tmux`, but `screen` is another option.
To start a new session, use the command `tmux new -s session_name`
- be sure to give your session a more descriptive name than "session_name"!
Detach a session (the process keeps running in the background) by pressing `ctrl+b`, then typing `d` to detach the session.
`tmux list-sessions` shows you all active sessions running
After detaching, you can get back to that session using the command: `tmux attach -t session_name`
- if you only have one session, you can just enter `tmux attach` and it will auto-connect you to that session
You can switch sessions using `tmux switch -t session_name`
And you can end a session using `tmux kill-session -t session_name`
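A typical `tmux` workflow might look like this (a sketch with a hypothetical session name):
```bash
tmux new -s alignment      # start a named session and launch your long-running job inside it
# press ctrl+b, then d, to detach while the job keeps running
tmux attach -t alignment   # reattach later to check on the job
```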
See course page for list of cloud computing options: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html
**Installing on cloud server**
`yum` (for Red Hat and CentOS instances) and `apt` (for Debian or Ubuntu instances) are two package manager programs that can be used to install software on the cloud server.
For example, if we didn't already have `tmux` installed, we could run the command `apt search tmux` to see if it is available to be installed. To install `tmux`, run the command `apt install tmux`.
*Note there is also a program called `apt-get` that is nearly the same as `apt`. They have the same functionality most of the time.*
Before installing with `apt`, you should check for and apply updates:
`sudo apt upgrade`
- `sudo` stands for "super user do", which runs the command from the administrator account (read more about sudo [here](https://en.wikipedia.org/wiki/Sudo))
- note: in this Amazon webserver instance, we do not have `sudo` permissions, so we cannot perform this command right now
Duke's Virtual Computing Manager can be found by logging in to **vcm.duke.edu** with your Duke NetID and password.
Job scheduling on computing clusters is done at Duke using `Slurm`. Read more about this workload manager here: https://slurm.schedmd.com/