Data Carpentry for Genomics

Duke University, December 18-19, 2019

This is a shared notes document for all in the course to view/edit.

Course website: https://duke-gcb.github.io/2019-12-18-duke/

Project Organization and Management

Lesson: https://datacarpentry.org/organization-genomics/

  • Most important collaborator for you is your future self

Data Tidiness

What kinds of metadata are important?

  • time point information
  • data provenance
  • phenotype, genotype
  • treatment condition
  • demographics
  • and more!

Sequencing data is worthless without the metadata! Ask yourself: Are there metadata standards for the data I'm going to collect?

The cardinal rules of using spreadsheet programs for data:

  • Leave the raw data raw - don’t change it!
  • Data should be machine-readable
  • Variables should be in columns
  • Observations should be in rows
  • An empty cell is not the same as "unknown" or "NA" - be careful of how you use spaces
    • "SampleName" or "Sample-Name" vs "Sample Name"
    • Note that capitalization matters to computers! "SampleName" is not the same as "samplename"

Planning for NGS Projects

Can you spot the problems in this sample submission sheet?

  • Inconsistencies
  • Missing data
  • Ambiguity

Retrieving sample sequencing data from the facility

  • You will receive documentation (metadata) in addition to the sequencing data itself
  • Provide tidy metadata when you submit samples as they will be included in what you get back from the sequencing core

Storing data

Storage guidelines:

  • Data should be accessible to you, lab members, and the head of the lab
  • Store data with redundant backups - at least two different physical locations
  • Leave the raw data raw

Examining Data on the NCBI SRA Database

The Sequence Read Archive (SRA) database houses publicly available short read sequencing data.

Use the SraRunTable feature to work with information about multiple samples at one time.

Tenaillon dataset: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605

  • Click on the Run Number of the first entry in the bottom table on the page (third table down). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.

  • Go back to the ‘previous page’. At the top of the page and in the Total row you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the ‘RunInfo Table’ button and save the file locally.

Note: This is not the actual sequencing data - it is only the metadata saved to a tab-delimited file named SraRunTable.txt.

If you only want to download one or two sequencing files, it's easiest to go to the European Nucleotide Archive

  • In the search bar, type in SRR2589044. Make sure there are no spaces after the accession number, and press search.
  • You will see a table with information about the sample. In the table, there is a header “FASTQ files (FTP)”. If you wanted to download the files to your computer, you could click on the links to download the files. Alternatively, right click and copy the URL to save it for later. We don’t need to download these files right now, and because they are large we won’t put them on our computers now.

If you want to download multiple files all at once, it is easiest to use the NCBI software package called the sra-toolkit, which is a command line tool.

Note: We will not be using this tool for this workshop.

Introduction to the Command Line

Shell: What is it? Why do we care?

  • Allows reproducible processing of large amounts of data with relative ease.

Please edit this note (clicking the Pencil icon above) and claim a virtual machine from the list below by placing your name next to it:

ec2-3-94-131-81.compute-1.amazonaws.com (Dahlia Rohm)
ec2-54-81-4-35.compute-1.amazonaws.com (Rachel Shahan)
ec2-54-166-133-238.compute-1.amazonaws.com (Sanjay Khandelwal)
ec2-18-234-27-234.compute-1.amazonaws.com (Young J Yun)
ec2-54-88-25-97.compute-1.amazonaws.com (Azita Sadeghpour)
ec2-54-81-79-226.compute-1.amazonaws.com goomber Shelly
ec2-52-91-232-162.compute-1.amazonaws.com (Zhaohui Wang)
ec2-34-207-134-244.compute-1.amazonaws.com (Matthew Hirschey)
ec2-100-24-11-24.compute-1.amazonaws.com (Akshay Bareja)
ec2-34-226-204-158.compute-1.amazonaws.com (Ryan Rickels)
ec2-54-196-108-196.compute-1.amazonaws.com (Criss Guy)
ec2-52-91-110-228.compute-1.amazonaws.com (Julio Barrera)
ec2-3-93-72-49.compute-1.amazonaws.com (Julia Gamache)
ec2-54-162-228-14.compute-1.amazonaws.com (Kevin Friede)
ec2-3-89-70-68.compute-1.amazonaws.com (Dan Leehr)
ec2-174-129-30-52.compute-1.amazonaws.com (Hilmar Lapp)
ec2-3-89-152-175.compute-1.amazonaws.com (Matt Green)
ec2-3-87-250-169.compute-1.amazonaws.com (golshid sanati)
ec2-35-153-136-69.compute-1.amazonaws.com (Zachery Mielko)
ec2-54-204-180-228.compute-1.amazonaws.com (Lingyun)
ec2-34-235-112-16.compute-1.amazonaws.com goomber shelly
ec2-34-227-73-65.compute-1.amazonaws.com (Matt Detter)
ec2-54-161-116-43.compute-1.amazonaws.com (Lijun Ma)
ec2-3-84-215-148.compute-1.amazonaws.com (Zhiguo Sun)
ec2-35-173-183-115.compute-1.amazonaws.com (lihua wang)
ec2-35-153-134-106.compute-1.amazonaws.com (Val G)
ec2-3-82-14-182.compute-1.amazonaws.com (Tyler Allen)
ec2-54-210-185-236.compute-1.amazonaws.com (Rachel Meade)

Login

ssh dcuser@addressname.com # replace addressname.com by your machine name from above

Introducing the Shell

pwd = "print working directory": tells you where you are in your system or what directory you are working in

ls = "list": shows all files in your current working directory

cd foldername = "change directory": moves you to a new directory

cd .. will take you to the directory within which your current working directory is stored (one level up)

Many commands take options, or flags, that can be used to get more precise outputs from the command.

ls -l will list more information about listed items, such as size, permissions, date of last modification

  • When you use this command, directories will be identified by a d at the beginning of the permissions line.

man command_of_interest = "manual": prints the instruction manual for an individual command (VERY USEFUL)

The Tab key can be used to complete the names of files or directories as you type. This helps limit typing errors while coding.

If tab complete doesn't work after one tap, try double tapping. This will provide a list of possible completions to what you have already typed.

ls -a will reveal hidden directories, the names of which always begin with a .

ls directory will list the contents of the directory (even if it is not your current working directory)

Every file and directory has a full pathname (also known as absolute pathname), beginning with the root directory and ending with the file name. This always begins with a slash. (e.g. /home/usr/file)

However, relative pathnames reduce the amount of typing necessary while offering flexibility in file location, especially when the file is referenced in a script. This path is relative to the current working directory. (e.g. ../file)
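To make the distinction concrete, here is a small sketch in a throwaway directory (created with mktemp -d; the shell_data names inside are just stand-ins):

```shell
# Sketch: absolute vs. relative paths, in a throwaway directory.
base=$(mktemp -d)
mkdir -p "$base/shell_data/untrimmed_fastq"
cd "$base/shell_data"      # absolute path: starts at the root /
cd untrimmed_fastq         # relative path: resolved from the current directory
cd ../..                   # relative path: two levels up, back to $base
```

Incidentally, cd with no arguments always returns you to your home directory, no matter where you are.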

Working with Files & Directories

Wildcards (*) can be used to reference multiple files at once.

Wildcards can reference any number of characters and can be used anywhere in the file reference.

Given a directory with files dog, domestic_cat, wild_cat:
ls do* will give me dog domestic_cat
ls *cat will give me domestic_cat wild_cat
ls *tic* will give me domestic_cat
ls w*cat will give me wild_cat
ls *[gm]* will give me dog domestic_cat, any file that contains "g" OR "m"

echo will print its arguments as a line of text. This is a great defensive coding tactic: check what your wildcard matches before executing another command on the group of files, such as deleting them.
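A minimal sketch of this tactic, using made-up file names in a scratch directory:

```shell
# Preview a wildcard with echo before handing it to rm.
scratch=$(mktemp -d) && cd "$scratch"
touch dog domestic_cat wild_cat
echo *cat        # prints: domestic_cat wild_cat  (exactly what rm would delete)
rm *cat          # the preview matched expectations, so deleting is safe
ls               # prints: dog
```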

Command History

ctrl+C will cancel a command

history will print an indexed list of all previously executed commands. To re-run one of these commands by its number, simply type !<number> (for example, !42), and the referenced command will execute.

ctrl+R will do a reverse search through your command history. Very useful if you don't want to spend time bashing the up arrow key (which will direct you to the commands you have already executed from most to least recent).

ctrl+L or clear will clear your terminal or bash screen

Examining Files

cat <filename> will print the contents of a file; bad for large files

less <filename> will let you page through the file one screen at a time; great for reading large files

Within less, if you type / then a search term then Enter, it will search the file for your term.

  • Repeating /+Enter will take you to the next occurrence of your search term.
  • man pages use less, so these search tactics can be used there too.

head <filename> prints the top of a file of interest.
tail <filename> prints the bottom of a file of interest

Adding the -n <number> argument after head or tail will allow you to specify how many lines you want printed.

For example:
head -n 4 SRR098026.fastq will print the top 4 lines of the fastq file, or one read entry.

FASTQ Format

Each read entry is 4 lines and is organized as such:

  1. Always begins with @ and then information about the read
  2. The actual DNA sequence
  3. Always begins with a + and sometimes the same information as line 1
  4. Has a string of characters which represent the quality scores; must have same number of characters as line 2

On line 2, an N denotes an unknown base.

On line 4, quality scores are denoted by an established system shown below, where ! is the lowest quality (0) and K is the highest (42).

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40..  

Creating, Moving, Copying, and Removing

cp <filename> <copyname> can be used to copy a file

example:
cp SRR098026.fastq SRR098026-copy.fastq

Make a new directory using the command mkdir
example:
mkdir backup produces a directory named "backup"

Move your copy of the fastq file to your backup folder using the mv command
mv <filename> <new location name>
example:
mv SRR098026-copy.fastq backup/

The mv command can also be used to rename files
mv <filename> <new filename>
example:
mv original.txt new.txt
mv SRR098026.fastq SRR098026-rename.fastq

And we can move and rename at the same time.
example:
mv SRR098026-rename.fastq backup/SRR098026-rename2.fastq

File Permissions

It's important to set your permissions so that you don't accidentally overwrite your files.

Type the command ls -l to see the permissions for each of your files.

-rw-r--r-- 1 dcuser dcuser 43332 Nov 15 23:02 SRR098026-backup.fastq

r = read permission
w = write permission
x = execute permission

To change permissions of a file, use the command chmod
example:

chmod -w SRR098026-copy.fastq

removes permission to write to the file.

To give yourself permission to write to the file again:

chmod +w SRR098026-copy.fastq

Permissions can also be updated for only the user/owner (u), group (g), or others (o). For example, to give only the user write permission:

chmod u+w SRR098026-copy.fastq

Removing files

Be careful when using this command, as it will permanently delete the file (there is NO "recycle bin" in terminal!).
rm <filename>

To remove a directory, add the -r flag, which stands for "recursive". rm will then recursively delete the directory and all of the files and directories inside of it:
rm -r <directory>

One way to protect yourself from accidentally deleting data is to remove write permissions. This also prevents you from accidentally modifying your raw data. If you try to delete or edit the file or directory, it will tell you that permission is denied.

  • This is an especially good idea for backup directories!
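For example (a sketch with a throwaway file; note how the permissions string changes):

```shell
# Write-protect a file so it cannot be accidentally modified.
umask 022                     # assume a standard umask so the output is predictable
scratch=$(mktemp -d) && cd "$scratch"
touch precious.fastq
chmod -w precious.fastq       # remove write permission for everyone
ls -l precious.fastq          # permissions column now reads -r--r--r--
```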

Exercise

  • Make a copy of each of your fastq files: cp SRR098026.fastq SRR098026-backup.fastq and cp SRR097977.fastq SRR097977-backup.fastq
  • Make a new "backup" directory (make sure you've deleted your old one if you haven't already) using the command mkdir backup
  • Use the wildcard character to move all "copy" files at once mv SRR*-backup.fastq backup/
  • Remove write permissions to keep your backup data protected chmod -w backup/*-backup.fastq

Searching in files using grep

grep is a command line tool that lets you search in files and directories easily. For this section, you want to be in the untrimmed_fastq directory: cd ~/shell_data/untrimmed_fastq

To search in a file using grep, use the format:
grep <search term> <filename>
example:
grep NNNNNNNNNN SRR098026.fastq to search for all instances of 'NNNNNNNNNN' in SRR098026.fastq.

You can make your grep search more useful by also returning the lines before and after the instances of your search term:

  • -B1 will also return the line before your search term
  • -A2 will also return the two lines after your search term

example:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq will give you output that looks like this:

@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Redirecting output

Searching our files with grep is useful, but we often will also want to save the results of these searches. Redirection is an easy way to do this.

The command for redirecting output to a file is >
example:

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt

This will print the output of our search to a file named "bad_reads.txt". Note that this does not alter the original file in any way, it just copies the search query result.

  • Another important note is that each time you use > to write to the same file name, it will overwrite the original file. To append more information to an existing file, modify your redirect command by using >> instead of a single > arrow.
  • Example: grep NNN file.fastq >> output.txt will add to the output.txt file instead of overwriting it.

Note that in the UNIX command line, the file extension doesn't matter. The file could end in .txt or .fastq, but the command line won't know the difference*. Tools like grep or less or cat will all try to use the files in the same way.
* This is not true of all command line programs, as some more specialized tools may require specific input formats, but this is true of the tools we will be using for this section of the course.

The Word Count tool is used by entering the command wc <file.txt>
example:

wc bad_reads.txt

This will return the output: 537 1073 23217 bad_reads.txt. The numbers are, in order, the number of lines, words, and characters in the file.

Frequently, we will only want to get one of these numbers, such as the number of lines. To get the number of lines, add the -l argument:

wc -l bad_reads.txt

Which returns:
537 bad_reads.txt
So we can see that there are 537 lines in the file bad_reads.txt.

Giving your output files a different extension than your data files can be a good idea so you don't accidentally try to query your output files. For example, if you had named your output bad_reads.fastq instead of bad_reads.txt, this command:
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
would give you the error message:
grep: input file ‘bad_reads.fastq’ is also the output

Piping is done using the | (pipe) character. This redirects your output to another program instead of printing to your screen or to a file.
example:

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less

This command will send all the lines before/after the search term "NNNNNNNNNN" to less for viewing.

Another example:

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l 

will count the number of lines returned by your grep search, without having to make an intermediate file.
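Note that with -B1 -A2 the count includes the context lines (and grep's group separators), not just the matches themselves. If you want only the number of matching lines, grep's -c flag is a handy cross-check - a quick sketch on a fabricated file:

```shell
# grep -c counts matching lines only, with no context lines inflating the total.
scratch=$(mktemp -d) && cd "$scratch"
printf 'ACGT\nNNNNNNNNNN\nACGT\nNNNNNNNNNN\n' > reads.txt   # tiny made-up input
grep -c NNNNNNNNNN reads.txt                                # prints: 2
```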

For Loops

For loops are useful for performing the same operation on multiple different files.

Variables make it easy to use for loops. To assign a variable, use the format (note: no spaces around the =):
<variable name>=<value>
example:
foo=abc

To use this variable, add $ before the variable name.

For example, echo $foo will print:
abc
to your screen.

Variables can be combined with other information/text, but this has to be done in a specific way - by adding { and } (also known as "curly braces") around your variable name.

echo foo is $fooHello
won't work! You'll get back foo is

echo foo is ${foo}Hello
does work:
foo is abcHello

To make a for loop using variables, use the format:
for <variable> in <group to iterate over> ; do <commands> ; done

To write your for loop over multiple lines, you can also just hit "enter", and the terminal prompt will switch to > so you know it's still expecting more input.

$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done

Once you've entered done and hit Enter, the terminal will know that you're ready to run the loop.

You can also redirect the output to a new file each time you iterate through the loop:

$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done

NOTE: you need to use the "append" (>>) redirect command instead of > inside for loops, as > will overwrite the file each time you go through the loop!

basename is a command that removes a uniform part of a name (such as the file extension) from a file name.
Use the format: basename <filename> <part to remove>

basename SRR097977.fastq .fastq
will return: SRR097977

basename SRR097977.fastq astq
will return: SRR097977.f

To use basename in a for loop:

$ for file in *.fastq
> do
> name=$(basename ${file} .fastq)
> echo ${name}
> done

Will output just the accession number prefix of all the .fastq files:

SRR097977
SRR098026

Writing Scripts

Scripts combine multiple commands into a series that can be run at once. A script is essentially a text file with commands in it.

Command line text editors are used to write scripts. We will use nano.

nano <filename> will create a file and open a text editor window in which you can write and modify code.

ctrl+O will save the file. Hit Enter to verify the file name.

ctrl+X can be used to exit the file. Exiting without saving will result in a prompt to save.

bash <scriptname> can be used to run the written script.

To run the script without the bash command, you must change the script permissions to be executable (denoted by x).

To accomplish this:
chmod +x <scriptname> will add executing privileges to all users.
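Putting the pieces together - a minimal sketch (the script name and contents here are made up):

```shell
# Create a tiny script, mark it executable, and run it directly.
scratch=$(mktemp -d) && cd "$scratch"
printf '#!/bin/bash\necho "hello from a script"\n' > greet.sh
chmod +x greet.sh      # add execute permission (the x in ls -l output)
./greet.sh             # prints: hello from a script
```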

Moving & Downloading Data

wget <webaddress> stands for "world wide web get". This downloads data from a web address.

curl <webaddress>, or "see URL", displays content from a web address.

Neither is always installed, so use which curl or which wget to check what is available in your shell.

curl -O <webaddress> will also download data should wget be unavailable.

wget will download the file of interest to your current working directory.

To move data from a local server to an Amazon instance:
scp <filename> <dcuser@....com>:/home/dcuser/<pathname>

To download data from the Amazon instance:

For Windows PSCP:

C:\User\your-pc-username\Downloads> pscp.exe dcuser@ec2-54-88-126-85.compute-1.amazonaws.com:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt C:\User\your-pc-username\Downloads

For UNIX:

scp dcuser@ip.address:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt ~/Downloads

Project Organization

Well-organized projects support collaboration with your future self! Bad labeling now will make looking at data in the future much more challenging and time-consuming.

mkdir = "make directory": use to generate directories

ls -R lists all contents of a directory and all contents of directories contained within the directory.

Organizing Your Files

  • Keep original raw data (without write permissions) quarantined in its own directory. Only modify copies of raw data.
  • Store results in a separate results folder.
  • Use a docs folder to store written analyses.

Documenting Your Activity on the Project

Using the history command we learned earlier, you can easily store a log of the modifications you have made to your data as a text file for later verification.

For example, to append the last several commands you just performed on your file to a running log of actions, you could execute:
history | tail -n <number> >> activity_log.txt

Data Wrangling and Processing

Command line tools are useful to prepare genomic data for analysis. The process of preparing a dataset for experimental analysis is known as data wrangling.

We will be processing public access genomic data using a very common workflow that is broadly applicable to many types of sequence analysis.

We will be wrangling and processing a multi-generational dataset of E. coli from Richard Lenski's lab.

The metadata for this experiment includes:

  • strain (strain name)
  • generation (generation frozen)
  • clade (based on parsimony-based tree)
  • reference (study the samples were originally seq'd for)
  • population (ancestral population group)
  • mutator (hypermutability mutant status)
  • facility (facility samples were sequenced at)
  • run (sequence read archive sample ID)
  • read_type (library type of reads)
  • read_length (length of reads in sample)
  • sequencing_depth (depth of sequencing)
  • cit (citrate-using mutant status)

You can use Excel filtering tools to answer basic questions about the dataset. However, after saving a separate copy of the original raw data, any heavy-duty modifications should be done at the command line, which offers better automation and scalability.

Bioinformatic Workflows

A series of bioinformatic tools used to process a dataset is collectively referred to as a pipeline.

Standardized workflows help to keep data analysis consistent throughout a project.

Today, we will use FastQC to assess read quality in the fastq file.

Starting with Data

In order to download the files for today, open Terminal and log in to your AMI.

ssh dcuser@<yourAMI>
password: data4Carp

Use who to check that you are the only one using your machine.

To download the files for today:

mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz 

NOTE: If you are ever stuck and have no prompt line to type on, use ctrl+C to cancel any command that the computer is currently processing.

cp <filename> <copyname> will help you to maintain the integrity of your original data. Before the names, you can also include pathnames to specify where you want the file to come from or the copy to go to.

Use ls -lh to check whether all of your files downloaded properly.

NOTE: With commands like ls, flags can be combined. -l (long form list) and -h (human readable file size) are separate flags being executed at once by -lh.

Quality Control

Details on the FASTQ Format

As mentioned yesterday, the fastq format is as follows for each read:

  1. Always begins with @ and then information about the read
  2. The actual DNA sequence
  3. Always begins with a + and sometimes the same information as line 1
  4. Has a string of characters which represent the quality scores; must have same number of characters as line 2

On line 2, an N denotes an unknown base.

On line 4, quality scores are denoted by an established system shown below, where ! is the lowest quality (0) and K is the highest (42).

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40..  

Quality for each read tends to be higher at the beginning of the read and lower at the end.

Use head -n 4 SRR2584863_1.fastq to check that your file is properly formatted.

Assessing Quality using FastQC

FastQC produces a boxplot of quality at each position across all reads, visualizing the data from line 4 of each fastq read within ranges of good (green), acceptable (yellow), and bad (red).

Running FastQC

FastQC accepts wildcards and gzip files (files that end in .gz, which are compressed versions of the full file).

fastqc *.fastq* will process all fastq files in the current directory. Nice and easy!

FastQC generates a .zip and .html file. The html file is a stable webpage containing a summary report for the run, and the zip file is a compressed file containing multiple output subfiles.

Make sure to move your results files away from your raw data for further analysis by making a directory specifically for your results files (mkdir <directory>) and moving your files there (mv <file(s)> <directory>).
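A sketch of that tidy-up, using fabricated stand-in file names (in the workshop the outputs would be the FastQC .zip and .html files, and the directory would live under ~/dc_workshop/results):

```shell
# Quarantine analysis outputs in a results directory, leaving raw data behind.
scratch=$(mktemp -d) && cd "$scratch"
touch sample_fastqc.zip sample_fastqc.html sample.fastq   # stand-in outputs + raw file
mkdir -p results/fastqc_untrimmed_reads
mv *.zip *.html results/fastqc_untrimmed_reads/
ls results/fastqc_untrimmed_reads    # the two report files now live here
ls                                   # sample.fastq stays behind with the raw data
```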

Viewing the FastQC Results

open SRR2584863_1_fastqc.html would open a window on our computer if we were not working on a VM.

If we transfer this file to our computer, we can open it in our own browser.

To download the directory containing the files, open a local instance of Terminal or GitBash:

$ mkdir -p ~/Desktop/fastqc_html 
$ scp dcuser@<yourAMI>:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html

We are effectively telling the computer to grab the directory from the location you specified on your AMI and copy it to the local location you provided (~/Desktop/fastqc_html/).

Now, this should open the file locally:

$ cd ~/Desktop/fastqc_html/ 
$ open *.html 

Decoding FastQC Outputs

In order to look at the output files, we must unzip them. Because unzip only accepts one file at a time, we can unzip all of the .zip files at once using a for loop:

$ for filename in *.zip
> do
> unzip $filename
> done

Once the files are unzipped, ls -F tells us whether any of the items in the outputs are directories.

In each of the output directories, there are a number of files, including a summary.txt file.

The summary.txt file will tell you which quality checks PASS or FAIL.

Documenting Our Work

Combining our summary.txt files will make a record of all obtained results. This can be accomplished by concatenating:

cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt

grep can be used to pull any line that contains FAIL and can further be used to remove those lines from the analysis during data tidying.

Before you remove any data, always make sure the QC is thresholding data the way you want it to.
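A sketch of that grep check, on a fabricated two-line summary mimicking FastQC's tab-separated PASS/FAIL format:

```shell
# Pull and count FAIL lines from a combined FastQC summary.
scratch=$(mktemp -d) && cd "$scratch"
printf 'PASS\tBasic Statistics\tsample.fastq\nFAIL\tPer base sequence quality\tsample.fastq\n' > fastqc_summaries.txt
grep FAIL fastqc_summaries.txt       # show each failed check
grep -c FAIL fastqc_summaries.txt    # prints: 1
```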

Trimming and Filtering

We will use the program trimmomatic to filter our reads.

Trimmomatic takes many input parameters so that it processes your data correctly. Using the \ character allows you to continue a command on a new line in your terminal window, which makes long commands much easier to read and understand.

An example for running trimmomatic (not the command we will use):

trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq  \
              SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
              SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
              ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

PE specifies that it will take paired-end files as input
-threads 4 to use four computing threads to run (this will speed up our run)
SRR_1056_1.fastq the first input file name
SRR_1056_2.fastq the second input file name
SRR_1056_1.trimmed.fastq the output file for surviving pairs from the _1 file
SRR_1056_1un.trimmed.fastq the output file for orphaned reads from the _1 file
SRR_1056_2.trimmed.fastq the output file for surviving pairs from the _2 file
SRR_1056_2un.trimmed.fastq the output file for orphaned reads from the _2 file
ILLUMINACLIP:SRR_adapters.fa to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa
SLIDINGWINDOW:4:20 to use a sliding window of size 4 that will remove bases if their phred score is below 20

This command is large and complicated. We will use a for loop to do the same thing for each file without having to type the command in each time.
Note: run gzip SRR2584863_1.fastq to compress this file again before running the code below

$ for infile in *_1.fastq.gz
> do
>   base=$(basename ${infile} _1.fastq.gz)
>   trimmomatic PE ${infile} ${base}_2.fastq.gz \
>                ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
>                ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
>                SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 
> done

We can see how useful the basename functionality is in cases like this!

Now we have our original files and our trimmed files:

NexteraPE-PE.fa               SRR2584866_1.fastq.gz         SRR2589044_1.trim.fastq.gz
SRR2584863_1.fastq.gz         SRR2584866_1.trim.fastq.gz    SRR2589044_1un.trim.fastq.gz
SRR2584863_1.trim.fastq.gz    SRR2584866_1un.trim.fastq.gz  SRR2589044_2.fastq.gz
SRR2584863_1un.trim.fastq.gz  SRR2584866_2.fastq.gz         SRR2589044_2.trim.fastq.gz
SRR2584863_2.fastq.gz         SRR2584866_2.trim.fastq.gz    SRR2589044_2un.trim.fastq.gz
SRR2584863_2.trim.fastq.gz    SRR2584866_2un.trim.fastq.gz
SRR2584863_2un.trim.fastq.gz  SRR2589044_1.fastq.gz

We want to move the trimmed files to a new directory so in the future, we know that they were in fact trimmed.

$ cd ~/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ mv *.trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls #this will show you the trimmed reads in the new directory

Variant Calling Workflow

Before we can call variants in our sample, we must first align the reads to a reference genome. We will use a common aligner, the Burrows-Wheeler Aligner, via the bwa command.

Steps to align to genome:

  1. index the reference genome
  2. align sequencing reads to reference

We are using the command curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz to download the reference.

  • note that the -L flag is useful here when using curl because it allows for redirection (if the server says the data is not exactly where you specified, curl will follow the redirect and download the data from the correct location).

Next, unzip the reference:
gunzip data/ref_genome/ecoli_rel606.fasta.gz

We are also downloading a shorter set of trimmed fastq files for this workshop:

$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/dc_workshop/data/trimmed_fastq_small

tar xvf extracts (x) files verbosely (v) from the archive file (f) that follows.
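A self-contained sketch of those flags (the archive here is created on the spot just so there is something to extract):

```shell
# tar: extract (x) verbosely (v) from the file (f) named sub.tar.gz.
scratch=$(mktemp -d) && cd "$scratch"
mkdir sub && echo data > sub/file.txt
tar czf sub.tar.gz sub       # first make a small gzipped archive to practice on
rm -r sub
tar xvf sub.tar.gz           # lists sub/ and sub/file.txt as it extracts them
```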

We will make some new directories to keep our data and results organized:
$ mkdir -p results/sam results/bam results/bcf results/vcf

Index the reference

Now we will perform the first step: index the reference genome (note that this only has to be done once).

bwa index data/ref_genome/ecoli_rel606.fasta

This should run quickly (~2 seconds)

Align reads to reference genome

Example command to run this alignment:

bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam

By default, bwa prints its results to the screen. To save the results, we redirect to an output file such as output.sam.

We use the following command to align one sample (SRR2584866):

bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam

SAM/BAM format
SAM files (and the compressed binary version of a SAM file, called a BAM file) are the standard way to store aligned sequencing data.

We will convert our SAM file to BAM format using a program called samtools.

samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
  • The BAM file will be smaller than the original SAM file

To do the variant calling efficiently, we also need to sort the file by coordinate.

$ samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam 
  • SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.

samtools is a program with lots of functionality. You can also use it to see some statistics about your BAM file with the flagstat option:
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam

Output:

351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Perform the Variant Calling

For this workshop, we will use bcftools to make variant calls. https://samtools.github.io/bcftools/howtos/index.html

Step 1: Calculate the read coverage of positions in the genome using the mpileup command in bcftools

bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam 

Step 2: Perform the variant calling using the call command in bcftools

# Old version in notes: bcftools call --ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf 

# Updated command to use
bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf 

Step 3: Filter and report the variants using the varFilter command in vcfutils

vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf  > results/vcf/SRR2584866_final_variants.vcf

Use less -S results/vcf/SRR2584866_final_variants.vcf to open the final variants file.

Shell Scripts to Automate a Variant Calling Workflow

Variables are assigned using the = operator, with no spaces around it. Ex: myVarName='hello'

They can be recalled using the $ in front of the variable name. Ex: $myVarName
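For example (the variable names are arbitrary):

```shell
# Assign with no spaces around '='
myVarName='hello'

# Recall the value by prefixing the name with $
echo "$myVarName"

# Variables can be embedded in larger strings with ${ }
greeting="${myVarName}, world"
echo "$greeting"
```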

Use touch read_qc.sh to generate an empty file named read_qc.sh. Then nano read_qc.sh to open the file.

Script contents:

set -e
cd ~/dc_workshop/data/untrimmed_fastq/

echo "Running FastQC..."
fastqc *.fastq*

mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
echo "Saving FastQC results..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/

cd ~/dc_workshop/results/fastqc_untrimmed_reads/
echo "Unzipping..."
for filename in *.zip
    do
    unzip $filename
    done
echo "Saving summary..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt

Save the file with Ctrl-O and then Ctrl-X to exit.

To run the script use bash read_qc.sh.

Automating the rest of the workflow

First, use cd ~/dc_workshop/scripts to go to the scripts folder

Then use

curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh

to download the example script. To access the script use:

cd ~/dc_workshop/scripts
less run_variant_calling.sh

Steps performed in the script:

  1. align the reads to a reference genome and output a sam file with bwa
    -bwa mem $genome $fq1 $fq2 > $sam
  2. Convert the sam file to a bam file with samtools
    -samtools view -S -b $sam > $bam
  3. Sort the bam file with samtools
    -samtools sort -o $sorted_bam $bam
  4. Index the bam file with samtools
    -samtools index $sorted_bam
  5. Calculate the read coverage with bcftools
    -bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
  6. Call SNPs with bcftools
    -bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
  7. Filter and report the SNP variants in a vcf format with vcfutils
    -vcfutils.pl varFilter $variants > $final_variants

To run the script use bash run_variant_calling.sh
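Inside run_variant_calling.sh, the steps above run in a loop over all samples. A runnable sketch of just the looping logic (demo_fastq and the sample name are placeholders, and the bioinformatics steps are left as a comment), showing how basename recovers each sample name from its R1 filename:

```shell
# Set up placeholder paired-end files to loop over
mkdir -p demo_fastq
touch demo_fastq/SRR2584866_1.trim.sub.fastq demo_fastq/SRR2584866_2.trim.sub.fastq

for fq1 in demo_fastq/*_1.trim.sub.fastq; do
    # Strip the directory and the _1 suffix to get the bare sample name
    base=$(basename "$fq1" _1.trim.sub.fastq)
    # Reconstruct the matching R2 filename
    fq2=demo_fastq/${base}_2.trim.sub.fastq
    echo "sample: $base  R1: $fq1  R2: $fq2"
    # ...steps 1-7 above would go here, using $base to name output files...
done
```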

Cloud Computing for Genomics

Advantages:

  • more resources (computing power) available than on a local machine
  • full administrative rights - install anything
  • use pre-configured “images” (machine snapshot where operating system and software are already installed)
  • everyone in your group can access a server with the same tools and configurations available
  • long-running analyses will continue to run on the server even when your computer is turned off
  • local operating system doesn't matter (Windows, Linux, MacOS) - once you're on the server, it will be unix-based

Disadvantages:

  • difficult to visualize files while they are on the server
  • takes time to move data and files to and from the server
  • costs money to run
  • no local systems administrator to help if there are problems

Verify your environment
whoami shows your username on the computer you have connected to

df -h shows space on hard drive

free -h shows how much memory (RAM) is available and in use on the server.

cat /proc/cpuinfo shows detailed information on how many processors (CPUs) the machine has

tree -L 1 shows a tree view of the file system 1 level below your current location

Staying Connected to the Cloud

We will work through how to use tmux, but screen is another option.

To start a new session, use the command tmux new -s session_name

  • be sure to give your session a more descriptive name than "session_name"!

Detach a session (its processes keep running in the background) by typing control + b on your keyboard, then d.

tmux list-sessions shows you all active sessions running

After detaching, you can get back to that session using the command: tmux attach -t session_name

  • if you only have one session, you can just enter tmux attach and it will auto-connect you to that session

You can switch sessions using tmux switch -t session_name

And you can end a session using tmux kill-session -t session_name

See course page for list of cloud computing options: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html

Installing on cloud server
YUM (for Red Hat and CentOS instances) and APT (for Debian or Ubuntu instances) are two package manager programs that can be used to install software on the cloud server.

For example, if we didn't already have tmux installed, we could run the command apt search tmux to see if it is available to be installed. To install tmux, run the command apt install tmux.
Note there is also a program called apt-get that is nearly the same as apt. They have the same functionality most of the time.

Before installing with apt, you should refresh the list of available packages:
sudo apt update

Then sudo apt upgrade will install any available updates.

  • sudo stands for "super user do", which runs the command from the administrator account (read more about sudo here)
  • note: in this Amazon webserver instance, we do not have sudo permissions, so we cannot perform this command right now

Duke's Virtual Computing Manager can be found by logging in to vcm.duke.edu with your Duke NetID and password.

Job scheduling on computing clusters is done at Duke using Slurm. Read more about this workload manager here: https://slurm.schedmd.com/
