Duke University, December 18-19, 2019
This is a shared notes document for all in the course to view/edit.
Course website: https://duke-gcb.github.io/2019-12-18-duke/
Lesson: https://datacarpentry.org/organization-genomics/
What kinds of metadata are important?
Sequencing data is worthless without the metadata! Ask yourself: Are there metadata standards for the data I'm going to collect?
The cardinal rules of using spreadsheet programs for data:
Can you spot the problems in this sample submission sheet?
Retrieving sample sequencing data from the facility
Storing data
Storage guidelines:
The Short Read Archive (SRA) database houses publicly available short read sequencing data.
Use the SraRunTable feature to work with information about multiple samples at one time.
Tenaillon dataset: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605
Click on the Run Number of the first entry in the bottom table on the page (third table down). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
Go back to the ‘previous page’. At the top of the page and in the Total row you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the ‘RunInfo Table’ button and save the file locally.
Note: This is not the actual sequencing data - it is only the metadata saved to a tab-delimited file named SraRunTable.txt.
If you only want to download one or two sequencing files, it's easiest to go to the European Nucleotide Archive.
If you want to download multiple files all at once, it is easiest to use the NCBI software package called the sra-toolkit,
which is a command-line tool.
Note: We will not be using this tool for this workshop.
Shell: What is it? Why do we care?
Please edit this note (clicking the Pencil icon above) and claim a virtual machine from the list below by placing your name next to it:
ec2-3-94-131-81.compute-1.amazonaws.com (Dahlia Rohm)
ec2-54-81-4-35.compute-1.amazonaws.com (Rachel Shahan)
ec2-54-166-133-238.compute-1.amazonaws.com (Sanjay Khandelwal)
ec2-18-234-27-234.compute-1.amazonaws.com (Young J Yun)
ec2-54-88-25-97.compute-1.amazonaws.com (Azita Sadeghpour)
ec2-54-81-79-226.compute-1.amazonaws.com (goomber Shelly)
ec2-52-91-232-162.compute-1.amazonaws.com (Zhaohui Wang)
ec2-34-207-134-244.compute-1.amazonaws.com (Matthew Hirschey)
ec2-100-24-11-24.compute-1.amazonaws.com (Akshay Bareja)
ec2-34-226-204-158.compute-1.amazonaws.com (Ryan Rickels)
ec2-54-196-108-196.compute-1.amazonaws.com (Criss Guy)
ec2-52-91-110-228.compute-1.amazonaws.com (Julio Barrera)
ec2-3-93-72-49.compute-1.amazonaws.com (Julia Gamache)
ec2-54-162-228-14.compute-1.amazonaws.com (Kevin Friede)
ec2-3-89-70-68.compute-1.amazonaws.com (Dan Leehr)
ec2-174-129-30-52.compute-1.amazonaws.com (Hilmar Lapp)
ec2-3-89-152-175.compute-1.amazonaws.com (Matt Green)
ec2-3-87-250-169.compute-1.amazonaws.com (golshid sanati)
ec2-35-153-136-69.compute-1.amazonaws.com (Zachery Mielko)
ec2-54-204-180-228.compute-1.amazonaws.com (Lingyun)
ec2-34-235-112-16.compute-1.amazonaws.com (goomber shelly)
ec2-34-227-73-65.compute-1.amazonaws.com (Matt Detter)
ec2-54-161-116-43.compute-1.amazonaws.com (Lijun Ma)
ec2-3-84-215-148.compute-1.amazonaws.com (Zhiguo Sun)
ec2-35-173-183-115.compute-1.amazonaws.com (lihua wang)
ec2-35-153-134-106.compute-1.amazonaws.com (Val G)
ec2-3-82-14-182.compute-1.amazonaws.com (Tyler Allen)
ec2-54-210-185-236.compute-1.amazonaws.com (Rachel Meade)
ssh dcuser@addressname.com # replace addressname.com with your machine name from above
pwd
= "print working directory": tells you where you are in your system or what directory you are working in
ls
= "list": shows all files in your current working directory
cd <foldername>
= "change directory": moves you to a new directory
cd ..
will take you to the directory within which your current working directory is stored (one level up)
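The navigation commands above can be sketched as a quick round trip (the directory names here are throwaway examples, not the workshop data):

```shell
# Make a nested directory in a temp location and walk into and out of it
cd "$(mktemp -d)"
mkdir -p project/data
cd project/data
here=$(pwd)        # ends in .../project/data
cd ..
up=$(pwd)          # one level up: ends in .../project
echo "$here"
echo "$up"
```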
Many commands have options, or flags, that can be used to get more precise output from the command.
ls -l
will list more information about listed items, such as size, permissions, date of last modification
A d at the beginning of the permissions line indicates that the item is a directory.
man <command_of_interest>
= "manual": prints the instruction manual for an individual command (VERY USEFUL)
Tab key can be used to complete the names of files or directories as you type. Can be used to limit typing errors during coding.
If tab complete doesn't work after one tap, try double tapping. This will provide a list of possible completions to what you have already typed.
ls -a
will reveal hidden directories, the names of which always begin with a .
ls directory
will list the contents of the directory (even if it is not your current working directory)
Every file and directory has a full pathname (also known as an absolute pathname), beginning with the root directory and ending with the file name. This always begins with a slash (e.g. /home/usr/file).
However, relative pathnames reduce the amount of typing necessary while offering flexibility in file location, especially when the file is referenced in a script. This path is relative to the current working directory (e.g. ../file).
Wildcards (*) can be used to reference multiple files at once.
Wildcards can reference any number of characters and can be used anywhere in the file reference.
Given a directory with files dog, domestic_cat, wild_cat:
ls do*
will give me dog domestic_cat
ls *cat
will give me domestic_cat wild_cat
ls *tic*
will give me domestic_cat
ls w*cat
will give me wild_cat
ls *[gm]*
will give me dog domestic_cat
(any file whose name contains "g" OR "m")
echo
will print its argument input as a line of text. This is a great defensive coding tactic: check what your wildcards match before executing another command on the group of files, such as deleting them.
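A minimal sketch of this defensive tactic, using the dog/cat example files from above in a throwaway temp directory:

```shell
# Create throwaway files matching the earlier example
cd "$(mktemp -d)"
touch dog domestic_cat wild_cat

# Preview what the wildcard matches BEFORE running anything destructive
echo do*       # prints: dog domestic_cat

# Only after confirming the match list would you run, e.g.: rm do*
```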
ctrl+C
will cancel a command
history
will print an indexed list of all previously executed commands. To re-run one of these commands by number, you can simply type !<number>
and the referenced command will execute.
ctrl+R
will do a reverse search through your command history. Very useful if you don't want to spend time bashing the up arrow key (which will direct you to the commands you have already executed from most to least recent).
ctrl+L
or clear
will clear your terminal or bash screen
cat <filename>
will print the contents of a file; bad for large files
less <filename>
will give you a peek at the file in a separate window; great for reading files
Within less, if you type / then a search term then Enter, it will search the file for your term.
Typing / then Enter again will take you to the next occurrence of your search term.
man pages use less, so these search tactics can be used there too.
head <filename>
prints the top of a file of interest.
tail <filename>
prints the bottom of a file of interest
Adding the -n #
argument after head
or tail
will allow you to specify how many lines you want printed.
For example:
head -n 4 SRR098026.fastq
will print the top 4 lines of the fastq file, or one read entry.
Each read entry is 4 lines and is organized as such:
Line 1 begins with @ and then information about the read.
Line 2 is the sequence itself; here an N denotes an unknown base.
Line 3 begins with + and sometimes the same info as in line 1.
On line 4, quality scores are denoted by an established system shown below, where !
is the lowest quality (0) and K
is the highest (42).
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
| | | | |
Quality score: 0........10........20........30........40..
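The Phred+33 encoding above can be checked by hand: a character's quality score is its ASCII code minus 33. A small sketch (the character I is just an example):

```shell
# Decode one Phred+33 quality character into its numeric score.
# POSIX printf prints a character's ASCII code when the argument starts with a quote.
char="I"
ascii=$(printf '%d' "'$char")
score=$((ascii - 33))        # '!' is ASCII 33 and encodes quality 0
echo "$char -> Q$score"      # prints: I -> Q40
```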
cp <filename> <copyname>
can be used to copy a file
example:
cp SRR098026.fastq SRR098026-copy.fastq
Make a new directory using the command mkdir
example:
mkdir backup
produces a directory named "backup"
Move your copy of the fastq file to your backup folder using the mv
command
mv <filename> <new location name>
example:
mv SRR098026-copy.fastq backup/
The mv
command can also be used to rename files
mv <filename> <new filename>
example:
mv original.txt new.txt
mv SRR098026.fastq SRR098026-rename.fastq
And we can move and rename at the same time.
example:
mv SRR098026-rename.fastq backup/SRR098026-rename2.fastq
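A self-contained sketch of the three mv patterns, using hypothetical file names in a temp directory:

```shell
cd "$(mktemp -d)"
touch original.txt
mkdir backup

mv original.txt renamed.txt              # rename in place
mv renamed.txt backup/archived.txt       # move and rename in one step
ls backup                                # prints: archived.txt
```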
It's important to set your permissions so that you don't accidentally overwrite your files.
Type the command ls -l
to see the permissions for each of your files.
-rw-r--r-- 1 dcuser dcuser 43332 Nov 15 23:02 SRR098026-backup.fastq
r = read permission
w = write permission
x = execute permission
To change permissions of a file, use the command chmod
example:
chmod -w SRR098026-copy.fastq
removes permission to write to the file.
To give yourself permission to write to the file again:
chmod +w SRR098026-copy.fastq
Permissions can also be updated for only the user/owner (u), group (g), or others (o). For example, if you only want the user to have write permission:
chmod u+w SRR098026-copy.fastq
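A sketch of the write-protect round trip (hypothetical file name; chmod 644 just pins down a known starting mode):

```shell
cd "$(mktemp -d)"
touch precious.txt
chmod 644 precious.txt         # known starting mode: rw-r--r--
chmod -w precious.txt          # strip write permission for everyone
mode=$(ls -l precious.txt | cut -c1-10)
echo "$mode"                   # prints: -r--r--r--
chmod u+w precious.txt         # give the owner write permission back
```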
rm <filename>
removes a file. Be careful when using this command, as it will permanently delete the file (there is NO "recycle bin" in the terminal!).
To remove a directory, add the -r
flag, which stands for "recursive". rm
will then recursively delete the directory and all of the files and directories inside of it:
rm -r <directory>
One way to protect yourself from accidentally deleting data is to remove write permissions. This also prevents you from accidentally modifying your raw data. If you try to delete or edit the file or directory, it will tell you that permission is denied.
Exercise
cp SRR098026.fastq SRR098026-backup.fastq
and cp SRR097977.fastq SRR097977-backup.fastq
mkdir backup
mv SRR*-backup.fastq backup/
chmod -w backup/*-backup.fastq
Grep is a command line tool that lets you search in files and directories easily. For this section, you want to be in the untrimmed_reads
directory: cd ~/shell_data/untrimmed_fastq
To search in a file using grep, use the format:
grep <search term> <filename>
example:
grep NNNNNNNNNN SRR098026.fastq
to search for all instances of 'NNNNNNNNNN' in SRR098026.fastq.
You can make your grep search more useful by also returning the lines before and after the instances of your search term:
-B1 will also return the line before your search term
-A2 will also return the two lines after your search term
example:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
will give you output that looks like this:
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Using Grep to search our files is useful, but we often will also want to save the results of these searches. Redirection is an easy way to do this.
The command for redirecting output to a file is >
example:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
This will print the output of our search to a file named "bad_reads.txt". Note that this does not alter the original file in any way, it just copies the search query result.
If you use > to write to the same file name, it will overwrite the original file. To append more information to an existing file, use a double arrow >> instead of a single > arrow.
grep NNN file.fastq >> output.txt
will add to the output.txt file instead of overwriting it.
Note that in the UNIX command line, the file extension doesn't matter. The file could end in .txt or .fastq, but the command line won't know the difference*. Tools like grep or less or cat will all try to use the files in the same way.
* This is not true of all command line programs, as some more specialized tools may require specific input formats, but this is true of the tools we will be using for this section of the course.
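A quick sketch of overwrite (>) versus append (>>), using a throwaway log.txt:

```shell
cd "$(mktemp -d)"
echo "first"  > log.txt      # > creates the file
echo "second" > log.txt      # > again OVERWRITES: only "second" remains
echo "third" >> log.txt      # >> APPENDS: "second" then "third"
cat log.txt
```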
The Word Count tool is used by entering the command wc <file.txt>
example:
wc bad_reads.txt
This will return the output: 537 1073 23217 bad_reads.txt
. Each number corresponds to the number of lines, words and characters in the file.
Frequently, we will only want to get one of these numbers, such as the number of lines. To get the number of lines, add the -l
argument:
wc -l bad_reads.txt
Which returns:
537 bad_reads.txt
So we can see that there are 537 lines in the file bad_reads.txt.
Giving your output files a different extension than your data files can be a good idea so you don't accidentally try to query your output files. For example, if you had named your output bad_reads.fastq
instead of bad_reads.txt
, this command:
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
would give you the error message:
grep: input file ‘bad_reads.fastq’ is also the output
Piping is done using the |
(pipe) operator. This sends your output to another program instead of printing it to your screen or to a file.
example:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
This command will send all the lines before/after the search term "NNNNNNNNNN" to less
for viewing.
Another example:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
will count the number of lines returned by your grep search without having to make an intermediate file.
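The same pipe pattern on a tiny hypothetical stand-in file (not the workshop fastq):

```shell
cd "$(mktemp -d)"
# Four-line toy file: two lines contain the search term
printf 'AAAA\nNNNN\nGGGG\nNNNN\n' > toy.txt
grep NNNN toy.txt | wc -l    # counts the matching lines with no temp file
```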
For loops are useful for performing the same operation on multiple different files.
Variables make it easy to use for loops. To assign a variable, use the format (with no spaces around the =):
<variable name>=<value>
example:
foo=abc
To use this variable, add $
before the variable name.
For example, echo $foo
will print:
abc
to your screen.
Variables can be combined with other information/text, but this has to be done in a specific way: by adding { and } (also known as "curly brackets") around your variable name.
echo foo is $fooHello
won't work, because the shell looks for a variable named fooHello! You'll get back: foo is
echo foo is ${foo}Hello
does work:
foo is abcHello
To make a for loop using variables, use the format:
for <variable> in <group to iterate over> ; do <commands> ; done
To write your for loop over multiple lines, you can also just hit "enter", and the terminal prompt will switch to >
so you know it's still expecting more input.
$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done
Once you've entered done
and hit Enter, the terminal will know that you're ready to run the loop.
You can also redirect the output to a new file each time you iterate through the loop:
$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done
NOTE: you need to use the append (>>) redirect instead of > inside for loops, as > will overwrite the file each time you go through the loop!
basename is a command that removes a uniform part of a name (such as the file extension) from a file name.
Use the format: basename <filename> <part to remove>
basename SRR097977.fastq .fastq
will return: SRR097977
basename SRR097977.fastq astq
will return: SRR097977.f
To use basename in a for loop:
$ for file in *.fastq
> do
> name=$(basename ${file} .fastq)
> echo ${name}
> done
Will output just the accession number prefix of all the .fastq files:
SRR097977
SRR098026
Scripts are used to combine multiple commands into a series that can be run at once. It is essentially a text file with commands in it.
Command line text editors are used to write scripts. We will use nano
.
nano <filename>
will create a file and open a text editor window in which you can write and modify code.
ctrl+O
will save the file. Hit Enter to verify the file name.
ctrl+X
can be used to exit the file. Exiting without saving will result in a prompt to save.
bash <scriptname>
can be used to run the written script.
To run the script without the bash
command, you must change the script permissions to make it executable (denoted by x
).
To accomplish this:
chmod +x <scriptname>
will add execute permission for all users.
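A minimal sketch of the chmod +x workflow with a hypothetical one-line script:

```shell
cd "$(mktemp -d)"
# Write a tiny script (name and contents are just for illustration)
printf '#!/bin/sh\necho "hello from a script"\n' > hello.sh
bash hello.sh                # works without the x bit
chmod +x hello.sh            # add execute permission
./hello.sh                   # now runs directly; prints: hello from a script
```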
wget <webaddress>
stands for "world wide web get". This downloads data from a web address.
curl <webaddress>
, or "see URL", displays content from a web address.
Neither is always available everywhere, so use which curl
or which wget
to check which one is installed on your system.
curl -O <webaddress>
will also download data should wget
be unavailable.
wget
will download the file of interest to your current working directory.
To move data from a local server to an Amazon instance:
scp <filename> <dcuser@....com>:/home/dcuser/<pathname>
To download data from the Amazon instance:
For Windows PSCP:
C:\User\your-pc-username\Downloads> pscp.exe dcuser@ec2-54-88-126-85.compute-1.amazonaws.com:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt C:\User\your-pc-username\Downloads
For UNIX:
scp dcuser@ip.address:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt ~/Downloads
Well-organized projects support collaboration with your future self! Bad labeling now will make looking at data in the future much more challenging and time-consuming.
mkdir
= "make directory": use to generate directories
ls -R
lists all contents of a directory and all contents of directories contained within the directory.
Using the history
command we learned earlier, you can easily store a log of the modifications you have made to your data as a text file for later verification.
For example, if you wanted to append a number of actions you just performed on your file to a running log of actions, you could execute (replacing # with the number of commands to keep, e.g. 5):
history | tail -n # >> activity_log.txt
Command line tools are useful to prepare genomic data for analysis. The process of preparing a dataset for experimental analysis is known as data wrangling.
We will be processing public access genomic data using a very common workflow that is broadly applicable to many types of sequence analysis.
We will be processing and data wrangling a multi-generational dataset of E. coli from Richard Lenski's lab.
The metadata for this experiment includes:
You can use Excel filtering tools to answer basic questions about the dataset. However, after saving a separate copy of the original raw data, any heavy-duty modifications should be completed on the command line, due to its better automation & scalability.
A series of bioinformatic tools used to process a dataset is collectively referred to as a pipeline.
Standardized workflows help to keep data analysis consistent throughout a project.
Today, we will use FastQC to assess read quality in the fastq file.
In order to download the files for today, open Terminal and log in to your AMI.
ssh dcuser@<yourAMI>
password: data4Carp
Use who
to check that you are the only one using your machine.
To download the files for today:
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
NOTE: If you are ever stuck and have no prompt line to type on, use ctrl+C
to cancel any command that the computer is currently processing.
cp <filename> <copyname>
will help you to maintain the integrity of your original data. Before the names, you can also include pathnames to specify where you want the file to come from or the copy to go to.
Use ls -lh
to check whether all of your files downloaded properly.
NOTE: With commands like ls
, flags can be combined. -l
(long form list) and -h
(human readable file size) are separate flags being executed at once by -lh
.
As mentioned yesterday, the fastq format is as follows for each read:
Line 1 begins with @ and then information about the read.
Line 2 is the sequence itself; here an N denotes an unknown base.
Line 3 begins with + and sometimes the same info as in line 1.
On line 4, quality scores are denoted by an established system shown below, where !
is the lowest quality (0) and K
is the highest (42).
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
| | | | |
Quality score: 0........10........20........30........40..
Quality for each read tends to be higher at the beginning of the read and lower at the end.
Use head -n 4 SRR2584863_1.fastq
to check that your file is properly formatted.
FastQC will give a boxplot of the quality of each read, visualizing the data on line 4 of the fastq read within ranges of good (green), acceptable (yellow), and bad (red).
FastQC accepts wildcards and gzip files (files that end in .gz
, which are condensed forms of the full file).
fastqc *.fastq*
will process all fastq files in the current directory. Nice and easy!
FastQC generates a .zip
and .html
file. The html file is a stable webpage containing a summary report for the run, and the zip file is a compressed file containing multiple output subfiles.
Make sure to move your results files away from your raw data for further analysis by making a directory specifically for your results files (mkdir <directory>
) and moving your files there (mv <file(s)> <directory>
).
open SRR2584863_1_fastqc.html
would open a window on our computer if we were not working on a VM.
If we transfer this file to our computer, we can open it in our own browser.
To download the directory containing the files, open a local instance of Terminal or GitBash:
$ mkdir -p ~/Desktop/fastqc_html
$ scp dcuser@<yourAMI>:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html
We are effectively telling the computer to grab the .html files from the location you specified on your AMI and copy them to the local location you provided (~/Desktop/fastqc_html/).
Now, this should open the file locally:
$ cd ~/Desktop/fastqc_html/
$ open *.html
In order to look at the output files, we must unzip them. Because unzip
only accepts one file at a time, we can unzip all of the .zip
files at once using a for loop:
$ for filename in *.zip
> do
> unzip $filename
> done
Once the files are unzipped, ls -F
tells us whether any of the items in the outputs are directories.
In each of the output directories, there are a number of files, including a summary.txt
file.
The summary.txt
will tell you which sequences PASS
or FAIL
.
Combining our summary.txt
files will make a record of all obtained results. This can be accomplished by concatenating:
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
grep
can be used to pull any line that contains FAIL
and can further be used to remove those lines from the analysis during data tidying.
Before you remove any data, always make sure the QC is thresholding data the way you want it to.
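A sketch of that grep-based filtering on a hypothetical three-line summary file (tab-separated, mimicking FastQC's summary.txt layout):

```shell
cd "$(mktemp -d)"
# Toy stand-in for a concatenated FastQC summary: one FAIL, two PASS
printf 'PASS\tBasic Statistics\tsample1\nFAIL\tPer base sequence quality\tsample1\nPASS\tBasic Statistics\tsample2\n' > fastqc_summaries.txt

grep FAIL fastqc_summaries.txt                    # show only the failed checks
grep -c FAIL fastqc_summaries.txt                 # count them
grep -v FAIL fastqc_summaries.txt > passing.txt   # -v keeps the non-FAIL lines
```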
We will use the program trimmomatic to filter our reads.
Trimmomatic takes many input parameters so it processes your data the correct way. Using the \
character allows you to continue constructing your command on a new line in your terminal window, which makes it much easier to read and understand the commands you are giving.
An example for running trimmomatic (not the command we will use):
trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20
PE
tells Trimmomatic that it will be taking paired-end files as input
-threads 4
to use four computing threads to run (this will speed up our run)
SRR_1056_1.fastq
the first input file name
SRR_1056_2.fastq
the second input file name
SRR_1056_1.trimmed.fastq
the output file for surviving pairs from the _1
file
SRR_1056_1un.trimmed.fastq
the output file for orphaned reads from the _1
file
SRR_1056_2.trimmed.fastq
the output file for surviving pairs from the _2
file
SRR_1056_2un.trimmed.fastq
the output file for orphaned reads from the _2
file
ILLUMINACLIP:SRR_adapters.fa
to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa
SLIDINGWINDOW:4:20
to use a sliding window of size 4 that will remove bases if their phred score is below 20
This command is large and complicated. We will use a for loop to do the same thing for each file without having to type the command in each time.
Note: run gzip SRR2584863_1.fastq
to compress this file again before running the code below
$ for infile in *_1.fastq.gz
> do
> base=$(basename ${infile} _1.fastq.gz)
> trimmomatic PE ${infile} ${base}_2.fastq.gz \
> ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
> ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
> SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
> done
We can see how useful the basename
functionality is in cases like this!
Now we have our original files and our trimmed files:
NexteraPE-PE.fa SRR2584866_1.fastq.gz SRR2589044_1.trim.fastq.gz
SRR2584863_1.fastq.gz SRR2584866_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz
SRR2584863_1.trim.fastq.gz SRR2584866_1un.trim.fastq.gz SRR2589044_2.fastq.gz
SRR2584863_1un.trim.fastq.gz SRR2584866_2.fastq.gz SRR2589044_2.trim.fastq.gz
SRR2584863_2.fastq.gz SRR2584866_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz
SRR2584863_2.trim.fastq.gz SRR2584866_2un.trim.fastq.gz
SRR2584863_2un.trim.fastq.gz SRR2589044_1.fastq.gz
We want to move the trimmed files to a new directory so in the future, we know that they were in fact trimmed.
$ cd ~/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ mv *.trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls #this will show you the trimmed reads in the new directory
Before we can call variants in our sample, we must first align the samples to a reference genome. We will use a common aligner: Burrows-Wheeler Aligner using the bwa
command.
Steps to align to genome:
We are using the command curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
to download the reference.
The -L flag is useful here when using curl because it allows for redirection (if the website says the data is not exactly where you specified, curl will still download the data from the correct location).
Next, unzip the reference:
gunzip data/ref_genome/ecoli_rel606.fasta.gz
We are also downloading a shorter set of trimmed fastq files for this workshop:
$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/dc_workshop/data/trimmed_fastq_small
tar xvf
is another command, used to extract files from compressed archives/directories.
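A self-contained round trip showing tar in both directions (throwaway files, not the workshop download):

```shell
cd "$(mktemp -d)"
mkdir sub && echo "reads" > sub/sample.fastq
tar czf sub.tar.gz sub/      # c=create, z=gzip, f=archive file name
rm -r sub                    # throw away the original
tar xvf sub.tar.gz           # x=extract, v=verbose, f=archive file name
cat sub/sample.fastq         # the extracted copy is back
```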
We will make some new directories to keep our data and results organized:
$ mkdir -p results/sam results/bam results/bcf results/vcf
Now we will perform the first step: index the reference genome (note that this only has to be done once).
bwa index data/ref_genome/ecoli_rel606.fasta
This should run quickly (~2 seconds)
Example command to run this alignment:
bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
By default, bwa
will only print results to the screen. To save the results, we redirect them to an output file such as output.sam.
We use the following command to align one sample (SRR2584866):
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
SAM/BAM format
SAM files (and the compressed version of a SAM file, called a BAM file) are the standard way to store aligned sequencing data.
We will convert our SAM file to BAM format using a program called samtools
.
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
To do the variant calling efficiently, we also need to sort the file by coordinate.
$ samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
samtools
is a program with lots of functionality. You can also use it to see some statistics about your BAM file with the flagstat
option:
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam
Output:
351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
For this workshop, we will use bcftools to make variant calls. https://samtools.github.io/bcftools/howtos/index.html
Step 1: Calculate the read coverage of positions in the genome using the mpileup command in bcftools
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
Step 2: Perform the variant calling using the call command in bcftools
# Old version in notes: bcftools call --ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
# Updated command to use
bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
Step 3: Filter and report the variants using the varFilter command in vcfutils
vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
Use less -S results/vcf/SRR2584866_final_variants.vcf
to open the final variants file.
Variables are assigned using the =
operator, with no spaces around it. Ex: myVarName='hello'
They can be recalled using the $
in front of the variable name. Ex: $myVarName
Use touch read_qc.sh
to generate an empty file named read_qc.sh. Then nano read_qc.sh
to open the file.
Script contents:
set -e
cd ~/dc_workshop/data/untrimmed_fastq/
echo "Running FastQC..."
fastqc *.fastq
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
echo "Saving FastQC results..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
cd ~/dc_workshop/results/fastqc_untrimmed_reads/
echo "Unzipping..."
for filename in *.zip
do
unzip $filename
done
echo "Saving summary..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
Save the file with Ctrl-O and then Ctrl-X to exit.
To run the script use bash read_qc.sh
.
First, use cd ~/dc_workshop/scripts
to go to the scripts folder
Then use
curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh
to download the example script. To access the script use:
cd ~/dc_workshop/scripts
less run_variant_calling.sh
Steps performed in the script:
bwa mem $genome $fq1 $fq2 > $sam
samtools view -S -b $sam > $bam
samtools sort -o $sorted_bam $bam
samtools index $sorted_bam
bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
vcfutils.pl varFilter $variants > $final_variants
bash run_variant_calling.sh
Advantages:
Disadvantages:
Verify your environment
whoami
shows your username on the computer you have connected to
df -h
shows space on hard drive
free -h
is another command that will show you details of the availability and resources of your cloud server.
cat /proc/cpuinfo
shows detailed information on how many processors (CPUs) the machine has
tree -L 1
shows a tree view of the file system 1 level below your current location
Staying Connected to the Cloud
We will work through how to use tmux
, but screen
is another option.
To start a new session, use the command tmux new -s session_name
Detach session (process keeps running in background) by typing on your keyboard: control + b
. Next type d
to detach the session.
tmux list-sessions
shows you all active sessions running
After detaching, you can get back to that session using the command: tmux attach -t session_name
You can also simply type tmux attach
and it will auto-connect you to that session.
You can switch sessions using tmux switch -t session_name
And you can end a session using tmux kill-session -t session_name
See course page for list of cloud computing options: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html
Installing on cloud server
YUM
(for Red Hat and Centos instances) and APT
(for Debian or Ubuntu instances) are two package manager programs that can be used to install software on the cloud server.
For example, if we didn't already have tmux
installed, we could run the command apt search tmux
to see if it is available to be installed. To install tmux
, run the command apt install tmux
.
Note there is also a program called apt-get
that is nearly the same as apt
. They have the same functionality most of the time.
Before installing using apt
, you should check for updates:
sudo apt upgrade
sudo
stands for "super user do", which runs the command from the administrator account.
We do not have sudo
permissions on these machines, so we cannot perform this command right now.
Duke's Virtual Computing Manager can be found by logging in to vcm.duke.edu with your Duke NetID and password.
Job scheduling on computing clusters is done at Duke using Slurm
. Read more about this workload manager here: https://slurm.schedmd.com/