Here we are going to walk through a tutorial for processing 16S data from 22 samples from Mono Lake :+1:
These are paired-end, 2x300, Illumina sequencing data, with primers targeting the V4V5 region of the small-subunit ribosomal RNA gene.
These pages are derived from my amplicon processing and analysis tutorial here at Happy Belly Bioinformatics, but modified so the code works with our Mono Lake data. The Happy Belly version uses a different tutorial dataset, and may have more links and information in some places.
This is the processing page (the first part). The analysis page (the second part) can be found here.
Be sure to be logged into our server. Once on our server, we can activate the environment that holds all our required programs by executing the following at the command line:
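If the environment is managed with conda, for example, activating it might look something like this (the environment name here is just a placeholder - use whatever our course setup specifies):

```bash
conda activate amplicon-env    # hypothetical environment name
```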
And we are going to create and change into a directory to do this work. The following uses a variable that holds our unique username, that way the same command works properly for all of us:
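A sketch of what that might look like, using the `$USER` variable that bash sets for us automatically (the base directory name is just an example):

```bash
mkdir -p ~/amplicon-work-${USER}    # hypothetical directory name
cd ~/amplicon-work-${USER}
```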
Our starting read files are located somewhere else, but we will point to them when the time comes :+1:
Our starting point
We are starting with 1 pair of read files for each of our samples (a forward and a reverse read file for each), and these still have primers on them (the small, highly conserved sequences that were used to amplify our targets).
We'll talk about these steps more as we get to them, but here's an overview table of the main steps we'll be doing as part of processing amplicon data with dada2:
| | Command | What we're doing |
|---|---|---|
| 1 | `cutadapt` | remove the primers |
| 2 | `filterAndTrim()` | quality trim/filter |
| 3 | `learnErrors()` | generate an error model of our data |
| 4 | `dada()` | infer ASVs (Amplicon Sequence Variants) on both forward and reverse reads independently |
| 5 | `mergePairs()` | merge forward and reverse ASVs to produce full-length ASVs |
| 6 | `makeSequenceTable()` | generate a count table |
| 7 | `removeBimeraDenovo()` | screen for and remove chimeras |
| 8 | `IdTaxa()` | assign taxonomy |
It is convenient to have a file that holds the unique portion of our sample names for purposes of looping through them and naming things in R. So here, we are going to use some command-line commands to help us make a file that holds unique IDs for each sample.
This integrates a lot of what we covered in the Unix crash course. The code blocks below are broken down one piece or step at a time, which is exactly how I built up the final command in real time.
First, here is where our input files are located, and how we can see a list of just the forward read files:
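For example, using the input directory noted further below, listing just the forward (R1) read files might look like this:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz
```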
One sample file looks like this:
MLW-0521-05_S186_L001_R1_001.fastq.gz
The unique portion of that, what I would consider a unique sample ID for our purposes here, is this part up front:
MLW-0521-05_S186
So first, to make a file holding those unique portions, we want to get rid of all the path/address information up to the actual filenames. We can use `cut`, specifying '/' as the delimiter, and saying we want to keep the 5th "column":
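Building on the `ls` command above, a sketch of that:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5
```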
Because these are all nicely formatted the same way, we can also use `cut` to keep just the part before the `_L001_R1_001.fastq.gz`, by setting the delimiter to '_' and keeping the first and second "columns":
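Adding that onto the pipe:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5 | cut -d "_" -f 1,2
```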
That's the good stuff, so now that we know we have what we want, we can bring that command back up (with the up arrow while in our command-line window) and add our `>` redirector in order to write it to a new file:
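Something like this (the output filename matches the 'unique-sample-IDs.txt' file used throughout the rest of this page):

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5 | cut -d "_" -f 1,2 > unique-sample-IDs.txt
```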
Then we can peek at it with `head` if wanted:
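```bash
head unique-sample-IDs.txt
```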
Primers are the short fragments of highly conserved DNA that we used to amplify our target sequences. We need them to do targeted sequencing like this, but they can introduce artificial bases to our true biological sequences, so it is important to remove them (especially when using a single-nucleotide resolution method rather than more traditional OTU-clustering).
We will use cutadapt (version 3.4 on our system) to do this step.
Here are the primers that were used to generate this dataset:
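The primer sequences themselves aren't shown on this page, but based on the targeted region (positions 515–926 of the 16S gene) and the primer lengths noted below (19 and 20 bps), they are presumably the standard 515F/926R V4V5 pair - treat these as an assumption and double-check them against the actual sequencing information before using them:

```
515F (forward, 19 bp): GTGYCAGCMGCCGCGGTAA     # assumed
926R (reverse, 20 bp): CCGYCAATTYMTTTRAGTTT    # assumed
```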
It's important to make sure we are trimming the primers properly, so it's a good idea to visually check a file or two to make sure things are working like we expect before we run it on all of our samples.
Here we will run the `cutadapt` command we think is appropriate on just one sample's files first, look at the output information from `cutadapt`, and peek at the reads themselves before and after `cutadapt` to make sure the primers were cut off like we expect. Then, when we trust our command, we'll write it into a loop in order to run it on all samples for us.
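Here's a sketch of what that single-sample command might look like (the primer sequences are the assumed 515F/926R pair noted above, and the output filenames are just placeholders; the flags match the breakdown below):

```bash
# note: the primer sequences below are assumed - double-check them first
cutadapt -g GTGYCAGCMGCCGCGGTAA \
    -G CCGYCAATTYMTTTRAGTTT \
    -o MLW-0521-05_S186_R1_trimmed.fastq.gz \
    -p MLW-0521-05_S186_R2_trimmed.fastq.gz \
    --discard-untrimmed \
    /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R1_001.fastq.gz \
    /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R2_001.fastq.gz
```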
Note that the `\` characters (backslashes, which act as escape characters at the command line) are only there so that longer code can be written across multiple lines - they tell the command line to ignore the next character, which here is a newline character. That's done just for readability on the page.
CODE BREAKDOWN
- `cutadapt` - this is our base command
- `-g` - here is where we provide the forward primer that will be searched on the forward read
- `-G` - here is where we provide the reverse primer that will be searched on the reverse read
- `-o` - this is where we specify the name we want for the output forward read file
- `-p` - this is where we specify the name we want for the output reverse read file
- `--discard-untrimmed` - this tells `cutadapt` to throw away any reads where it doesn't find the primers as we expect them to be (since they should be there for this dataset, their absence likely indicates a problem with a read)
- the last two positional arguments are the forward and reverse read input files (these need to be provided in that order)
Some information is printed to the screen when running that; here is a look at some of that output:
98.9% of reads were written. Since we have the `--discard-untrimmed` flag specified in our command, this means it successfully found the primers in all of those, which is great – most of our reads had the primers just like we expected.
The forward primer was found and trimmed in almost all forward reads. The lengths that were cut off look good too – almost all of them were the full forward primer length, which is 19 bps.
The reverse primer was also found and trimmed in almost all reverse reads. The lengths that were cut off again look good – almost all of them were the full 20 bps of the reverse primer.
Those outputs are pretty clean, but we can also peek at the before and after read files themselves just to visually confirm the primers are indeed removed from the output files:
Peeking at a forward read before `cutadapt`:
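One way to peek at the first read in the gzipped fastq file (just a sketch using standard `zcat` and `head`):

```bash
zcat /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R1_001.fastq.gz | head -n 4
```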
Peeking at a forward read after `cutadapt`:
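And the same peek on the (hypothetical) trimmed output file from the sketch above:

```bash
zcat MLW-0521-05_S186_R1_trimmed.fastq.gz | head -n 4
```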
And we can see that the leading GTGTCAGCAGCCGCGGTAA is missing from the forward read output file, which matches our forward primer :+1:
Peeking at a reverse read before `cutadapt`:
Peeking at a reverse read after `cutadapt`:
And we can also see that the leading CCGTCAATTTATTTAAGTTT is missing from the reverse read output file, which corresponds to our reverse primer :+1:
Since this is super-quick on this sample, and we'll be running it on all of them next, let's delete those test output files we just made:
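Assuming the test output names used in the sketch above:

```bash
rm MLW-0521-05_S186_R1_trimmed.fastq.gz MLW-0521-05_S186_R2_trimmed.fastq.gz
```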
Now that we trust our `cutadapt` command to work on these specific samples, we can run one command to process them all if we do it with a for loop.
It can be helpful to organize our files, so let's make a directory that will hold our primer-trimmed read files:
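For example (the directory name is just a suggestion, used in the loop sketch further below):

```bash
mkdir primer-trimmed-reads
```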
Now we'll start building up our for loop. First we'll just add a line that prints what sample we're working on to our loop and run that:
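Using the sample-ID file we made earlier, that starting loop might look like this:

```bash
for sample in $(cat unique-sample-IDs.txt)
do
    echo "On sample: ${sample}"
done
```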
Our starting reads are somewhere else, remember. So to actually point to the files, we need to add that location and the suffixes that came after the unique portions we kept as sample IDs. It can be helpful to check we are doing this right just with `ls`, instead of trying to run something like `cutadapt` first.
So here we'll modify it to run `ls` on each expected read file for each sample. This way we can spot now if we are using the variable and adding the path (address) and filename suffixes properly.
- The path to the directory holding our input reads is: `/data1/data/Tag_2021Jun_2x300/`
- The forward read suffix is: `_L001_R1_001.fastq.gz`
- The reverse read suffix is: `_L001_R2_001.fastq.gz`
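Putting those together with our `${sample}` variable, the check might look like this:

```bash
for sample in $(cat unique-sample-IDs.txt)
do
    ls /data1/data/Tag_2021Jun_2x300/${sample}_L001_R1_001.fastq.gz
    ls /data1/data/Tag_2021Jun_2x300/${sample}_L001_R2_001.fastq.gz
done
```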
If those were not pointing to the correct location, we would get a `No such file or directory` message.
So now that we are looping through our sample IDs properly, and we know how to point to our input read files based on those sample IDs, we can place our original `cutadapt` command in the body of our loop and modify it as needed.
Remember this is what our single-sample command looked like:
The 2 things we need to change from our single-sample command are:

- the forward read (`-o`) and reverse read (`-p`) output arguments need to have the `${sample}` variable in their names in order to be unique for each sample
- the input files need to point to their location and use the `${sample}` variable, as we tested above

And there's 1 thing we will add for convenience:

- we are going to send the information that normally prints to the screen into a log file, using the `>>` redirector in this case so that we are appending to the file, rather than overwriting it each iteration of the loop (which is what the single `>` would do), along with `2>&1` so that any error messages are written to that file too
So our final loop would look something like this:
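Here's a sketch of that final loop (the primer sequences, output directory, and log-file name are placeholders consistent with the sketches above):

```bash
for sample in $(cat unique-sample-IDs.txt)
do

    echo "On sample: ${sample}"

    # primer sequences are the assumed 515F/926R pair - double-check them first
    cutadapt -g GTGYCAGCMGCCGCGGTAA \
        -G CCGYCAATTYMTTTRAGTTT \
        -o primer-trimmed-reads/${sample}_R1_trimmed.fastq.gz \
        -p primer-trimmed-reads/${sample}_R2_trimmed.fastq.gz \
        --discard-untrimmed \
        /data1/data/Tag_2021Jun_2x300/${sample}_L001_R1_001.fastq.gz \
        /data1/data/Tag_2021Jun_2x300/${sample}_L001_R2_001.fastq.gz \
        >> cutadapt-log.txt 2>&1

done
```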
And when we run it, it will print out each sample as it works on them (doesn't take long).
We can look at the output log file with the `less` program:
And we can see some info about the first sample right away. Like we saw with the test one we did above, about 98.5% of reads were written out (so primers were found and trimmed).
Instead of scrolling through and looking for this value for all of the samples, we can easily pull that info out into a table. Remember, to exit `less`, we can press the `q` key.
Remember grep is our pattern finder that will grab each line that matches what we search for. So let's look for a word that is only in the lines that we want to pull out of the file, like this line:
"Pairs written (passing filters): 34,261 (98.5%)"
Here, we'll `grep` for "written":
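Assuming the log-file name from the loop sketch above:

```bash
grep "written" cutadapt-log.txt
```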
But that's also grabbing another line about the bps being written, instead of just the one we want about how many read-pairs were written. If this weren't immediately apparent, we can also tell by quickly counting the number of lines returned that there are more than the number of samples we expect (which is 23 in this case):
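One quick way to count them:

```bash
grep "written" cutadapt-log.txt | wc -l
```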
So let's try just "Pairs":
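```bash
grep "Pairs" cutadapt-log.txt
```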
That gets the lines we want. And remember we can use `cut` to split based on columns defined by any single-character delimiter we want. So we can `cut` out our percent if we use the "(" symbol as a delimiter and specify the appropriate "column" we want:
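Counting the "(" characters in the example line above, the percent sits in the 3rd "column", so something like:

```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3
```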
That leaves us with the trailing ")" though. One way we can delete that character is with the `tr` command, adding the `-d` flag which specifies to delete all occurrences of the given character (rather than replacing it with something else):
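```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3 | tr -d ")"
```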
Now, it would be nice to have the sample names with these also. Since this was done in a loop based on the 'unique-sample-IDs.txt' file, we know these are all in the same order. So we can just combine them with the paste command.
This could all be done in one line, but we'll break it apart first by putting our percent values into a file. Here we can use our single `>` redirector, which lets us write the output to a file instead of just printing it to the terminal:
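For example (the output filename is just a suggestion):

```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3 | tr -d ")" > primer-trimming-percents.txt
```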
And we can stick that file together with our unique-sample-IDs.txt file with `paste` like so:
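```bash
paste unique-sample-IDs.txt primer-trimming-percents.txt
```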
All look pretty good except the "blank" sample, which also has something weird going on with read length (some reads are different lengths than others), which we can see by taking a peek at it:
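One way to take that peek, for example printing the lengths of the first few sequences in the blank's forward read file (just a sketch):

```bash
zcat /data1/data/Tag_2021Jun_2x300/MLW-DNAexBlank-50cyc_S193_L001_R1_001.fastq.gz | head -n 20 | awk 'NR % 4 == 2 { print length($0) }'
```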
Let's look at the cutadapt log for this sample in the `less` program. Once the program is open, we can search for specific text if we want by first pressing the `/` key, then typing/pasting what we want to look for and hitting `return`. Here we'll search for the "blank" sample name: MLW-DNAexBlank-50cyc_S193
Here is the primary summary output for that sample:
We can see from this that most of the forward reads don't have the primer we are looking for, and only about 38% of the reverse did. But we still have about 14,000 read-pairs for which the primers were successfully found.
Remember, to exit `less`, we can press the `q` key.
Since we still have ~14,000 reads, and since this is a "blank", I'm not going to worry that something else weird is going on and we'll just ask Dan about it when we get to this point 🙂
The program we're using to recover our actual amplicon sequences is dada2, which runs in R. Now that we have trimmed our primers, we're able to hop into R and do the rest of the processing there.
To access R/RStudio on our setup, we want to run the following command in the server terminal:
This should print a link to our terminal that we can go to in our browser to access our RStudio environment :+1:
We want to do our work in the text editor in the top-left window of RStudio. From here we can conveniently work on our code and run single sections/lines as we want.
After it loads up in our browser, click 'File' -> 'New File' -> 'R Script':
And that will pop open the top-left window for us if it isn't there already.
This should be where we copy-and-paste/type most of our code.
And the first thing we should do is change to the location we've been working in, here is how we can do that in R:
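For example, pointing R at the working directory we made at the start (the path here is hypothetical, matching the earlier sketch - adjust as needed):

```r
# setting the working directory to where we've been working at the command line
setwd(paste0("~/amplicon-work-", Sys.getenv("USER")))
```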
Now that we're in our appropriate working directory, it's a good idea to save our Rscript (this text document). We can do 'File' -> 'Save':
And name it something like 'dada2-processing.R'. The ".R" extension is just convention for an R script.
Next, we are loading the `dada2` package and printing out which version it is:
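```r
library(dada2)
packageVersion("dada2")
```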
Now we are setting up some variables that will make running the following commands easier:
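One likely piece of that is reading our sample IDs into R from the file we made earlier (a sketch; the variable name `samples` is used in later sketches):

```r
# reading in our unique sample IDs as a character vector
samples <- scan("unique-sample-IDs.txt", what = "character")
```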
The dada2 commands are built to run on multiple samples at once. To make that easier, we are going to build out variables that hold the full paths to the read files:
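A sketch of building those vectors, using the variable names referenced in the breakdowns below (the primer-trimmed directory and the file suffixes are placeholders matching the earlier loop sketch):

```r
# paths to our primer-trimmed input reads (from the cutadapt loop)
forward_reads <- paste0("primer-trimmed-reads/", samples, "_R1_trimmed.fastq.gz")
reverse_reads <- paste0("primer-trimmed-reads/", samples, "_R2_trimmed.fastq.gz")

# paths for the quality-filtered reads we are about to make
filtered_forward_reads <- paste0(samples, "_R1_filtered.fastq.gz")
filtered_reverse_reads <- paste0(samples, "_R2_filtered.fastq.gz")
```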
We can get an overview summary image of the quality of our reads with the following function, to start we are just going to look at one forward and one reverse read file:
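Using dada2's plotting function, that might look like:

```r
plotQualityProfile(forward_reads[1])
plotQualityProfile(reverse_reads[1])
```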
Forward reads of one sample
On these plots, the bases are along the x-axis, and the quality score on the y-axis. The black underlying heatmap shows the frequency of each score at each base position, the green line is the mean quality score at that base position, the orange is the median, and the dashed orange lines show the quartiles. The red line at the bottom shows what percentage of reads are that length.
In Phred talk:
| Phred value | Error probability |
|---|---|
| 40 | 1 in 10,000 |
| 30 | 1 in 1,000 |
| 20 | 1 in 100 |
| 10 | 1 in 10 |
So these look great; only towards the end of the reads does the mean start to cross below a quality score of 30 (1 in 1,000 error probability).
Reverse reads of one sample
As is common, the reverse reads drop in quality towards the end of the reads. At around 200 bps, the mean quality score starts to drop below 30 (1 in 1,000 error probability).
In a case like this, I would consider cutting these reverse reads off at around 200 bps IF we have enough overlap to do so (which we do here, as discussed next).
Our primers target positions 515-926 (relative to positions in a reference 16S gene), so nominally we are trying to span a fragment that is 926 - 515 = 411 bases long. We have reads that are reading 300 bps from each side of this fragment, so without any trimming our overlap would be about:
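Roughly (a back-of-the-envelope estimate, ignoring the primers):

```r
(300 + 300) - 411    # ~189 bps of overlap before any trimming
```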
Or, thought of a different way:
We need to be careful when trimming that we don't trim off so much that the reads won't overlap anymore, and will therefore fail to merge later:
We want at least about 12 bases of overlap for our merge step to work well (that's the default minimum for dada2's `mergePairs()` command). Treating that as ~20, just to allow some room for variation, means we can cut at most about 170 bases combined between the forward and reverse reads and still have enough overlap to merge our reads successfully.
This varies by dataset based on the primers used and the length of the reads, so it is something that should be considered for each different combination of those. If we over-trim, and our reads no longer overlap each other at all, then we cannot merge them and will have problems later.
All forward reads
We can plot all samples' forward reads at once like so (will take about a minute):
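```r
plotQualityProfile(forward_reads)
```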
This can let us take a quick look to see if any particular sample seems to be drastically different from the rest.
All reverse reads
There aren't any samples that jump out looking very different.
Moving forward with trimming: based on what we looked at above, I'd go with trimming the forward reads back to 250 bps and the reverse back to 200 bps, which means we're trimming 150 bps total (less than our upper limit of ~170 as decided above).
Here's our quality-filtering command, which should take about a minute, and we'll break it down next:
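Based on the breakdown that follows, the command would look something like this:

```r
filtered_out <- filterAndTrim(forward_reads, filtered_forward_reads,
                              reverse_reads, filtered_reverse_reads,
                              maxEE = c(2, 2), truncLen = c(250, 200),
                              multithread = 4)
```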
CODE BREAKDOWN
- `filtered_out <-` - this is where we are specifying the variable to hold the regular output of whatever we are about to run
- `filterAndTrim(...)` - here is the primary function we are calling
  - `forward_reads` - this first positional argument is the variable that holds the paths to our input forward reads
  - `filtered_forward_reads` - this second positional argument is the variable that holds the paths to where we want the output forward reads to be written
  - `reverse_reads` - this third positional argument is the variable that holds the paths to our input reverse reads
  - `filtered_reverse_reads` - this fourth positional argument is the variable that holds the paths to where we want the output reverse reads to be written
  - `maxEE = c(2, 2)` - this is a quality-filtering threshold based on expected errors; here we are saying we want to drop a read if it has more than 2 expected erroneous base calls (we specify this for both the forward and reverse reads by providing it as `c(2, 2)`)
  - `truncLen = c(250, 200)` - this is telling the program to, right off the bat, trim the forward reads to 250 bps (cutting off the last 50) and the reverse reads to 200 bps (cutting off the last 100)
  - `multithread = 4` - here we are telling it to run things in parallel using up to 4 threads
That command wrote out the new read files to the locations we had specified in the `filtered_forward_reads` and `filtered_reverse_reads` objects; we can see them if we run this:
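For example (matching the hypothetical filtered-file names used in the sketch above):

```r
list.files(pattern = "_filtered.fastq.gz")
```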
And it also stored some info in the R object `filtered_out`, telling how many reads we had at the start and how many made it through the quality-filtering we ran:
And we can see they all maintained just under 80% of total reads:
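One quick way to compute that from `filtered_out`, which holds the reads in and reads out for each sample (a sketch):

```r
# percent of reads retained per sample after quality filtering
round(filtered_out[, 2] / filtered_out[, 1] * 100, 1)
```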
And most are in the tens of thousands of reads, so I think this is fine to move forward with. If we were losing more reads than we wanted, we could revisit how we are doing the quality-trimming. There are no one-size-fits-all protocols for most of these things. All things being equal, we want the highest quality data we can have. But we also want some data, and there may be times when a sequencing run came back in poorer shape than these, and we might want to lower our stringency in the balance between "the best data we can get" and "working with what we have".
Let's take a look at the quality-filtered quality profiles:
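```r
plotQualityProfile(filtered_forward_reads[1])
plotQualityProfile(filtered_reverse_reads[1])
```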
Quality-filtered forward reads of one sample
Looks great.
Quality-filtered reverse reads of one sample
Also looks much better :+1:
And we can also again run them all together:
All quality-filtered forward reads
No sample stands out.
All quality-filtered reverse reads
And all reverse read profiles look similar too :+1:
Next up is generating our error model by learning the specific error-signature of our dataset. Each sequencing run, even when all goes well, will have its own subtle variations to its error profile. This step tries to assess that for both the forward and reverse reads, and then dada2 incorporates that information when it is trying to distinguish actual biologically derived sequences from sequences generated through sequencing error.
We run this separately on the forward and reverse reads; each should take about 4-5 minutes, as done here:
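Using the variable names referenced below (the reverse-read object name is the natural counterpart and is an assumption), that might look like:

```r
err_forward_reads <- learnErrors(filtered_forward_reads, multithread = 4)
err_reverse_reads <- learnErrors(filtered_reverse_reads, multithread = 4)
```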
Here the first positional argument is the variable holding the paths to all of our filtered reads, and the `multithread = 4` part is saying to run things in parallel where possible with up to 4 threads.
If we had a problem running this, we can just load my data object with the following lines:
After that's done, the developers have incorporated a plotting function to visualize how well the estimated error rates match up with the observed. The plots look very similar between forward and reverse, so here is just the reverse:
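That plotting function is dada2's `plotErrors()`; for the reverse-read error model it might look like:

```r
plotErrors(err_reverse_reads, nominalQ = TRUE)
```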
Ben goes into how to assess this a bit here. Each plot shows, for a given base-swap, how the observed error rates (the points plotted) compare to estimated error rates (the black line). And we want those to match up pretty well (as they do here). I've yet to see one of these that really looked any different from this or Ben's example, so I don't really know how "off" things would be when we should actually have a concern. I'd just write the dada2 developers if I ran into anything strange ¯\_(ツ)_/¯
Here is where we finally get to doing what dada2 was born to do, and that is to try to infer true biological sequences in our data. dada2 does this by incorporating the consensus quality profiles and abundances of each unique read, along with the error profiles it modeled, and then figuring out if each sequence is more likely to be of biological origin or more likely to be spurious. If interested, you can read more about the details of this in the paper of course or looking through the dada2 site.
This takes about 2-3 minutes to run for each:
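Based on the breakdown below (the reverse-read call mirrors the forward one and is an assumption), that would be something like:

```r
dada_forward <- dada(filtered_forward_reads, err = err_forward_reads,
                     pool = "pseudo", multithread = 4)
dada_reverse <- dada(filtered_reverse_reads, err = err_reverse_reads,
                     pool = "pseudo", multithread = 4)
```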
CODE BREAKDOWN
- `dada_forward <-` - this is where we are specifying the variable to hold the regular output of what we are running
- `dada(...)` - the primary function we are calling
  - `filtered_forward_reads` - this first positional argument is the variable that holds the paths to our input forward reads (our filtered reads in this case)
  - `err = err_forward_reads` - here we are passing the variable that holds the output error profile for the forward reads to the `err` argument
  - `pool = "pseudo"` - this is a method to share data across different samples to help when trying to infer true biological sequences (can read more about it here)
  - `multithread = 4` - telling the program to run things in parallel where possible, using up to 4 threads
Now dada2 merges the forward and reverse sequences to reconstruct our full target amplicon. By default, it requires the overlapping region to be identical between the two, which is a good way to go when we are using a single-nucleotide resolution approach like this.
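Based on the breakdown below, the merging command would look something like this:

```r
merged_amplicons <- mergePairs(dada_forward, filtered_forward_reads,
                               dada_reverse, filtered_reverse_reads)
```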
CODE BREAKDOWN
- `merged_amplicons <-` - this is where we are specifying the variable to hold the regular output of what we are running
- `mergePairs(...)` - the primary function we are calling
  - `dada_forward` - this first positional argument is the variable holding the object we created with the `dada()` function above that we ran on the forward reads
  - `filtered_forward_reads` - this second positional argument is the variable that holds the paths to our input forward reads (our filtered reads in this case)
  - `dada_reverse` - this third positional argument is the variable holding the object we created with the `dada()` function above that we ran on the reverse reads
  - `filtered_reverse_reads` - this fourth positional argument is the variable that holds the paths to our input reverse reads (our filtered reads in this case)
Now we can generate a count table with the `makeSequenceTable()` function. This is one of the typical main outputs from processing an amplicon dataset: counts of how many times each unique sequence (ASVs here) was detected in each sample.
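A sketch of that step:

```r
seqtab <- makeSequenceTable(merged_amplicons)
dim(seqtab)    # samples by unique sequences
```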
We can see from the size of `seqtab` that there are 12,787 unique sequences across our dataset. But it's not very friendly to look at in its current form because the actual sequences are the column names – so we'll make a more manageable table in a couple of steps.
dada2 identifies likely chimeras by ranking all the unique ASVs in order of abundance, and then seeing if there are any lower-abundance sequences that can be made exactly by mixing left and right portions of two of the more-abundant ones. These are then removed:
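A sketch of that step (the output variable name is just a common naming convention):

```r
seqtab_nochim <- removeBimeraDenovo(seqtab, multithread = 4, verbose = TRUE)
```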
So, it's not unusual to have a lot of these identified, but being up around 90% as is the case here is getting closer to the unusual side of things. I've looked into this before, and did again after seeing what this dataset looked like. The dada2 github issues forum has quite a few posts from people asking about high numbers of chimeras being identified, and even some as high as 97% of unique sequences (like this one) sometimes happen. Ben (dada2 developer) notes in that thread that particularly amplicons that span multiple hypervariable regions (which ours do here, spanning V4V5) can be prone to produce many chimeras like this. So though this is a bit on the high end of what's typical in terms of unique sequences, what's more important is what we are losing in terms of abundance, as is noted on the dada2 FAQ:
> It is common for chimeric sequences to be a majority (even large majority) of the total number of unique sequence variants inferred by the `dada(...)` algorithm, but they should not be a majority of the sequencing reads. That is, there may be many chimeras, but they are relatively low abundance.
If we want to load this instead of letting it finish, we can do so by running this line in R:
Although we lost almost 90% of our unique sequences, we don't know if those particular sequences held a lot in terms of abundance yet. Here is one quick way we can look at that:
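One quick check is comparing the total counts before and after chimera removal (using the object names from the sketches above):

```r
sum(seqtab_nochim) / sum(seqtab)    # proportion of total reads retained (~0.67 here)
```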
In terms of abundance, we are retaining about 67% of reads following the chimera removal.
A common computational problem that can lead to high chimeras is if the primers weren't properly removed, but that's not the case with this dataset. So given that there are reported cases of even higher proportions of unique sequences being identified as chimeras, and that we are keeping about 67% of the total reads, I would move forward with things here.
The developers' dada2 tutorial provides an example of a nice, quick way to pull out how many reads were dropped at various points of the pipeline. This can serve as a jumping-off point if we're left with too few sequences at the end, to help point us towards where we should start digging into where they are being dropped. Here's a slightly modified version (don't mind that this may look super-busy and confusing – like most things, if broken down piece-by-piece it's less intimidating):
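Here's a sketch of what that modified version might look like (variable names follow the sketches above; `getUniques()` is a dada2 helper, and the column names are based on the ones referenced in the breakdown further below):

```r
# small helper to get the number of reads represented in a dada2 object
getN <- function(x) sum(getUniques(x))

summary_tab <- data.frame(row.names = samples,
                          dada2_input = filtered_out[, 1],
                          filtered = filtered_out[, 2],
                          dada_f = sapply(dada_forward, getN),
                          dada_r = sapply(dada_reverse, getN),
                          merged = sapply(merged_amplicons, getN),
                          nonchim = rowSums(seqtab_nochim),
                          final_perc_reads_retained = round(rowSums(seqtab_nochim) / filtered_out[, 1] * 100, 1))

summary_tab
```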
And we can see a breakdown of the number of reads/sequences at each stage, and the final percent retained by the end.
And here's a way we could write out that table if we wanted to have it outside of R:
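Based on the breakdown below (adding `sep = "\t"` since we're writing a .tsv file):

```r
write.table(summary_tab, "read-count-tracking.tsv",
            quote = FALSE, sep = "\t", col.names = NA)
```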
CODE BREAKDOWN
- `write.table(...)` - the primary function we are calling
  - `summary_tab` - this first positional argument is the variable name holding the object we want to write out
  - `"read-count-tracking.tsv"` - this second positional argument is specifying the name of the file we want to create/write to
  - `quote = FALSE` - this is saying we don't want the fields to be surrounded by quotation marks
  - `col.names = NA` - this little nuanced part is needed just because of how our table object is set up (specifically because it has the sample IDs as rownames, rather than as a column); to write the table out with the column names (which start with 'dada2_input') bumped over one field, we need to set `col.names` to `NA` like this, otherwise the 'dada2_input' column name would end up on top of our sample IDs
After that command, that table now exists as a file on our system, and no longer just as a temporary object in R.
To assign taxonomy, we are going to use the DECIPHER package. There are some DECIPHER-formatted databases available here, which is where the SILVA v138 comes from that we will use below.
The below is commented out because this is the most computationally intensive step and would take maybe a half hour or longer with all of us running it at the same time. So we are just going to load the resultant object from a previous run. But here is what the code was, and how I saved the output object:
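For reference, here is a commented-out sketch of what that step might look like (the training-set filename is a placeholder for whichever SILVA v138 DECIPHER file was downloaded, the loaded object name and IdTaxa arguments follow DECIPHER's documented usage, and the saved-object filename is hypothetical):

```r
## library(DECIPHER)

## # loading the DECIPHER-formatted SILVA v138 reference (downloaded separately);
## # this creates an object called 'trainingSet'
## load("SILVA_SSU_r138_2019.RData")

## # making a DNAStringSet of our ASV sequences
## dna <- DNAStringSet(getSequences(seqtab_nochim))

## # classifying (this is the computationally intensive part)
## tax_info <- IdTaxa(dna, trainingSet, strand = "both", processors = NULL)

## # saving the output object so it can be loaded later without re-running
## saveRDS(tax_info, "tax_info.RDS")
```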
This will read in the `tax_info` object we would have created with the above:
The typical standard outputs from amplicon processing are a fasta file of our unique ASVs, a count table showing how many times each unique ASV was detected in each sample, and a taxonomy table linking our ASV IDs to their assigned taxonomy. Here is one way we can generate those files from our dada2 objects in R.
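Here's a sketch of that, giving the ASVs simpler names (ASV_1, ASV_2, ...) and writing out a fasta, a count table (stored in `asv_tab`, which is used with decontam below), and a taxonomy table. The output filenames are just suggestions, and the taxonomy conversion assumes `IdTaxa()`'s default "extended" output format:

```r
# giving our sequences more manageable ASV names
asv_seqs <- colnames(seqtab_nochim)
asv_headers <- paste0(">ASV_", 1:ncol(seqtab_nochim))

# writing out a fasta of our final ASV sequences
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")

# making and writing out a count table (ASVs as rows, samples as columns)
asv_tab <- t(seqtab_nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)
write.table(asv_tab, "ASV-counts.tsv", sep = "\t", quote = FALSE, col.names = NA)

# making and writing out a taxonomy table from the IdTaxa output
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
asv_tax <- t(sapply(tax_info, function(x) {
    x$taxon[match(ranks, x$rank)]    # assignment at each rank (NA where unassigned)
}))
colnames(asv_tax) <- ranks
row.names(asv_tax) <- sub(">", "", asv_headers)
write.table(asv_tax, "ASV-taxonomy.tsv", sep = "\t", quote = FALSE, col.names = NA)
```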
Now those output files are no longer confined to just the world of R either :+1:
In addition to dada2, @bejcal et al. have also created a widely used program for removing contaminants based on incorporated blanks called decontam (Nicole Davis et al. publication here). As usual, they have also provided excellent documentation and have a vignette here showing an example of doing this from a phyloseq object and discussing the various ways their program can be implemented (such as incorporating DNA concentrations if available). Here, we will apply it without DNA concentrations – instead using ASVs detected in the incorporated blanks – starting from our count table generated above.
For decontam to work on our data, we need to provide it with our count table, currently stored in the "asv_tab" variable, and we need to give it a logical (TRUE/FALSE) vector that tells it which samples are "blanks". Here is making that vector and running the program:
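A sketch of that, identifying the blank by its sample name (decontam's `isContaminant()` expects samples as rows, so we transpose our count table):

```r
library(decontam)

# logical vector indicating which samples are blanks (TRUE for the blank)
vector_for_decontam <- grepl("DNAexBlank", colnames(asv_tab))

# running the prevalence-based contaminant identification
contam_df <- isContaminant(t(asv_tab), neg = vector_for_decontam)

# checking whether any ASVs were flagged as contaminants
table(contam_df$contaminant)
```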
It seems none were identified as contaminants, as there would be a count for "TRUE" there if any were.
Let's look at where the counts are for the "blank" sample:
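One way to check that (a sketch, pulling out the ASVs with any counts in the blank and comparing their counts in the blank vs. the real samples):

```r
# which column is the blank
blank_col <- grepl("DNAexBlank", colnames(asv_tab))

# ASVs that have any counts in the blank
blank_asvs <- rownames(asv_tab)[asv_tab[, blank_col] > 0]

# their counts in the blank vs. summed across all other samples
data.frame(in_blank = asv_tab[blank_asvs, blank_col],
           in_samples = rowSums(asv_tab[blank_asvs, !blank_col, drop = FALSE]))
```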
And it seems none of the ASVs from the blank are in any of the actual samples. So we don't need to worry about removing any sequences from our data in this case, but if we had some, there is example code here demonstrating a way that we could.
First we would want to save our R script. We can do that with keyboard shortcuts while in the active RStudio browser, or we can click the disk icon at the top of the page.
When we close R, usually all of our variables and objects in there will go away - as they are only stored in memory. Lots of times this is fine, but if we wanted to, we could save them all to a file like this:
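In base R that's the `save.image()` function (the filename is just a suggestion):

```r
save.image("amplicon-processing.RData")
```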
Which will write the data to that file in our current working directory (".RData" is just the conventional extension for this format).
We can get rid of all our current environment information by running the following:
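```r
rm(list = ls())
```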
And then in a future R session, we could load them back into the active environment with this command:
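Assuming the filename used above:

```r
load("amplicon-processing.RData")
```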
For us for now, to exit R/Rstudio on our system, we can click 'File' -> 'Quit Session…' at the bottom:
This will give us an "R Session Ended" window, and we can then close the browser tab.
And back at our terminal that is connected to the server and has been kindly connecting us to RStudio this whole time, we can press `ctrl + c` to get our regular prompt back :+1:
Back in our terminal window, we can see the files we've created as the primary outputs from the processing.
We are probably still in the correct location, but we can run this in case we aren't:
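For example, matching the hypothetical working directory from the start of this page:

```bash
cd ~/amplicon-work-${USER}
```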
And with those primary outputs in hand, we're ready to finally move on to analysis!