Here we are going to walk through a tutorial for processing 16S data from 22 samples from Mono Lake :+1:
These are paired-end, 2x300, Illumina sequencing data, with primers targeting the V4V5 region of the small-subunit ribosomal RNA gene.
These pages are derived from my amplicon processing and analysis tutorial here at Happy Belly Bioinformatics, but modified so the code works with our Mono Lake data. The Happy Belly version uses a different tutorial dataset, and may have more links and information in some places.
This is the processing page (the first part). The analysis page (the second part) can be found here.
Be sure to be logged into our server. Once on our server, we can activate the environment that holds all our required programs by executing the following at the command line:
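If the environment is managed with conda, for example, activating it might look something like this (the environment name here is just a placeholder - use whatever our course setup specifies):

```bash
conda activate amplicon-env    # hypothetical environment name
```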
And we are going to create and change into a directory to do this work. The following uses a variable that holds our unique username, that way the same command works properly for all of us:
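A sketch of what that might look like, using the `$USER` variable that bash sets for us automatically (the base directory name is just an example):

```bash
mkdir -p ~/amplicon-work-${USER}    # hypothetical directory name
cd ~/amplicon-work-${USER}
```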
Our starting read files are located somewhere else, but we will point to them when the time comes :+1:
Our starting point
We are starting with 1 pair of read files for each of our samples (a forward and a reverse read file for each), and these still have primers on them (the small, highly conserved sequences that were used to amplify our targets).
We'll talk about these steps more as we get to them, but here's an overview table of the main steps we'll be doing as part of processing amplicon data with dada2:
| | Command | What we're doing |
|---|---|---|
| 1 | `cutadapt` | remove the primers |
| 2 | `filterAndTrim()` | quality trim/filter |
| 3 | `learnErrors()` | generate an error model of our data |
| 4 | `dada()` | infer ASVs (Amplicon Sequence Variants) on both forward and reverse reads independently |
| 5 | `mergePairs()` | merge forward and reverse ASVs to produce full-length ASVs |
| 6 | `makeSequenceTable()` | generate a count table |
| 7 | `removeBimeraDenovo()` | screen for and remove chimeras |
| 8 | `IdTaxa()` | assign taxonomy |
It is convenient to have a file that holds the unique portion of our sample names for purposes of looping through them and naming things in R. So here, we are going to use some command-line commands to help us make a file that holds unique IDs for each sample.
This integrates a lot of what we covered in the Unix crash course. The code blocks below are broken down one piece or step at a time, which is exactly how I built up the final command in real time.
First, here is where our input files are located, and how we can see a list of just the forward read files:
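For example, using the input directory noted further below, listing just the forward (R1) read files might look like this:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz
```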
One sample file looks like this:
MLW-0521-05_S186_L001_R1_001.fastq.gz
The unique portion of that, what I would consider a unique sample ID for our purposes here, is this part up front:
MLW-0521-05_S186
So first, to make a file holding those unique portions, we want to get rid of all the path/address information up to the actual filenames. We can use `cut`, specifying '/' as the delimiter, and saying we want to keep the 5th "column":
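Building on the `ls` command above, a sketch of that:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5
```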
Because these are all nicely formatted the same way, we can also use `cut` to keep just the part before the `_L001_R1_001.fastq.gz`, by setting the delimiter to '_' and keeping the first and second "columns":
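Adding that onto the pipe:

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5 | cut -d "_" -f 1,2
```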
That's the good stuff, so now that we know we have what we want, we can bring that command back up (with the up arrow while in our command-line window) and add our `>` redirector in order to write it to a new file:
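Something like this (the output filename matches the 'unique-sample-IDs.txt' file used throughout the rest of this page):

```bash
ls /data1/data/Tag_2021Jun_2x300/*_R1_*.fastq.gz | cut -d "/" -f 5 | cut -d "_" -f 1,2 > unique-sample-IDs.txt
```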
Then we can peek at it with `head` if wanted:
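```bash
head unique-sample-IDs.txt
```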
Primers are the short fragments of highly conserved DNA that we used to amplify our target sequences. We need them to do targeted sequencing like this, but they can introduce artificial bases to our true biological sequences, so it is important to remove them (especially when using a single-nucleotide resolution method rather than more traditional OTU-clustering).
We will use cutadapt (version 3.4 on our system) to do this step.
Here are the primers that were used to generate this dataset:
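The primer sequences themselves aren't shown on this page, but based on the targeted region (positions 515–926 of the 16S gene) and the primer lengths noted below (19 and 20 bps), they are presumably the standard 515F/926R V4V5 pair - treat these as an assumption and double-check them against the actual sequencing information before using them:

```
515F (forward, 19 bp): GTGYCAGCMGCCGCGGTAA     # assumed
926R (reverse, 20 bp): CCGYCAATTYMTTTRAGTTT    # assumed
```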
It's important to make sure we are trimming the primers properly, so it's a good idea to visually check a file or two to make sure things are working like we expect before we run it on all of our samples.
Here we will run the `cutadapt` command we think is appropriate on just one sample's files first, look at the output information from `cutadapt`, and peek at the reads themselves before and after `cutadapt` to make sure the primers were cut off like we expect. Then, when we trust our command, we'll write it into a loop in order to run it on all samples for us.
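Here's a sketch of what that single-sample command might look like (the primer sequences are the assumed 515F/926R pair noted above, and the output filenames are just placeholders; the flags match the breakdown below):

```bash
# note: the primer sequences below are assumed - double-check them first
cutadapt -g GTGYCAGCMGCCGCGGTAA \
    -G CCGYCAATTYMTTTRAGTTT \
    -o MLW-0521-05_S186_R1_trimmed.fastq.gz \
    -p MLW-0521-05_S186_R2_trimmed.fastq.gz \
    --discard-untrimmed \
    /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R1_001.fastq.gz \
    /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R2_001.fastq.gz
```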
Note that the `\` characters (backslashes, which act as escape characters at the command line) are only there so that longer code can be written across multiple lines - they tell the command line to ignore the next character, which here is a newline character. That's done just for readability on the page.
CODE BREAKDOWN
- `cutadapt` - this is our base command
- `-g` - here is where we provide the forward primer that will be searched on the forward read
- `-G` - here is where we provide the reverse primer that will be searched on the reverse read
- `-o` - this is where we specify the name we want for the output forward read file
- `-p` - this is where we specify the name we want for the output reverse read file
- `--discard-untrimmed` - this tells `cutadapt` to throw away any reads where it doesn't find the primers as we expect them to be (since they should be there for this dataset, their absence likely indicates a problem with a read)
- the last two positional arguments are the forward and reverse read input files (these need to be provided in that order)
Some information is printed to the screen when running that; here is a look at some of that output:
98.9% of reads were written. Since we have the `--discard-untrimmed` flag specified in our command, this means it successfully found the primers in all of those, which is great – most of our reads had the primers just like we expected.
The forward primer was found and trimmed in almost all forward reads. The lengths that were cut off look good too – almost all of them were the full forward primer length, which is 19 bps.
The reverse primer was also found and trimmed in almost all reverse reads. The lengths that were cut off again look good – almost all of them were the full 20 bps of the reverse primer.
Those outputs are pretty clean, but we can also peek at the before and after read files themselves just to visually confirm the primers are indeed removed from the output files:
Peeking at a forward read before `cutadapt`:
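One way to peek at the first read in the gzipped fastq file (just a sketch using standard `zcat` and `head`):

```bash
zcat /data1/data/Tag_2021Jun_2x300/MLW-0521-05_S186_L001_R1_001.fastq.gz | head -n 4
```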
Peeking at a forward read after `cutadapt`:
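And the same peek on the (hypothetical) trimmed output file from the sketch above:

```bash
zcat MLW-0521-05_S186_R1_trimmed.fastq.gz | head -n 4
```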
And we can see that the leading GTGTCAGCAGCCGCGGTAA is missing from the forward read output file, which matches our forward primer :+1:
Peeking at a reverse read before `cutadapt`:
Peeking at a reverse read after `cutadapt`:
And we can also see that the leading CCGTCAATTTATTTAAGTTT is missing from the reverse read output file, which corresponds to our reverse primer :+1:
Since this is super-quick on this sample, and we'll be running it on all of them next, let's delete those test output files we just made:
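Assuming the test output names used in the sketch above:

```bash
rm MLW-0521-05_S186_R1_trimmed.fastq.gz MLW-0521-05_S186_R2_trimmed.fastq.gz
```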
Now that we trust our `cutadapt` command to work on these specific samples, we can run one command to process them all if we do it with a for loop.
It can be helpful to organize our files, so let's make a directory that will hold our primer-trimmed read files:
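For example (the directory name is just a suggestion, used in the loop sketch further below):

```bash
mkdir primer-trimmed-reads
```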
Now we'll start building up our for loop. First we'll just add a line that prints what sample we're working on to our loop and run that:
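Using the sample-ID file we made earlier, that starting loop might look like this:

```bash
for sample in $(cat unique-sample-IDs.txt)
do
    echo "On sample: ${sample}"
done
```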
Our starting reads are somewhere else, remember. So to actually point to the files, we need to add that location and the suffixes that came after the unique portions we kept as sample IDs. It can be helpful to check we are doing this right just with `ls`, instead of trying to run something like `cutadapt` first.
So here we'll modify it to run `ls` on each expected read file for each sample. This way we can spot now if we are using the variable and adding the path (address) and filename suffixes properly.
- The path to the directory holding our input reads is: `/data1/data/Tag_2021Jun_2x300/`
- The forward read suffix is: `_L001_R1_001.fastq.gz`
- The reverse read suffix is: `_L001_R2_001.fastq.gz`
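Putting those together with our `${sample}` variable, the check might look like this:

```bash
for sample in $(cat unique-sample-IDs.txt)
do
    ls /data1/data/Tag_2021Jun_2x300/${sample}_L001_R1_001.fastq.gz
    ls /data1/data/Tag_2021Jun_2x300/${sample}_L001_R2_001.fastq.gz
done
```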
If those were not pointing to the correct location, we would get a `No such file or directory` message.
So now that we are looping through our sample IDs properly, and we know how to point to our input read files based on those sample IDs, we can place our original `cutadapt` command in the body of our loop and modify it as needed.
Remember this is what our single-sample command looked like:
The 2 things we need to change from our single-sample command are:

- the forward read (`-o`) and reverse read (`-p`) output arguments need to have the `${sample}` variable in their names in order to be unique for each sample
- the input files need to point to their location and use the `${sample}` variable, as we tested above

And there's 1 thing we will add for convenience:

- we are going to send the information that normally prints to the screen into a log file, using the `>>` redirector in this case so that we are appending to the file, rather than overwriting it each iteration of the loop (which is what the single `>` would do), along with `2>&1` so that any error messages are written to that file too
So our final loop would look something like this:
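Here's a sketch of that final loop (the primer sequences, output directory, and log-file name are placeholders consistent with the sketches above):

```bash
for sample in $(cat unique-sample-IDs.txt)
do

    echo "On sample: ${sample}"

    # primer sequences are the assumed 515F/926R pair - double-check them first
    cutadapt -g GTGYCAGCMGCCGCGGTAA \
        -G CCGYCAATTYMTTTRAGTTT \
        -o primer-trimmed-reads/${sample}_R1_trimmed.fastq.gz \
        -p primer-trimmed-reads/${sample}_R2_trimmed.fastq.gz \
        --discard-untrimmed \
        /data1/data/Tag_2021Jun_2x300/${sample}_L001_R1_001.fastq.gz \
        /data1/data/Tag_2021Jun_2x300/${sample}_L001_R2_001.fastq.gz \
        >> cutadapt-log.txt 2>&1

done
```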
And when we run it, it will print out each sample as it works on them (doesn't take long).
We can look at the output log file with the `less` program:
And we can see some info about the first sample right away. Like we saw with the test one we did above, about 98.5% of reads were written out (so primers were found and trimmed).
Instead of scrolling through and looking for this value for all of the samples, we can easily pull that info out into a table. Remember, to exit `less`, we can press the `q` key.
Remember grep is our pattern finder that will grab each line that matches what we search for. So let's look for a word that is only in the lines that we want to pull out of the file, like this line:
"Pairs written (passing filters): 34,261 (98.5%)"
Here, we'll `grep` for "written":
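Assuming the log-file name from the loop sketch above:

```bash
grep "written" cutadapt-log.txt
```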
But that's also grabbing another line about the bps being written, instead of just the one we want about how many read-pairs were written. If this weren't immediately apparent, we can also tell by quickly counting the number of lines returned that there are more than the number of samples we expect (which is 23 in this case):
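One quick way to count them:

```bash
grep "written" cutadapt-log.txt | wc -l
```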
So let's try just "Pairs":
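```bash
grep "Pairs" cutadapt-log.txt
```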
That gets the lines we want. And remember we can use `cut` to split based on columns defined by any single-character delimiter we want. So we can `cut` out our percent if we use the "(" symbol as a delimiter and specify the appropriate "column" we want:
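Counting the "(" characters in the example line above, the percent sits in the 3rd "column", so something like:

```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3
```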
That leaves us with the trailing ")" though. One way we can delete that character is with the `tr` command, adding the `-d` flag which specifies to delete all occurrences of the given character (rather than replacing it with something else):
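```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3 | tr -d ")"
```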
Now, it would be nice to have the sample names with these also. Since this was done in a loop based on the 'unique-sample-IDs.txt' file, we know these are all in the same order. So we can just combine them with the paste command.
This could all be done in one line, but we'll break it apart first by putting our percent values into a file. Here we can use our single `>` redirector, which lets us write the output to a file instead of just printing it to the terminal:
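For example (the output filename is just a suggestion):

```bash
grep "Pairs" cutadapt-log.txt | cut -d "(" -f 3 | tr -d ")" > primer-trimming-percents.txt
```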
And we can stick that file together with our unique-sample-IDs.txt file with `paste` like so:
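```bash
paste unique-sample-IDs.txt primer-trimming-percents.txt
```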
All look pretty good except the "blank" sample, which also has something weird going on with read length (some reads are different lengths than others), which we can see by taking a peek at it:
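One way to take that peek, for example printing the lengths of the first few sequences in the blank's forward read file (just a sketch):

```bash
zcat /data1/data/Tag_2021Jun_2x300/MLW-DNAexBlank-50cyc_S193_L001_R1_001.fastq.gz | head -n 20 | awk 'NR % 4 == 2 { print length($0) }'
```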
Let's look at the cutadapt log for this sample in the `less` program. Once the program is open, we can search for specific text if we want by first pressing the `/` key, then typing/pasting what we want to look for and hitting `return`. Here we'll search for the "blank" sample name: MLW-DNAexBlank-50cyc_S193
Here is the primary summary output for that sample:
We can see from this that most of the forward reads don't have the primer we are looking for, and only about 38% of the reverse did. But we still have about 14,000 read-pairs for which the primers were successfully found.
Remember, to exit `less`, we can press the `q` key.
Since we still have ~14,000 reads, and since this is a "blank", I'm not going to worry that something else weird is going on and we'll just ask Dan about it when we get to this point 🙂
The program we're using to recover our actual amplicon sequences is dada2, which runs in R. Now that we have trimmed our primers, we're able to hop into R and do the rest of the processing there.
To access R/RStudio on our setup, we want to run the following command in the server terminal:
This should print a link to our terminal that we can go to in our browser to access our RStudio environment :+1:
We want to do our work in the text editor in the top-left window of RStudio. From here we can conveniently work on our code and run single sections/lines as we want.
After it loads up in our browser, click 'File' -> 'New File' -> 'R Script':
And that will pop open the top-left window for us if it isn't there already.
This should be where we copy-and-paste/type most of our code.
And the first thing we should do is change to the location we've been working in, here is how we can do that in R:
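For example, pointing R at the working directory we made at the start (the path here is hypothetical, matching the earlier sketch - adjust as needed):

```r
# setting the working directory to where we've been working at the command line
setwd(paste0("~/amplicon-work-", Sys.getenv("USER")))
```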
Now that we're in our appropriate working directory, it's a good idea to save our Rscript (this text document). We can do 'File' -> 'Save':
And name it something like 'dada2-processing.R'. The ".R" extension is just convention for an R script.
Next, we are loading the `dada2` package and printing out which version it is:
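```r
library(dada2)
packageVersion("dada2")
```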
Now we are setting up some variables that will make running the following commands easier:
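One likely piece of that is reading our sample IDs into R from the file we made earlier (a sketch; the variable name `samples` is used in later sketches):

```r
# reading in our unique sample IDs as a character vector
samples <- scan("unique-sample-IDs.txt", what = "character")
```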
The dada2 commands are built to run on multiple samples at once. To make that easier, we are going to build out variables that hold the full paths to the read files:
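A sketch of building those vectors, using the variable names referenced in the breakdowns below (the primer-trimmed directory and the file suffixes are placeholders matching the earlier loop sketch):

```r
# paths to our primer-trimmed input reads (from the cutadapt loop)
forward_reads <- paste0("primer-trimmed-reads/", samples, "_R1_trimmed.fastq.gz")
reverse_reads <- paste0("primer-trimmed-reads/", samples, "_R2_trimmed.fastq.gz")

# paths for the quality-filtered reads we are about to make
filtered_forward_reads <- paste0(samples, "_R1_filtered.fastq.gz")
filtered_reverse_reads <- paste0(samples, "_R2_filtered.fastq.gz")
```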
We can get an overview summary image of the quality of our reads with the following function, to start we are just going to look at one forward and one reverse read file:
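Using dada2's plotting function, that might look like:

```r
plotQualityProfile(forward_reads[1])
plotQualityProfile(reverse_reads[1])
```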
Forward reads of one sample
On these plots, the bases are along the x-axis, and the quality score on the y-axis. The black underlying heatmap shows the frequency of each score at each base position, the green line is the mean quality score at that base position, the orange is the median, and the dashed orange lines show the quartiles. The red line at the bottom shows what percentage of reads are that length.
In Phred talk:
| Phred value | Error probability |
|---|---|
| 40 | 1 in 10,000 |
| 30 | 1 in 1,000 |
| 20 | 1 in 100 |
| 10 | 1 in 10 |
So these look great; only towards the end of the reads does the mean start to cross below a quality score of 30 (1 in 1,000 error probability).
Reverse reads of one sample
As is common, the reverse reads drop in quality towards the end of the reads. At around 200 bps, the mean quality score starts to drop below 30 (1 in 1,000 error probability).
In a case like this, I would consider cutting these reverse reads off at around 200 bps IF we have enough overlap to do so (which we do here, as discussed next).
Our primers target positions 515-926 (relative to positions in a reference 16S gene), so nominally we are trying to span a fragment that is 926 - 515 = 411 bases long. We have reads that are reading 300 bps from each side of this fragment, so without any trimming our overlap would be about:
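Roughly (a back-of-the-envelope estimate, ignoring the primers):

```r
(300 + 300) - 411    # ~189 bps of overlap before any trimming
```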
Or, thought of a different way:
We need to be careful when trimming that we don't trim off so much that the reads won't overlap anymore, and will therefore fail to merge later:
We want at least about 12 bases of overlap for our merge step to work well (that's the default minimum for dada2's `mergePairs()` command). Treating that as ~20, just to allow some room for variation, means we can cut at most about 170 bases combined between the forward and reverse reads and still have enough overlap to merge our reads successfully.
This varies by dataset based on the primers used and the length of the reads, so it is something that should be considered for each different combination of those. If we over-trim, and our reads no longer overlap each other at all, then we cannot merge them and will have problems later.
All forward reads
We can plot all samples' forward reads at once like so (will take about a minute):
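```r
plotQualityProfile(forward_reads)
```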
This can let us take a quick look to see if any particular sample seems to be drastically different from the rest.
All reverse reads
There aren't any samples that jump out looking very different.
Moving forward with trimming: based on what we looked at above, I'd go with trimming the forward reads back to 250 bps and the reverse back to 200 bps, which means we're trimming 150 bps total (less than our upper limit of ~170 as decided above).
Here's our quality-filtering command, which should take about a minute, and we'll break it down next:
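Based on the breakdown that follows, the command would look something like this:

```r
filtered_out <- filterAndTrim(forward_reads, filtered_forward_reads,
                              reverse_reads, filtered_reverse_reads,
                              maxEE = c(2, 2), truncLen = c(250, 200),
                              multithread = 4)
```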
CODE BREAKDOWN
- `filtered_out <-` - this is where we are specifying the variable to hold the regular output of whatever we are about to run
- `filterAndTrim(...)` - here is the primary function we are calling
  - `forward_reads` - this first positional argument is the variable that holds the paths to our input forward reads
  - `filtered_forward_reads` - this second positional argument is the variable that holds the paths to where we want the output forward reads to be written
  - `reverse_reads` - this third positional argument is the variable that holds the paths to our input reverse reads
  - `filtered_reverse_reads` - this fourth positional argument is the variable that holds the paths to where we want the output reverse reads to be written
  - `maxEE = c(2, 2)` - this is a quality-filtering threshold based on expected errors; here we are saying we want to drop a read if it has more than 2 expected erroneous base calls (we specify this for both the forward and reverse reads by providing it as `c(2, 2)`)
  - `truncLen = c(250, 200)` - this is telling the program to, right off the bat, trim the forward reads to 250 bps (cutting off the last 50) and the reverse reads to 200 bps (cutting off the last 100)
  - `multithread = 4` - here we are telling it to run things in parallel using up to 4 threads
That command wrote out the new read files to the locations we had specified in the `filtered_forward_reads` and `filtered_reverse_reads` objects; we can see them if we run this:
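For example (matching the hypothetical filtered-file names used in the sketch above):

```r
list.files(pattern = "_filtered.fastq.gz")
```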
And it also stored some info in the R object `filtered_out`, telling how many reads we had at the start and how many made it through the quality-filtering we ran:
And we can see they all maintained just under 80% of total reads:
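One quick way to compute that from `filtered_out`, which holds the reads in and reads out for each sample (a sketch):

```r
# percent of reads retained per sample after quality filtering
round(filtered_out[, 2] / filtered_out[, 1] * 100, 1)
```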
And most are in the tens of thousands of reads, so I think this is fine to move forward with. If we were losing more reads than we wanted, we could revisit how we are doing the quality-trimming. There are no one-size-fits-all protocols for most of these things. All things being equal, we want the highest quality data we can have. But we also want some data, and there may be times when a sequencing run came back in poorer shape than these, and we might want to lower our stringency in the balance between "the best data we can get" and "working with what we have".
Let's take a look at the quality-filtered quality profiles:
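```r
plotQualityProfile(filtered_forward_reads[1])
plotQualityProfile(filtered_reverse_reads[1])
```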
Quality-filtered forward reads of one sample
Looks great.
Quality-filtered reverse reads of one sample
Also looks much better :+1:
And we can also again run them all together:
All quality-filtered forward reads
No sample stands out.
All quality-filtered reverse reads
And all reverse read profiles look similar too :+1:
Next up is generating our error model by learning the specific error-signature of our dataset. Each sequencing run, even when all goes well, will have its own subtle variations to its error profile. This step tries to assess that for both the forward and reverse reads, and then dada2 incorporates that information when it is trying to distinguish actual biologically derived sequences from sequences generated through sequencing error.
We run this separately on the forward and reverse reads; each should take about 4-5 minutes, as done here:
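Using the variable names referenced below (the reverse-read object name is the natural counterpart and is an assumption), that might look like:

```r
err_forward_reads <- learnErrors(filtered_forward_reads, multithread = 4)
err_reverse_reads <- learnErrors(filtered_reverse_reads, multithread = 4)
```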
Here the first positional argument is the variable holding the paths to all of our filtered reads, and the `multithread = 4` part is saying to run things in parallel where possible with up to 4 threads.
If we had a problem running this, we can just load my data object with the following lines:
After that's done, the developers have incorporated a plotting function to visualize how well the estimated error rates match up with the observed. The plots look very similar between forward and reverse, so here is just the reverse:
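That plotting function is dada2's `plotErrors()`; for the reverse-read error model it might look like:

```r
plotErrors(err_reverse_reads, nominalQ = TRUE)
```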
Ben goes into how to assess this a bit here. Each plot shows, for a given base-swap, how the observed error rates (the points plotted) compare to estimated error rates (the black line). And we want those to match up pretty well (as they do here). I've yet to see one of these that really looked any different from this or Ben's example, so I don't really know how "off" things would be when we should actually have a concern. I'd just write the dada2 developers if I ran into anything strange ¯\_(ツ)_/¯
Here is where we finally get to doing what dada2 was born to do, and that is to try to infer true biological sequences in our data. dada2 does this by incorporating the consensus quality profiles and abundances of each unique read, along with the error profiles it modeled, and then figuring out if each sequence is more likely to be of biological origin or more likely to be spurious. If interested, you can read more about the details of this in the paper of course or looking through the dada2 site.
This takes about 2-3 minutes to run for each:
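Based on the breakdown below (the reverse-read call mirrors the forward one and is an assumption), that would be something like:

```r
dada_forward <- dada(filtered_forward_reads, err = err_forward_reads,
                     pool = "pseudo", multithread = 4)
dada_reverse <- dada(filtered_reverse_reads, err = err_reverse_reads,
                     pool = "pseudo", multithread = 4)
```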
CODE BREAKDOWN
- `dada_forward <-` - this is where we are specifying the variable to hold the regular output of what we are running
- `dada(...)` - the primary function we are calling
  - `filtered_forward_reads` - this first positional argument is the variable that holds the paths to our input forward reads (our filtered reads in this case)
  - `err = err_forward_reads` - here we are passing the variable that holds the output error profile for the forward reads to the `err` argument
  - `pool = "pseudo"` - this is a method to share data across different samples to help when trying to infer true biological sequences (can read more about it here)
  - `multithread = 4` - telling the program to run things in parallel where possible, using up to 4 threads
Now dada2 merges the forward and reverse sequences to reconstruct our full target amplicon. By default, it requires the overlapping region to be identical between the two, which is a good way to go when we are using a single-nucleotide resolution approach like this.
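Based on the breakdown below, the merging command would look something like this:

```r
merged_amplicons <- mergePairs(dada_forward, filtered_forward_reads,
                               dada_reverse, filtered_reverse_reads)
```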
CODE BREAKDOWN
- `merged_amplicons <-` - this is where we are specifying the variable to hold the regular output of what we are running
- `mergePairs(...)` - the primary function we are calling
  - `dada_forward` - this first positional argument is the variable holding the object we created with the `dada()` function above that we ran on the forward reads
  - `filtered_forward_reads` - this second positional argument is the variable that holds the paths to our input forward reads (our filtered reads in this case)
  - `dada_reverse` - this third positional argument is the variable holding the object we created with the `dada()` function above that we ran on the reverse reads
  - `filtered_reverse_reads` - this fourth positional argument is the variable that holds the paths to our input reverse reads (our filtered reads in this case)
Now we can generate a count table with the `makeSequenceTable()` function. This is one of the typical main outputs from processing an amplicon dataset: counts of how many times each unique sequence (ASVs here) was detected in each sample.
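A sketch of that step:

```r
seqtab <- makeSequenceTable(merged_amplicons)
dim(seqtab)    # samples by unique sequences
```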
We can see from the size of `seqtab` that there are 12,787 unique sequences across our dataset. But it's not very friendly to look at in its current form because the actual sequences are the column names – so we'll make a more manageable table in a couple of steps.
dada2 identifies likely chimeras by ranking all the unique ASVs in order of abundance, and then seeing if there are any lower-abundance sequences that can be made exactly by mixing left and right portions of two of the more-abundant ones. These are then removed:
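A sketch of that step (the output variable name is just a common naming convention):

```r
seqtab_nochim <- removeBimeraDenovo(seqtab, multithread = 4, verbose = TRUE)
```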
So, it's not unusual to have a lot of these identified, but being up around 90% as is the case here is getting closer to the unusual side of things. I've looked into this before, and did again after seeing what this dataset looked like. The dada2 github issues forum has quite a few posts from people asking about high numbers of chimeras being identified, and even some as high as 97% of unique sequences (like this one) sometimes happen. Ben (dada2 developer) notes in that thread that particularly amplicons that span multiple hypervariable regions (which ours do here, spanning V4V5) can be prone to produce many chimeras like this. So though this is a bit on the high end of what's typical in terms of unique sequences, what's more important is what we are losing in terms of abundance, as is noted on the dada2 FAQ:
> It is common for chimeric sequences to be a majority (even large majority) of the total number of unique sequence variants inferred by the `dada(...)` algorithm, but they should not be a majority of the sequencing reads. That is, there may be many chimeras, but they are relatively low abundance.
If we want to load this instead of letting it finish, we can do so by running this line in R:
Although we lost almost 90% of our unique sequences, we don't know if those particular sequences held a lot in terms of abundance yet. Here is one quick way we can look at that:
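One quick check is comparing the total counts before and after chimera removal (using the object names from the sketches above):

```r
sum(seqtab_nochim) / sum(seqtab)    # proportion of total reads retained (~0.67 here)
```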
In terms of abundance, we are retaining about 67% of reads following the chimera removal.
A common computational problem that can lead to high chimeras is if the primers weren't properly removed, but that's not the case with this dataset. So given that there are reported cases of even higher proportions of unique sequences being identified as chimeras, and that we are keeping about 67% of the total reads, I would move forward with things here.
The developers' dada2 tutorial provides an example of a nice, quick way to pull out how many reads were dropped at various points of the pipeline. This can serve as a jumping-off point if we're left with too few sequences at the end, to help point us towards where we should start digging into where they are being dropped. Here's a slightly modified version (don't mind that this may look super-busy and confusing – like most things, if broken down piece-by-piece it's less intimidating):
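Here's a sketch of what that modified version might look like (variable names follow the sketches above; `getUniques()` is a dada2 helper, and the column names are based on the ones referenced in the breakdown further below):

```r
# small helper to get the number of reads represented in a dada2 object
getN <- function(x) sum(getUniques(x))

summary_tab <- data.frame(row.names = samples,
                          dada2_input = filtered_out[, 1],
                          filtered = filtered_out[, 2],
                          dada_f = sapply(dada_forward, getN),
                          dada_r = sapply(dada_reverse, getN),
                          merged = sapply(merged_amplicons, getN),
                          nonchim = rowSums(seqtab_nochim),
                          final_perc_reads_retained = round(rowSums(seqtab_nochim) / filtered_out[, 1] * 100, 1))

summary_tab
```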
And we can see a breakdown of the number of reads/sequences at each stage, and the final percent retained by the end.
And here's a way we could write out that table if we wanted to have it outside of R:
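Based on the breakdown below (adding `sep = "\t"` since we're writing a .tsv file):

```r
write.table(summary_tab, "read-count-tracking.tsv",
            quote = FALSE, sep = "\t", col.names = NA)
```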
CODE BREAKDOWN
- `write.table(...)` - the primary function we are calling
  - `summary_tab` - this first positional argument is the variable name holding the object we want to write out
  - `"read-count-tracking.tsv"` - this second positional argument is specifying the name of the file we want to create/write to
  - `quote = FALSE` - this is saying we don't want the fields to be surrounded by quotation marks
  - `col.names = NA` - this little nuanced part is needed just because of how our table object is set up (specifically because it has the sample IDs as rownames, rather than as a column); to write the table out with the column names (which start with 'dada2_input') bumped over one field, we need to set `col.names` to `NA` like this, otherwise the 'dada2_input' column name would end up on top of our sample IDs
After that command, that table now exists as a file on our system, and no longer just as a temporary object in R.
To assign taxonomy, we are going to use the DECIPHER package. There are some DECIPHER-formatted databases available here, which is where the SILVA v138 comes from that we will use below.
The below is commented out because this is the most computationally intensive step and would take maybe a half hour or longer with all of us running it at the same time. So we are just going to load the resultant object from a previous run. But here is what the code was, and how I saved the output object:
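For reference, here is a commented-out sketch of what that step might look like (the training-set filename is a placeholder for whichever SILVA v138 DECIPHER file was downloaded, the loaded object name and IdTaxa arguments follow DECIPHER's documented usage, and the saved-object filename is hypothetical):

```r
## library(DECIPHER)

## # loading the DECIPHER-formatted SILVA v138 reference (downloaded separately);
## # this creates an object called 'trainingSet'
## load("SILVA_SSU_r138_2019.RData")

## # making a DNAStringSet of our ASV sequences
## dna <- DNAStringSet(getSequences(seqtab_nochim))

## # classifying (this is the computationally intensive part)
## tax_info <- IdTaxa(dna, trainingSet, strand = "both", processors = NULL)

## # saving the output object so it can be loaded later without re-running
## saveRDS(tax_info, "tax_info.RDS")
```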
This will read in the `tax_info` object we would have created with the above:
The typical standard outputs from amplicon processing are a fasta file of our unique ASVs, a count table showing how many times each unique ASV was detected in each sample, and a taxonomy table linking our ASV IDs to their assigned taxonomy. Here is one way we can generate those files from our dada2 objects in R.
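Here's a sketch of that, giving the ASVs simpler names (ASV_1, ASV_2, ...) and writing out a fasta, a count table (stored in `asv_tab`, which is used with decontam below), and a taxonomy table. The output filenames are just suggestions, and the taxonomy conversion assumes `IdTaxa()`'s default "extended" output format:

```r
# giving our sequences more manageable ASV names
asv_seqs <- colnames(seqtab_nochim)
asv_headers <- paste0(">ASV_", 1:ncol(seqtab_nochim))

# writing out a fasta of our final ASV sequences
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")

# making and writing out a count table (ASVs as rows, samples as columns)
asv_tab <- t(seqtab_nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)
write.table(asv_tab, "ASV-counts.tsv", sep = "\t", quote = FALSE, col.names = NA)

# making and writing out a taxonomy table from the IdTaxa output
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
asv_tax <- t(sapply(tax_info, function(x) {
    x$taxon[match(ranks, x$rank)]    # assignment at each rank (NA where unassigned)
}))
colnames(asv_tax) <- ranks
row.names(asv_tax) <- sub(">", "", asv_headers)
write.table(asv_tax, "ASV-taxonomy.tsv", sep = "\t", quote = FALSE, col.names = NA)
```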
Now those output files are no longer confined to just the world of R either :+1:
In addition to dada2, @bejcal et al. have also created a widely used program for removing contaminants based on incorporated blanks called decontam (Nicole Davis et al. publication here). As usual, they have also provided excellent documentation and have a vignette here showing an example of doing this from a phyloseq object and discussing the various ways their program can be implemented (such as incorporating DNA concentrations if available). Here, we will apply it without DNA concentrations – instead using ASVs detected in the incorporated blanks – starting from our count table generated above.
For decontam to work on our data, we need to provide it with our count table, currently stored in the "asv_tab" variable, and we need to give it a logical (TRUE/FALSE) vector that tells it which samples are "blanks". Here is making that vector and running the program:
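A sketch of that, identifying the blank by its sample name (decontam's `isContaminant()` expects samples as rows, so we transpose our count table):

```r
library(decontam)

# logical vector indicating which samples are blanks (TRUE for the blank)
vector_for_decontam <- grepl("DNAexBlank", colnames(asv_tab))

# running the prevalence-based contaminant identification
contam_df <- isContaminant(t(asv_tab), neg = vector_for_decontam)

# checking whether any ASVs were flagged as contaminants
table(contam_df$contaminant)
```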
It seems none were identified as contaminants, as there would be a count for "TRUE" there if any were.
Let's look at where the counts are for the "blank" sample:
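One way to check that (a sketch, pulling out the ASVs with any counts in the blank and comparing their counts in the blank vs. the real samples):

```r
# which column is the blank
blank_col <- grepl("DNAexBlank", colnames(asv_tab))

# ASVs that have any counts in the blank
blank_asvs <- rownames(asv_tab)[asv_tab[, blank_col] > 0]

# their counts in the blank vs. summed across all other samples
data.frame(in_blank = asv_tab[blank_asvs, blank_col],
           in_samples = rowSums(asv_tab[blank_asvs, !blank_col, drop = FALSE]))
```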
And it seems none of the ASVs from the blank are in any of the actual samples. So we don't need to worry about removing any sequences from our data in this case, but if we had some, there is example code here demonstrating a way that we could.
First we would want to save our R script. We can do that with keyboard shortcuts while in the active RStudio browser, or we can click the disk icon at the top of the page.
When we close R, usually all of our variables and objects in there will go away - as they are only stored in memory. Lots of times this is fine, but if we wanted to, we could save them all to a file like this:
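In base R that's the `save.image()` function (the filename is just a suggestion):

```r
save.image("amplicon-processing.RData")
```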
Which will write the data to that file in our current working directory (".RData" is just the conventional extension for this format).
We can get rid of all our current environment information by running the following:
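```r
rm(list = ls())
```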
And then in a future R session, we could load them back into the active environment with this command:
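Assuming the filename used above:

```r
load("amplicon-processing.RData")
```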
For us for now, to exit R/Rstudio on our system, we can click 'File' -> 'Quit Session…' at the bottom:
This will give us an "R Session Ended" window, and we can then close the browser tab.
And back at our terminal that is connected to the server and has been kindly connecting us to RStudio this whole time, we can press `ctrl + c` to get our regular prompt back :+1:
Back in our terminal window, we can see the files we've created as the primary outputs from the processing.
We are probably still in the correct location, but we can run this in case we aren't:
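For example, matching the hypothetical working directory from the start of this page:

```bash
cd ~/amplicon-work-${USER}
```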
And with those primary outputs in hand, we're ready to finally move on to analysis!