Qiime2 Microbiome Analysis Tutorial

Main page
Enterotype Analysis

Before attempting this analysis, be sure you are comfortable working in linux on a HPC cluster. Review the linux tutorial if necessary.

Some other useful resources for microbiome analysis will be the Qiime2 Moving Pictures Tutorial (versions: 2021.4, 2023.5) and my analysis of Lyndell's cownose ray gut contents.

A note on using `screen`

Anytime you'll run a command in linux that might take a while to complete or otherwise be problematic, it's useful to run screen. Starting a screen is like creating a new instance of your linux session, nested within the original one. To start it just type screen. The screen will clear, and you'll see the same prompt. Here you can run some command and while it's running you can return to the original session. Type [control]-[A] and then [D] to "disconnect" from the current screen. Now you're back where you started. To recover the screen again, enter screen -r. If you have multiple screens running at the same time, you'll be prompted to specify the one you want.

There are several suitable screens on:
	145901.pts-1.n26	(Detached)
	145764.pts-1.n26	(Detached)
Type "screen [-d] -r [pid.]tty.host" to resume one of them.

In this example we might return to the first one on the list by entering screen -r 145901

You can also start a screen with a name you specify as in screen -S microbiome.analysis

Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night.

To end a screen session, just type exit from within it. It's good practice to clean-up screens when you're done with them. Also avoid creating "nested screens". Don't run screen when you're already in a screen. That can get confusing, because [control]-[A] [D] will return you to the "base" session.

Data source

In 2017, our lab collected 168 Bombus workers from across Maine and isolated DNA in order to assess their gut microbiomes. Samples were prepped for 16S and 18S targetted libraries at the Dalhousie University Integrated Micorbiome Resource (IMR).

The path to the raw data on nscc is /export/groups/drangeli/Bombus.landscape/
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the IMR from amplicons targetting 16S v4-v5.

515F GTGYCAGCMGCCGCGGTAA
926R CCGYCAATTYMTTTRAGTTT
The product is 410 bp, resulting in a 190 bp overlap.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. Link

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. Link

And 16S v6-v8.

B969F ACGCGHNRAACCTTACC
BA1406R ACGGGCRGTGWGTRCAA
The product is 450 bp, resulting in a 150 bp overlap.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link

Using Singularity

Note that the singularity container doesn't initially have access to the folder you're working in, e.g. /export/groups/drangeli. The data folder must be "bound" within the container using the --bind arguement. This creates a folder at the root of the container, /data, that contains the data in the working directoty (that is, the output of pwd, such as /export/groups/drangeli/beeSeq.project/16s.v4v5/. The /data folder only exists within the container. So paths in the qiime call refer to files using the /data/ path.

As of summer 2023, singularity seems unable to bind a specific path. The --bind arguement seems to have no effect. Instead singularity defaults to binding the user's home folder. While this is inconvenient, it means you simply need to work in that folder. So, the modified alias might be alias qiime singularity run /export/groups/singularity/qiime_202104.simg qiime I'm hoping to find a better solution soon!

Start by running screen. Then define an alias to run Qiime2 via singularity. The alias will stay active as long as the screen instance is maintained.

alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime

Get out of the screen (and leave it running) by typing the keys CTRL+A, then D. When you're finally done using the screen, enter the command exit.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

More on Singularity and data mounting
There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument --bind "$PWD":/export/home/drangeli will replace the starting folder in the container with the contents of the current directory outside the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit \data\ binding because it seems safer.

I promise to clean this up later!

Read processing

For the 16S v4-v5 reads, I'm working in nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/. All the steps will be mirrored for the v6-v8 data in ../16s.v6v8/.

I moved the older processing files into a sub-folder summer.2018.analysis/.

Raw reads are in the raw.reads folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion.

Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to my notes from 2018 and to the current Moving Pictures tutortial from the Qiime2 docs.

qiime tools import\
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /data/raw.reads \
  --output-path /data/16s.v4.demux.qza \
  --input-format 'CasavaOneEightSingleLanePerSampleDirFmt' 

qiime demux summarize \
  --i-data /data/16s.v4.demux.qza \
  --o-visualization /data/16s.v4.demux.qzv

Download the qzv visualization files to the local machine as follows.

scp nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/16s.v4.demux.qzv .

Drag and drop the qzv file into a web browser at https://view.qiime2.org/

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Denoise

The next step is dada2, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /data/16s.v4.demux.qza \
  --p-trunc-len-f 265 --p-trunc-len-r 160 \
  --p-trim-left-f 34  --p-trim-left-r 34 \
  --o-representative-sequences /data/16s.v4.rep.seqs.qza \
  --o-table /data/16s.v4.feature.table.qza \
  --o-denoising-stats /data/16s.v4.denoise.stats.qza \
  --p-n-threads 16
  
qiime metadata tabulate \
  --m-input-file /data/16s.v4.denoise.stats.qza \
  --o-visualization /data/16s.v4.denoise.stats.qzv

Then export the feature table to a more flexible format (tab-seperated values).

qiime tools export  \
  --input-path /data/16s.v4.feature.table.qza \
  --output-path /data

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v4.feature.table.tsv \
    --to-tsv

I ran the dada2 denoise process under a range of truncation parameters using a bash script to produce the feature.table.qza files and a summary tsv file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

the bash script

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

useful hints for bash scripting

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

These results from dada2 are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the dada2 denoise algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good.

These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads. For now I'll move ahead with both files.

The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the taxonomy.tsv file which is generated later.

The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record.

qiime feature-table tabulate-seqs \
  --i-data /data/16s.v4.maxOTUs.rep.seqs.qza \
  --o-visualization /data/16s.v4.maxOTUs.rep.seqs.qzv

qiime feature-table tabulate-seqs \
  --i-data /data/16s.v4.maxcts.rep.seqs.qza \
  --o-visualization /data/16s.v4.maxcts.rep.seqs.qzv

The feature-table summarize command provides summary statistics and groups features on metadata category.

qiime feature-table summarize \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --o-visualization /data/16s.v4.maxOTUs.feature.table.qzv \
  --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv

qiime feature-table summarize \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --o-visualization /data/16s.v4.maxcts.feature.table.qzv \
  --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv

Diversity analysis

Generate trees for phylogenetic diversity analyses.

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences /data/16s.v4.maxOTUs.rep.seqs.qza \
  --o-alignment /data/16s.v4.maxOTUs.aligned-rep-seqs.qza \
  --o-masked-alignment /data/16s.v4.maxOTUs.masked-aligned-rep-seqs.qza \
  --o-tree /data/16s.v4.maxOTUs.unrooted-tree.qza \
  --o-rooted-tree /data/16s.v4.maxOTUs.rooted-tree.qza

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences /data/16s.v4.maxcts.rep.seqs.qza \
  --o-alignment /data/16s.v4.maxcts.aligned-rep-seqs.qza \
  --o-masked-alignment /data/16s.v4.maxcts.masked-aligned-rep-seqs.qza \
  --o-tree /data/16s.v4.maxcts.unrooted-tree.qza \
  --o-rooted-tree /data/16s.v4.maxcts.rooted-tree.qza

Determine the optimal rarefaction level by looking at the feature.table.qzv visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny /data/16s.v4.maxOTUs.rooted-tree.qza \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --p-sampling-depth 6017 \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --output-dir /data/core-metrics-results.maxOTUs

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxOTUs/faith_pd_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxOTUs/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxOTUs/evenness_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxOTUs/evenness-group-significance.qzv
  
qiime diversity beta-group-significance \
  --i-distance-matrix /data/core-metrics-results.maxOTUs/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column hive_day \
  --o-visualization /data/core-metrics-results.maxOTUs/unweighted-unifrac-hive_day-significance.qzv \
  --p-pairwise
  
  
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny /data/16s.v4.maxcts.rooted-tree.qza \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --p-sampling-depth 6045 \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --output-dir /data/core-metrics-results.maxcts
  
qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxcts/faith_pd_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxcts/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxcts/evenness_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxcts/evenness-group-significance.qzv  
  
qiime diversity beta-group-significance \
  --i-distance-matrix /data/core-metrics-results.maxcts/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column hive_day \
  --o-visualization /data/core-metrics-results.maxcts/unweighted-unifrac-hive_day-significance.qzv \
  --p-pairwise

Qiime2’s core diversity metrics include:

Alpha diversity

Shannon’s diversity index (a quantitative measure of community richness)
Observed OTUs (a qualitative measure of community richness)
Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)
Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity

Jaccard distance (a qualitative measure of community dissimilarity)
Bray-Curtis distance (a quantitative measure of community dissimilarity)
unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

Feature classification

Naive Bayes classifiers have been built from the SILVA release 138 dataset, which was released in December 2019 and is currently the most recent major release. One was provided by the Qiime2 developers.

The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs.

cd /export/groups/drangeli/marker.ref.DBs/silva138/
wget "https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza"
cd /export/groups/drangeli/beeSeq.project/16s.v4v5/
cp ../../marker.ref.DBs/silva138/silva-138-99-nb-classifier.qza .

qiime feature-classifier classify-sklearn \
  --i-reads /data/16s.v4.maxOTUs.rep.seqs.qza \
  --i-classifier /data/silva-138-99-nb-classifier.qza \
  --o-classification /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --p-n-jobs -3

qiime metadata tabulate \
  --m-input-file /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --o-visualization /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qzv

qiime tools export  \
  --input-path /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --output-path /data/. 
mv taxonomy.tsv 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv

cut -f 3 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.976.

qiime taxa barplot \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --i-taxonomy /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/16s.v4.maxOTUs.silva138.taxa.barplot.qzv

qiime feature-classifier classify-sklearn \
  --i-reads /data/16s.v4.maxcts.rep.seqs.qza \
  --i-classifier /data/silva-138-99-nb-classifier.qza \
  --o-classification /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --p-n-jobs -3

qiime metadata tabulate \
  --m-input-file /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --o-visualization /data/16s.v4.maxcts.silva138.taxonomy.calls.qzv

qiime tools export  \
  --input-path /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --output-path /data/. 
mv taxonomy.tsv 16s.v4.maxcts.silva138.taxonomy.calls.tsv

cut -f 3 16s.v4.maxcts.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.984.

qiime taxa barplot \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --i-taxonomy /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/16s.v4.maxcts.silva138.taxa.barplot.qzv

Inspecting the most common OTUs, the classifications look sensible. Snodgrassella, Candidatus Schmidhempelia, and Gilliamella perdominate.

Differential abundance analysis using ANCOM

Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples.

qiime feature-table filter-samples \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --p-where "day='25' AND NOT treatment='F'" \
  --o-filtered-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza

Calculate pseudocounts (counts +1) to avoid log-of-zero errors.

qiime composition add-pseudocount \
  --i-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza \
  --o-composition-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza

ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.

qiime composition ancom \
  --i-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization /data/16s.v4.maxOTUs.d25AD.ancom.by.treatment.qzv

One feature is identified as significantly different: aa8b914828a465c360b7a079334746d8, classifed only to the genus level as Saccharibacter by our Bayesian classifier, but a BLASTn search of the sequence to NCBI has a 99% identity match to Saccharibacter floricola.

I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few.

16S v6-v8

I ran all the same steps as above using the 16s.v6v8 dataset too.

dada2 denoise optimization

This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis.

Rarefaction

Again, I chose sampling depths to retain all the day-25 samples in treatments A-D.

max OTUs: 4965
max counts: 6075

Classification

max OTUs: 0.9428 median posterior support for classifications
max counts: 0.969

The more I think about it, 16s v6-v8 filtered to maximize reads (16s.v6.maxcts) seems like the best dataset to proceed with.

ANCOM by treatment

One significantly different OTU: 699d78936fbf1cc18ab30cede7baad9e classified as Saccharibacter floricola with 0.756 posterior support.

Differential abundance analysis with gneiss

Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.

I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples.

grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp
head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv
grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv

The functinality of qiime gneiss appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway.

Gneiss can run based on trees created by clustering features based on their own correlations (correlation-clustering), or in relation to some numerical factor (gradient-clustering).

qiime gneiss correlation-clustering \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --o-clustering /data/16s.v6.maxcts.d25AD.hierarchy.qza
  
qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --output-path /data/.

mv tree.nwk 16s.v6.maxcts.d25AD.hierarchy.tre

qiime gneiss ilr-hierarchical \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --o-balances /data/16s.v6.maxcts.d25AD.balances.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.balances.qza \
  --output-path /data/.

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v6.maxcts.d25AD.gneiss.balances.tsv \
    --to-tsv

sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.balances.tsv

qiime gneiss dendrogram-heatmap \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.by.treatment.qzv

The Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e, highlighted by ANCOM, appears with 4 other features on one side of the y2 split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them.

feature UUID	BLAST result
d3286dc2d6be57a96881c9f6a43b2859	Citrobacter werkmanii; Klebsiella oxytoca; several uncultured bacteria
8fa34849d5dded61d203db8249907103	Citrobacter freundii strain RTC1; Enterobacter sp. TS4; Uncultured bacterium AMW_D6
8d625c592171b66b877523cc46ac3963	Klebsiella or Citrobacter
8026cfce27b83d70eb073a10947f3f94	Klebsiella or Citrobacter

This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained Saccharibacter sp. and 6 unidentified Enterobacteriaceae.

I also tried the gradient clustering method.

qiime gneiss gradient-clustering \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --m-gradient-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-gradient-column dose_ppb \
  --o-clustering /data/16s.v6.maxcts.d25AD.grad.clust.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --output-path /data/.

mv tree.nwk 16s.v6.maxcts.d25AD.grad.clust.tre
  
qiime gneiss ilr-hierarchical \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --o-balances /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza \
  --output-path /data/.

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv \
    --to-tsv

sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv

qiime gneiss dendrogram-heatmap \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.grad.by.dose.qzv

From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e, highlighted by ANCOM, appears with several other features around the y9 split. 8fa34849d5dded61d203db8249907103 from the correlation gneiss analysis also appears here.

feature UUID	Classification or BLAST result
1a196db6e5bdd828fec6dd3e5dcbef1c	Lactobacillus bombicola
f1459b5093b227402d67fe3d593d4e44	Lactobacillus bombicola
48a57dc7a4e32a52c0ed2aac00d97c14	Bifidobacterium bohemicum
595d46a49f01ca56f91b8947ff9cc272	Bifidobacterium bohemicum
8026cfce27b83d70eb073a10947f3f94	Klebsiella or Citrobacter
0eb3327ed1fcfec6377770463795cd5f	Asaia

Songbird

Songbird is an algorithm to rank features based on their log-fold differences in abundance with respect to various metadata factors (Morton et al. 2019). It allows for complex multinomial regression models with continuous and categorical factors. The order of factors n the model does not matter, except that differentials will be reported for the first factor, controlling for influences of the later factors. (This is similar to the logic of a linear mixed model, where the first term is a fixed effect and other terms are random effects.)

For reasons that remain unclear, Songbird is only implemented in Qiime2 versions 2019.7 through 2020.6. Randy Downer has set-up a separate container for version 2020.6. While this is an implementation of Songbird that accompanied Qiime2, it's actually the stand-alone version of Songbird I'll be calling. As before, I'll define an alias to call the singularity container with Songbird.

alias songbird singularity run --bind "$PWD":/data /export/groups/singularity/qiime_songbird_202006.simg songbird

Because we'll be running this program outside the Qiime2 environment, we need to export the feature table file from Qiime2's qza format into the biom format expected by Songbird.

qiime tools export  \
  --input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --output-path /data

mv feature-table.biom 16s.v6.maxcts.d25AD.feature.table.biom

I'll try the obvious analysis first: looking for dose effects, controlling for the influence of hive.

songbird multinomial \
  --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
  --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --formula "dose_ppb+hive" \
  --epochs 10000 \
  --differential-prior 1 \
  --summary-interval 1 \
  --summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose.hive

Compare this against a null model.

songbird multinomial \
  --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
  --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --formula "1" \
  --epochs 10000 \
  --differential-prior 1 \
  --summary-interval 1 \
  --summary-dir /data/songbird.16s.v6.maxcts.d25AD.null

These runs completed in just a few minutes. To interpret the fit of the models, it's necessary to use TensorBoard to view the machine learning logs, which are sadly kept in binary outfield that can't otherwise be accessed.

TensorBoard is run via Jupyter notebooks. Randy did a lot of trouble-shooting to figure out how to do this. First log into https://jupyter.colby.edu/. In the upper right select New, then terminal. A terminal window will open in a new tab, within JupyterHub. Create a link to the directory containing the Songbird output. For example, ln -s /research/drangeli research. Go back to the main JupyterHub tab and refresh. An item called research (in this example) should now appear. Clicking on it will provide the contents of /research/drangeli/ in JupyterHub. Click through the directory structure util you're in the Songbird output folder. (JupyterHub can't access /export/groups/. So copy any work from there to /research/)

Next, create a new python3 notebook, and enter the following code. (The % characters are required.)

%load_ext tensorboard
%env TENSORBOARD_BINARY=/usr/local/anaconda/bin/tensorboard

Run that and create a second code block.

%tensorboard --logdir . --bind_all --path_prefix /tensorboard

Running this will give a blank image. Run it again. This time the error message will include a port number. Open a new browser tab to http://jupyter.colby.edu:<portnumber>/tensorboard It should produce a page looking like this.

Based on the Songbird docs for interpreting model fits both the cross validation accuracy (cv_error) and loss plots should decay exponentially and plateau. Loss is the squared difference of true values from predicted ones (square(y_true - y_pred)) and should ideally approach zero. And a good model should outperform the null model!

In TensorBoard, it's possible to download these logged values as a CSV file.

It's arguable whether the hive factor is necessary. Let's run another model without it.

songbird multinomial \
  --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
  --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --formula "dose_ppb" \
  --epochs 10000 \
  --differential-prior 1 \
  --summary-interval 1 \
  --summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose

Our metadata has several other factors. Since songbird doesn't take too long to run, I'll try more elaborate models to see if I run into overfitting.

head -n 1 metadata.beeSeq.microbiota.d25AD.tsv

#SampleID       treatment       dose_ppb        treatment_day   hive_day        hive    collection_rep  plate   row     column well     day     RNAconc RNAabsratio     DNAconc DNAabsratio     description

In these images I've overlaid the stable-phase means, judged by eye. The dose+hive model appears to be the best. It's accuracy and loss look indistinguishable from a model adding plate, a batch effect referring to the 96-well plate a DNA samples was stored and shipped on. Adding DNAabsratio, the DNA quality metric of A₂₆₀/A₂₈₀, has clearly resulted in overfitting, as indicated by the rising values of cv_error over time and the greater range in values for loss. Just using dose provides a small but noticeable improvement over the null model, by about 5 units of cv_error. This implies that hive actually has a much larger effect on microbial composition than does imidacloprid dose.

Here are some other models I tried.

The factor hive alone provides a very good explanation of the composition. Adding dose or treatment might improve the model just slightly, but of course it's what we're most interested in. There appears to be little difference between using dose or treatment, which is just a categorical version of the dose factor.

I started wondering about the fact that because hives are nested within dose treatments, much of the variance in composition may be captured by hive rather than a factor conveying dose. Songbird is not like a Type I ANOVA, where variance is partitioned to the first factor in the model, and later factors account for remaining variance. So, I tried two categorical, binary factors dosed, which separates the control treatment from all dosed treatments, and high_dose which distinguishes the highest-dose treatment from all others, including the zero-dose control. These models may be marginal improvements over the dose_pbb model, but clearly hive needs to be accounted for.

While the Songbird docs don't discuss complex models, I simply tried using hive as a nested factor and found it works. Running the nested model longer and with a lower differential prior did not change the values at stability. (We 4 hives per treatment, so there are 15 hive terms (plus one for dose and an intercept). There are 5 individual samples per hive, and that's not the 10-to-1 ratio of samples to model terms that is recommended. So while the metrics look okay, I am concerned about overfitting.)

Next, we can examine the differentials. Here they are from the dose + hive model (10,000 epochs; differential prior of 1).

The plot on the left shows differentials for 32 OTUs based on imidacloprid dose in a model accounting for hive-specific influences. Several Lactobacillus are negatively associated with dose. Several Enterobacter species are positively associated. Saccharibacter floricola, identified by ANCOM and Gneiss as correlated with imidacloprid dose, does not emerge as strongly affected in this analysis. Using only dose as the predictor of abundance identified the same OTUs at the extremes, but there's more variation in the intermediate OTU ranks.

Next is the differential plot from the nested model (100,000 epochs; differential prior of 0.5).

Conclusions

It's clear that Songbird detects very little signal associated with treatment. It may be the case that the signal here just isn't strong enough to use Songbird for this purpose. It may be better to highlight the findings of ANCOM and Gneiss, both of which highlight Saccharibacter floricola as differently expressed.

It may also be useful to take the feature table into R, and look at composition at a higher taxonomic level, such as order or phylum.

Resuming work in August 2022

It's been a while, working on other projects. Reviewing everything here, I'm satisfied that this previous work is still valid. I will proceed as suggested above.

Count sums at higher taxonomic levels

In order to look at higher taxonomic levels, it will be necessary to sum OTU counts based on their taxonomic classifications. This will be easiest in R. First, the feature table and classifications can be exported from the Qiime data files.

cd /export/groups/drangeli/beeSeq.project/16s.v6v8

screen -S beeSeq

alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime

qiime tools export  \
  --input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --output-path /data

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v6.maxcts.d25AD.feature.table.tsv \
    --to-tsv

The files 16s.v6.maxcts.d25AD.feature.table.tsv and 16s.v6.maxcts.silva138.taxonomy.calls.tsv were copied locally for import into R.

The script for this analysis is counts.by.taxonomic.level.R. Counts were normalized by the depth (colSums) for each sample. I checked all taxonomic levels. There were no dramatic patterns at higher taxonomic levels that weren't driven strongly by a few obvious species-level OTUs, with the exception of Enterobacteriaceae, where several species were all negatively affected by imidacloprid.

I check the trends for each of the OTUs highlighted by Songbird. That algorithm is largely influenced by non-zero values. So, many of those didn't really seem different enough. In addition to the OTUs highlighted by ANCOM, Gneiss and Songbird, I picked out some others that seemed to have trends. Most of them are negatively dose-dependent. Saccharibacter floricola definitely shows the most dramatic dose-dependent effect.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

I also looked for correlations among the normalized counts for each of these taxa, using Kendall's rank correlation, which should be robust to zero-inflated data such as these. The highest correlations were among the Fructobacillus variants, suggesting that the split of these into separate taxa is not particularly useful. However, the two Snodgrassella strains have a very strong negative correlation, which is interesting, but perhaps beside the point of our experiment. Saccharibacter floricola was most strongly correlated with several of the unclassified Enterobacteriaceae, Fructobacillis, and Pseudomonas, and negatively correlated with Asaia, Lactobacillis bombicola, and the uncultured Bifidobacterium.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Relative abundance of OTUs-of-interest over time and by treatment

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

A note on using screen

Data source

Using Singularity

Read processing

Denoise

Diversity analysis

Feature classification

Differential abundance analysis using ANCOM

16S v6-v8

dada2 denoise optimization

Rarefaction

Classification

ANCOM by treatment

Differential abundance analysis with gneiss

Songbird

Conclusions

Count sums at higher taxonomic levels

Relative abundance of OTUs-of-interest over time and by treatment

Read more

Summer 2025 Lab Goals

Micro-CT scanning

RNA Synthesis

Reading List

A note on using `screen`