Before attempting this analysis, be sure you are comfortable working in linux on a HPC cluster. Review the linux tutorial if necessary.
Some other useful resources for microbiome analysis will be the Qiime2 Moving Pictures Tutorial (versions: 2021.4, 2023.5) and my analysis of Lyndell's cownose ray gut contents.
screen
Anytime you'll run a command in linux that might take a while to complete or otherwise be problematic, it's useful to run screen
. Starting a screen is like creating a new instance of your linux session, nested within the original one. To start it just type screen
. The screen will clear, and you'll see the same prompt. Here you can run some command and while it's running you can return to the original session. Type [control]-[A] and then [D] to "disconnect" from the current screen. Now you're back where you started. To recover the screen again, enter screen -r
. If you have multiple screens running at the same time, you'll be prompted to specify the one you want.
There are several suitable screens on:
145901.pts-1.n26 (Detached)
145764.pts-1.n26 (Detached)
Type "screen [-d] -r [pid.]tty.host" to resume one of them.
In this example we might return to the first one on the list by entering screen -r 145901
You can also start a screen with a name you specify as in screen -S microbiome.analysis
Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night.
To end a screen session, just type exit
from within it. It's good practice to clean-up screens when you're done with them. Also avoid creating "nested screens". Don't run screen
when you're already in a screen. That can get confusing, because [control]-[A] [D] will return you to the "base" session.
In 2017, our lab collected 168 Bombus workers from across Maine and isolated DNA in order to assess their gut microbiomes. Samples were prepped for 16S and 18S targetted libraries at the Dalhousie University Integrated Micorbiome Resource (IMR).
The path to the raw data on nscc is /export/groups/drangeli/Bombus.landscape/
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the IMR from amplicons targetting 16S v4-v5.
GTGYCAGCMGCCGCGGTAA
CCGYCAATTYMTTTRAGTTT
Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. LinkImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. LinkImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
And 16S v6-v8.
ACGCGHNRAACCTTACC
ACGGGCRGTGWGTRCAA
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. LinkImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Note that the singularity container doesn't initially have access to the folder you're working in, e.g. /export/groups/drangeli
. The data folder must be "bound" within the container using the --bind
arguement. This creates a folder at the root of the container, /data
, that contains the data in the working directoty (that is, the output of pwd
, such as /export/groups/drangeli/beeSeq.project/16s.v4v5/
. The /data
folder only exists within the container. So paths in the qiime
call refer to files using the /data/
path.
As of summer 2023, singularity seems unable to bind a specific path. The --bind
arguement seems to have no effect. Instead singularity defaults to binding the user's home folder. While this is inconvenient, it means you simply need to work in that folder. So, the modified alias might be alias qiime singularity run /export/groups/singularity/qiime_202104.simg qiime
I'm hoping to find a better solution soon!
Start by running screen
. Then define an alias to run Qiime2 via singularity. The alias will stay active as long as the screen instance is maintained.
alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime
Get out of the screen (and leave it running) by typing the keys CTRL
+A
, then D
. When you're finally done using the screen, enter the command exit
.
More on Singularity and data mountingImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument--bind "$PWD":/export/home/drangeli
will replace the starting folder in the container with the contents of the current directory outside the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit\data\
binding because it seems safer.
I promise to clean this up later!
For the 16S v4-v5 reads, I'm working in nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/
. All the steps will be mirrored for the v6-v8 data in ../16s.v6v8/
.
I moved the older processing files into a sub-folder summer.2018.analysis/
.
Raw reads are in the raw.reads
folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion.
Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to my notes from 2018 and to the current Moving Pictures tutortial from the Qiime2 docs.
qiime tools import\
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /data/raw.reads \
--output-path /data/16s.v4.demux.qza \
--input-format 'CasavaOneEightSingleLanePerSampleDirFmt'
qiime demux summarize \
--i-data /data/16s.v4.demux.qza \
--o-visualization /data/16s.v4.demux.qzv
Download the qzv
visualization files to the local machine as follows.
scp nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/16s.v4.demux.qzv .
Drag and drop the qzv
file into a web browser at https://view.qiime2.org/
The next step is dada2
, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs /data/16s.v4.demux.qza \
--p-trunc-len-f 265 --p-trunc-len-r 160 \
--p-trim-left-f 34 --p-trim-left-r 34 \
--o-representative-sequences /data/16s.v4.rep.seqs.qza \
--o-table /data/16s.v4.feature.table.qza \
--o-denoising-stats /data/16s.v4.denoise.stats.qza \
--p-n-threads 16
qiime metadata tabulate \
--m-input-file /data/16s.v4.denoise.stats.qza \
--o-visualization /data/16s.v4.denoise.stats.qzv
Then export the feature table to a more flexible format (tab-seperated values).
qiime tools export \
--input-path /data/16s.v4.feature.table.qza \
--output-path /data
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v4.feature.table.tsv \
--to-tsv
I ran the dada2 denoise
process under a range of truncation parameters using a bash script to produce the feature.table.qza
files and a summary tsv
file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters.
the bash scriptImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
useful hints for bash scriptingImage Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
These results from dada2
are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the dada2 denoise
algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good.
These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better.
I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads. For now I'll move ahead with both files.Image Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the taxonomy.tsv
file which is generated later.
The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record.
qiime feature-table tabulate-seqs \
--i-data /data/16s.v4.maxOTUs.rep.seqs.qza \
--o-visualization /data/16s.v4.maxOTUs.rep.seqs.qzv
qiime feature-table tabulate-seqs \
--i-data /data/16s.v4.maxcts.rep.seqs.qza \
--o-visualization /data/16s.v4.maxcts.rep.seqs.qzv
The feature-table summarize
command provides summary statistics and groups features on metadata category.
qiime feature-table summarize \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--o-visualization /data/16s.v4.maxOTUs.feature.table.qzv \
--m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv
qiime feature-table summarize \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--o-visualization /data/16s.v4.maxcts.feature.table.qzv \
--m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv
Generate trees for phylogenetic diversity analyses.
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences /data/16s.v4.maxOTUs.rep.seqs.qza \
--o-alignment /data/16s.v4.maxOTUs.aligned-rep-seqs.qza \
--o-masked-alignment /data/16s.v4.maxOTUs.masked-aligned-rep-seqs.qza \
--o-tree /data/16s.v4.maxOTUs.unrooted-tree.qza \
--o-rooted-tree /data/16s.v4.maxOTUs.rooted-tree.qza
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences /data/16s.v4.maxcts.rep.seqs.qza \
--o-alignment /data/16s.v4.maxcts.aligned-rep-seqs.qza \
--o-masked-alignment /data/16s.v4.maxcts.masked-aligned-rep-seqs.qza \
--o-tree /data/16s.v4.maxcts.unrooted-tree.qza \
--o-rooted-tree /data/16s.v4.maxcts.rooted-tree.qza
Determine the optimal rarefaction level by looking at the feature.table.qzv
visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D.
qiime diversity core-metrics-phylogenetic \
--i-phylogeny /data/16s.v4.maxOTUs.rooted-tree.qza \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--p-sampling-depth 6017 \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--output-dir /data/core-metrics-results.maxOTUs
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxOTUs/faith_pd_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxOTUs/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxOTUs/evenness_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxOTUs/evenness-group-significance.qzv
qiime diversity beta-group-significance \
--i-distance-matrix /data/core-metrics-results.maxOTUs/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column hive_day \
--o-visualization /data/core-metrics-results.maxOTUs/unweighted-unifrac-hive_day-significance.qzv \
--p-pairwise
qiime diversity core-metrics-phylogenetic \
--i-phylogeny /data/16s.v4.maxcts.rooted-tree.qza \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--p-sampling-depth 6045 \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--output-dir /data/core-metrics-results.maxcts
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxcts/faith_pd_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxcts/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxcts/evenness_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxcts/evenness-group-significance.qzv
qiime diversity beta-group-significance \
--i-distance-matrix /data/core-metrics-results.maxcts/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column hive_day \
--o-visualization /data/core-metrics-results.maxcts/unweighted-unifrac-hive_day-significance.qzv \
--p-pairwise
Qiime2’s core diversity metrics include:
Alpha diversity
Beta diversity
Naive Bayes classifiers have been built from the SILVA release 138 dataset, which was released in December 2019 and is currently the most recent major release. One was provided by the Qiime2 developers.
The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs.
cd /export/groups/drangeli/marker.ref.DBs/silva138/
wget "https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza"
cd /export/groups/drangeli/beeSeq.project/16s.v4v5/
cp ../../marker.ref.DBs/silva138/silva-138-99-nb-classifier.qza .
qiime feature-classifier classify-sklearn \
--i-reads /data/16s.v4.maxOTUs.rep.seqs.qza \
--i-classifier /data/silva-138-99-nb-classifier.qza \
--o-classification /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--p-n-jobs -3
qiime metadata tabulate \
--m-input-file /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--o-visualization /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qzv
qiime tools export \
--input-path /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--output-path /data/.
mv taxonomy.tsv 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv
cut -f 3 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk
The median posterior support for taxonomy assignments is 0.976.
qiime taxa barplot \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--i-taxonomy /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/16s.v4.maxOTUs.silva138.taxa.barplot.qzv
qiime feature-classifier classify-sklearn \
--i-reads /data/16s.v4.maxcts.rep.seqs.qza \
--i-classifier /data/silva-138-99-nb-classifier.qza \
--o-classification /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--p-n-jobs -3
qiime metadata tabulate \
--m-input-file /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--o-visualization /data/16s.v4.maxcts.silva138.taxonomy.calls.qzv
qiime tools export \
--input-path /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--output-path /data/.
mv taxonomy.tsv 16s.v4.maxcts.silva138.taxonomy.calls.tsv
cut -f 3 16s.v4.maxcts.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk
The median posterior support for taxonomy assignments is 0.984.
qiime taxa barplot \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--i-taxonomy /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/16s.v4.maxcts.silva138.taxa.barplot.qzv
Inspecting the most common OTUs, the classifications look sensible. Snodgrassella, Candidatus Schmidhempelia, and Gilliamella perdominate.
Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples.
qiime feature-table filter-samples \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--p-where "day='25' AND NOT treatment='F'" \
--o-filtered-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
qiime composition add-pseudocount \
--i-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza \
--o-composition-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza
ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.
qiime composition ancom \
--i-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column treatment \
--o-visualization /data/16s.v4.maxOTUs.d25AD.ancom.by.treatment.qzv
One feature is identified as significantly different: aa8b914828a465c360b7a079334746d8
, classifed only to the genus level as Saccharibacter by our Bayesian classifier, but a BLASTn search of the sequence to NCBI has a 99% identity match to Saccharibacter floricola.
I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few.
I ran all the same steps as above using the 16s.v6v8
dataset too.
This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis.
Again, I chose sampling depths to retain all the day-25 samples in treatments A-D.
The more I think about it, 16s v6-v8 filtered to maximize reads (16s.v6.maxcts
) seems like the best dataset to proceed with.
One significantly different OTU: 699d78936fbf1cc18ab30cede7baad9e
classified as Saccharibacter floricola with 0.756 posterior support.
Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss
algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.
I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples.
grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp
head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv
grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv
The functinality of qiime gneiss
appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway.
Gneiss can run based on trees created by clustering features based on their own correlations (correlation-clustering
), or in relation to some numerical factor (gradient-clustering
).
qiime gneiss correlation-clustering \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--o-clustering /data/16s.v6.maxcts.d25AD.hierarchy.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--output-path /data/.
mv tree.nwk 16s.v6.maxcts.d25AD.hierarchy.tre
qiime gneiss ilr-hierarchical \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--o-balances /data/16s.v6.maxcts.d25AD.balances.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.balances.qza \
--output-path /data/.
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v6.maxcts.d25AD.gneiss.balances.tsv \
--to-tsv
sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.balances.tsv
qiime gneiss dendrogram-heatmap \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-metadata-column treatment \
--p-color-map seismic \
--o-visualization /data/16s.v6.maxcts.d25AD.gneiss.by.treatment.qzv
The Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e
, highlighted by ANCOM, appears with 4 other features on one side of the y2
split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them.
feature UUID | BLAST result |
---|---|
d3286dc2d6be57a96881c9f6a43b2859 | Citrobacter werkmanii; Klebsiella oxytoca; several uncultured bacteria |
8fa34849d5dded61d203db8249907103 | Citrobacter freundii strain RTC1; Enterobacter sp. TS4; Uncultured bacterium AMW_D6 |
8d625c592171b66b877523cc46ac3963 | Klebsiella or Citrobacter |
8026cfce27b83d70eb073a10947f3f94 | Klebsiella or Citrobacter |
This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained Saccharibacter sp. and 6 unidentified Enterobacteriaceae.
I also tried the gradient clustering method.
qiime gneiss gradient-clustering \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--m-gradient-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-gradient-column dose_ppb \
--o-clustering /data/16s.v6.maxcts.d25AD.grad.clust.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--output-path /data/.
mv tree.nwk 16s.v6.maxcts.d25AD.grad.clust.tre
qiime gneiss ilr-hierarchical \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--o-balances /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza \
--output-path /data/.
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv \
--to-tsv
sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv
qiime gneiss dendrogram-heatmap \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-metadata-column treatment \
--p-color-map seismic \
--o-visualization /data/16s.v6.maxcts.d25AD.gneiss.grad.by.dose.qzv
From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e
, highlighted by ANCOM, appears with several other features around the y9
split. 8fa34849d5dded61d203db8249907103
from the correlation gneiss analysis also appears here.
feature UUID | Classification or BLAST result |
---|---|
1a196db6e5bdd828fec6dd3e5dcbef1c | Lactobacillus bombicola |
f1459b5093b227402d67fe3d593d4e44 | Lactobacillus bombicola |
48a57dc7a4e32a52c0ed2aac00d97c14 | Bifidobacterium bohemicum |
595d46a49f01ca56f91b8947ff9cc272 | Bifidobacterium bohemicum |
8026cfce27b83d70eb073a10947f3f94 | Klebsiella or Citrobacter |
0eb3327ed1fcfec6377770463795cd5f | Asaia |
Songbird is an algorithm to rank features based on their log-fold differences in abundance with respect to various metadata factors (Morton et al. 2019). It allows for complex multinomial regression models with continuous and categorical factors. The order of factors n the model does not matter, except that differentials will be reported for the first factor, controlling for influences of the later factors. (This is similar to the logic of a linear mixed model, where the first term is a fixed effect and other terms are random effects.)
For reasons that remain unclear, Songbird is only implemented in Qiime2 versions 2019.7 through 2020.6. Randy Downer has set-up a separate container for version 2020.6. While this is an implementation of Songbird that accompanied Qiime2, it's actually the stand-alone version of Songbird I'll be calling. As before, I'll define an alias to call the singularity container with Songbird.
alias songbird singularity run --bind "$PWD":/data /export/groups/singularity/qiime_songbird_202006.simg songbird
Because we'll be running this program outside the Qiime2 environment, we need to export the feature table file from Qiime2's qza
format into the biom
format expected by Songbird.
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \
--output-path /data
mv feature-table.biom 16s.v6.maxcts.d25AD.feature.table.biom
I'll try the obvious analysis first: looking for dose effects, controlling for the influence of hive.
songbird multinomial \
--input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
--metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--formula "dose_ppb+hive" \
--epochs 10000 \
--differential-prior 1 \
--summary-interval 1 \
--summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose.hive
Compare this against a null model.
songbird multinomial \
--input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
--metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--formula "1" \
--epochs 10000 \
--differential-prior 1 \
--summary-interval 1 \
--summary-dir /data/songbird.16s.v6.maxcts.d25AD.null
These runs completed in just a few minutes. To interpret the fit of the models, it's necessary to use TensorBoard to view the machine learning logs, which are sadly kept in binary outfield that can't otherwise be accessed.
TensorBoard is run via Jupyter notebooks. Randy did a lot of trouble-shooting to figure out how to do this. First log into https://jupyter.colby.edu/. In the upper right select New
, then terminal
. A terminal window will open in a new tab, within JupyterHub. Create a link to the directory containing the Songbird output. For example, ln -s /research/drangeli research
. Go back to the main JupyterHub tab and refresh. An item called research
(in this example) should now appear. Clicking on it will provide the contents of /research/drangeli/
in JupyterHub. Click through the directory structure util you're in the Songbird output folder. (JupyterHub can't access /export/groups/
. So copy any work from there to /research/
)
Next, create a new python3 notebook, and enter the following code. (The %
characters are required.)
%load_ext tensorboard
%env TENSORBOARD_BINARY=/usr/local/anaconda/bin/tensorboard
Run that and create a second code block.
%tensorboard --logdir . --bind_all --path_prefix /tensorboard
Running this will give a blank image. Run it again. This time the error message will include a port number. Open a new browser tab to http://jupyter.colby.edu:<portnumber>/tensorboard
It should produce a page looking like this.
Based on the Songbird docs for interpreting model fits both the cross validation accuracy (cv_error
) and loss plots should decay exponentially and plateau. Loss is the squared difference of true values from predicted ones (square(y_true - y_pred)
) and should ideally approach zero. And a good model should outperform the null model!
In TensorBoard, it's possible to download these logged values as a CSV file.
It's arguable whether the hive
factor is necessary. Let's run another model without it.
songbird multinomial \
--input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \
--metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--formula "dose_ppb" \
--epochs 10000 \
--differential-prior 1 \
--summary-interval 1 \
--summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose
Our metadata has several other factors. Since songbird doesn't take too long to run, I'll try more elaborate models to see if I run into overfitting.
head -n 1 metadata.beeSeq.microbiota.d25AD.tsv
#SampleID treatment dose_ppb treatment_day hive_day hive collection_rep plate row column well day RNAconc RNAabsratio DNAconc DNAabsratio description
In these images I've overlaid the stable-phase means, judged by eye. The dose+hive
model appears to be the best. It's accuracy and loss look indistinguishable from a model adding plate
, a batch effect referring to the 96-well plate a DNA samples was stored and shipped on. Adding DNAabsratio
, the DNA quality metric of A260/A280, has clearly resulted in overfitting, as indicated by the rising values of cv_error over time and the greater range in values for loss. Just using dose
provides a small but noticeable improvement over the null model, by about 5 units of cv_error. This implies that hive
actually has a much larger effect on microbial composition than does imidacloprid dose.
Here are some other models I tried.
The factor hive
alone provides a very good explanation of the composition. Adding dose
or treatment
might improve the model just slightly, but of course it's what we're most interested in. There appears to be little difference between using dose
or treatment
, which is just a categorical version of the dose factor.
I started wondering about the fact that because hives are nested within dose treatments, much of the variance in composition may be captured by hive
rather than a factor conveying dose. Songbird is not like a Type I ANOVA, where variance is partitioned to the first factor in the model, and later factors account for remaining variance. So, I tried two categorical, binary factors dosed
, which separates the control treatment from all dosed treatments, and high_dose
which distinguishes the highest-dose treatment from all others, including the zero-dose control. These models may be marginal improvements over the dose_pbb
model, but clearly hive
needs to be accounted for.
While the Songbird docs don't discuss complex models, I simply tried using hive
as a nested factor and found it works. Running the nested model longer and with a lower differential prior did not change the values at stability. (We 4 hives per treatment, so there are 15 hive terms (plus one for dose and an intercept). There are 5 individual samples per hive, and that's not the 10-to-1 ratio of samples to model terms that is recommended. So while the metrics look okay, I am concerned about overfitting.)
Next, we can examine the differentials. Here they are from the dose + hive
model (10,000 epochs; differential prior of 1).
The plot on the left shows differentials for 32 OTUs based on imidacloprid dose in a model accounting for hive-specific influences. Several Lactobacillus are negatively associated with dose. Several Enterobacter species are positively associated. Saccharibacter floricola, identified by ANCOM and Gneiss as correlated with imidacloprid dose, does not emerge as strongly affected in this analysis. Using only dose as the predictor of abundance identified the same OTUs at the extremes, but there's more variation in the intermediate OTU ranks.
Next is the differential plot from the nested model (100,000 epochs; differential prior of 0.5).
It's clear that Songbird detects very little signal associated with treatment. It may be the case that the signal here just isn't strong enough to use Songbird for this purpose. It may be better to highlight the findings of ANCOM and Gneiss, both of which highlight Saccharibacter floricola as differently expressed.
It may also be useful to take the feature table into R, and look at composition at a higher taxonomic level, such as order or phylum.
Resuming work in August 2022
It's been a while, working on other projects. Reviewing everything here, I'm satisfied that this previous work is still valid. I will proceed as suggested above.
In order to look at higher taxonomic levels, it will be necessary to sum OTU counts based on their taxonomic classifications. This will be easiest in R. First, the feature table and classifications can be exported from the Qiime data files.
cd /export/groups/drangeli/beeSeq.project/16s.v6v8
screen -S beeSeq
alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \
--output-path /data
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v6.maxcts.d25AD.feature.table.tsv \
--to-tsv
The files 16s.v6.maxcts.d25AD.feature.table.tsv
and 16s.v6.maxcts.silva138.taxonomy.calls.tsv
were copied locally for import into R.
The script for this analysis is counts.by.taxonomic.level.R
. Counts were normalized by the depth (colSums
) for each sample. I checked all taxonomic levels. There were no dramatic patterns at higher taxonomic levels that weren't driven strongly by a few obvious species-level OTUs, with the exception of Enterobacteriaceae, where several species were all negatively affected by imidacloprid.
I check the trends for each of the OTUs highlighted by Songbird. That algorithm is largely influenced by non-zero values. So, many of those didn't really seem different enough. In addition to the OTUs highlighted by ANCOM, Gneiss and Songbird, I picked out some others that seemed to have trends. Most of them are negatively dose-dependent. Saccharibacter floricola definitely shows the most dramatic dose-dependent effect.
I also looked for correlations among the normalized counts for each of these taxa, using Kendall's rank correlation, which should be robust to zero-inflated data such as these. The highest correlations were among the Fructobacillus variants, suggesting that the split of these into separate taxa is not particularly useful. However, the two Snodgrassella strains have a very strong negative correlation, which is interesting, but perhaps beside the point of our experiment. Saccharibacter floricola was most strongly correlated with several of the unclassified Enterobacteriaceae, Fructobacillis, and Pseudomonas, and negatively correlated with Asaia, Lactobacillis bombicola, and the uncultured Bifidobacterium.