> [Main page](https://hackmd.io/s/BJ4FiV_W7) > [Enterotype Analysis](https://hackmd.io/riUEPiNuRKC48BcsMBBAPA?view) Before attempting this analysis, be sure you are comfortable working in linux on a HPC cluster. Review the [linux tutorial](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ) if necessary. Some other useful resources for microbiome analysis will be the Qiime2 Moving Pictures Tutorial (versions: [2021.4](https://docs.qiime2.org/2021.4/tutorials/moving-pictures/), [2023.5](https://docs.qiime2.org/2023.5/tutorials/moving-pictures/)) and my analysis of [Lyndell's cownose ray gut contents](https://hackmd.io/s/Skp1GUtb7). #### A note on using `screen` Anytime you'll run a command in linux that might take a while to complete or otherwise be problematic, it's useful to run `screen`. Starting a screen is like creating a new instance of your linux session, nested within the original one. To start it just type `screen`. The screen will clear, and you'll see the same prompt. Here you can run some command and while it's running you can return to the original session. Type [control]-[A] and then [D] to "disconnect" from the current screen. Now you're back where you started. To recover the screen again, enter `screen -r`. If you have multiple screens running at the same time, you'll be prompted to specify the one you want. ```bash There are several suitable screens on: 145901.pts-1.n26 (Detached) 145764.pts-1.n26 (Detached) Type "screen [-d] -r [pid.]tty.host" to resume one of them. ``` In this example we might return to the first one on the list by entering `screen -r 145901` You can also start a screen with a name you specify as in `screen -S microbiome.analysis` Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night. To end a screen session, just type `exit` from within it. It's good practice to clean-up screens when you're done with them. Also avoid creating "nested screens". Don't run `screen` when you're already in a screen. That can get confusing, because [control]-[A] [D] will return you to the "base" session. ## Data source In 2017, our lab collected 168 *Bombus* workers from across Maine and isolated DNA in order to assess their gut microbiomes. Samples were prepped for 16S and 18S targetted libraries at the [Dalhousie University Integrated Micorbiome Resource (IMR)](http://cgeb-imr.ca/index.html). The path to the raw data on **nscc** is `/export/groups/drangeli/Bombus.landscape/` Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the [IMR](http://cgeb-imr.ca/index.html) from amplicons targetting 16S v4-v5. * 515F `GTGYCAGCMGCCGCGGTAA` * 926R `CCGYCAATTYMTTTRAGTTT` * The product is **410 bp**, resulting in a 190 bp overlap. > :information_source: Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. *Environ Microbiol* 18(5):1403-14. [Link](http://dx.doi.org/10.1111/1462-2920.13023) > > :information_source: Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. *mSystems* 1(1):e00009-15. [Link](https://msystems.asm.org/content/1/1/e00009-5) And 16S v6-v8. * B969F `ACGCGHNRAACCTTACC` * BA1406R `ACGGGCRGTGWGTRCAA` * The product is **450 bp**, resulting in a 150 bp overlap. > :information_source: Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. *PLoS ONE* 6(11): e27492. [Link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0027492) ## Using Singularity Note that the singularity container doesn't initially have access to the folder you're working in, e.g. `/export/groups/drangeli`. The data folder must be "bound" within the container using the `--bind` arguement. This creates a folder at the root of the container, `/data`, that contains the data in the working directoty (that is, the output of `pwd`, such as `/export/groups/drangeli/beeSeq.project/16s.v4v5/`. The `/data` folder only exists within the container. So paths in the `qiime` call refer to files using the `/data/` path. :::warning As of summer 2023, singularity seems unable to bind a specific path. The `--bind` arguement seems to have no effect. Instead singularity defaults to binding the user's home folder. While this is inconvenient, it means you simply need to work in that folder. So, the modified alias might be `alias qiime singularity run /export/groups/singularity/qiime_202104.simg qiime` I'm hoping to find a better solution soon! ::: Start by running `screen`. Then define an alias to run [Qiime2](https://docs.qiime2.org/) via [singularity](https://sylabs.io/guides/3.5/user-guide/introduction.html). The alias will stay active as long as the screen instance is maintained. ```bash alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime ``` Get out of the screen (and leave it running) by typing the keys `CTRL`+`A`, then `D`. When you're finally done using the screen, enter the command `exit`. > :warning: **More on Singularity and data mounting** > There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument `--bind "$PWD":/export/home/drangeli` will replace the starting folder in the container with the contents of the current directory *outside* the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit `\data\` binding because it seems safer. ::: danger I promise to clean this up later! ::: ## Read processing For the 16S v4-v5 reads, I'm working in `nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/`. All the steps will be mirrored for the v6-v8 data in `../16s.v6v8/`. I moved the older processing files into a sub-folder `summer.2018.analysis/`. Raw reads are in the `raw.reads` folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion. Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to [my notes from 2018](https://hackmd.io/ws-zHwGKSYicgUyU9zl-Qg?both#Data-processing) and to [the current *Moving Pictures* tutortial from the Qiime2 docs](https://docs.qiime2.org/2021.4/tutorials/moving-pictures/#obtaining-and-importing-data). ```bash qiime tools import\ --type 'SampleData[PairedEndSequencesWithQuality]' \ --input-path /data/raw.reads \ --output-path /data/16s.v4.demux.qza \ --input-format 'CasavaOneEightSingleLanePerSampleDirFmt' qiime demux summarize \ --i-data /data/16s.v4.demux.qza \ --o-visualization /data/16s.v4.demux.qzv ``` Download the `qzv` visualization files to the local machine as follows. ```bash scp nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/16s.v4.demux.qzv . ``` Drag and drop the `qzv` file into a web browser at https://view.qiime2.org/ ![](https://i.imgur.com/rC6KMrN.png) ## Denoise The next step is `dada2`, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call. ```bash qiime dada2 denoise-paired \ --i-demultiplexed-seqs /data/16s.v4.demux.qza \ --p-trunc-len-f 265 --p-trunc-len-r 160 \ --p-trim-left-f 34 --p-trim-left-r 34 \ --o-representative-sequences /data/16s.v4.rep.seqs.qza \ --o-table /data/16s.v4.feature.table.qza \ --o-denoising-stats /data/16s.v4.denoise.stats.qza \ --p-n-threads 16 qiime metadata tabulate \ --m-input-file /data/16s.v4.denoise.stats.qza \ --o-visualization /data/16s.v4.denoise.stats.qzv ``` Then export the feature table to a more flexible format (tab-seperated values). ```bash qiime tools export \ --input-path /data/16s.v4.feature.table.qza \ --output-path /data singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data /export/groups/singularity/qiime_202104.simg \ biom convert \ -i /data/feature-table.biom \ -o /data/16s.v4.feature.table.tsv \ --to-tsv ``` I ran the `dada2 denoise` process under a range of truncation parameters using a [bash script](https://github.com/aphanotus/BeeSeq/blob/main/dada2.filter.optimization.sh) to produce the `feature.table.qza` files and a summary `tsv` file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters. > :information_source: [the bash script](https://github.com/aphanotus/BeeSeq/blob/main/dada2.filter.optimization.sh) > > :information_source: [useful hints for bash scripting](https://devhints.io/bash) ![](https://i.imgur.com/3ShPQAT.jpg) These results from `dada2` are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the `dada2 denoise` algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good. These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better. > :warning: I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads. For now I'll move ahead with both files. The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the `taxonomy.tsv` file which is generated later. The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record. ```bash qiime feature-table tabulate-seqs \ --i-data /data/16s.v4.maxOTUs.rep.seqs.qza \ --o-visualization /data/16s.v4.maxOTUs.rep.seqs.qzv qiime feature-table tabulate-seqs \ --i-data /data/16s.v4.maxcts.rep.seqs.qza \ --o-visualization /data/16s.v4.maxcts.rep.seqs.qzv ``` The `feature-table summarize` command provides summary statistics and groups features on metadata category. ```bash qiime feature-table summarize \ --i-table /data/16s.v4.maxOTUs.feature.table.qza \ --o-visualization /data/16s.v4.maxOTUs.feature.table.qzv \ --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv qiime feature-table summarize \ --i-table /data/16s.v4.maxcts.feature.table.qza \ --o-visualization /data/16s.v4.maxcts.feature.table.qzv \ --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv ``` ## Diversity analysis Generate trees for phylogenetic diversity analyses. ```bash qiime phylogeny align-to-tree-mafft-fasttree \ --i-sequences /data/16s.v4.maxOTUs.rep.seqs.qza \ --o-alignment /data/16s.v4.maxOTUs.aligned-rep-seqs.qza \ --o-masked-alignment /data/16s.v4.maxOTUs.masked-aligned-rep-seqs.qza \ --o-tree /data/16s.v4.maxOTUs.unrooted-tree.qza \ --o-rooted-tree /data/16s.v4.maxOTUs.rooted-tree.qza qiime phylogeny align-to-tree-mafft-fasttree \ --i-sequences /data/16s.v4.maxcts.rep.seqs.qza \ --o-alignment /data/16s.v4.maxcts.aligned-rep-seqs.qza \ --o-masked-alignment /data/16s.v4.maxcts.masked-aligned-rep-seqs.qza \ --o-tree /data/16s.v4.maxcts.unrooted-tree.qza \ --o-rooted-tree /data/16s.v4.maxcts.rooted-tree.qza ``` Determine the optimal rarefaction level by looking at the `feature.table.qzv` visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D. ```bash qiime diversity core-metrics-phylogenetic \ --i-phylogeny /data/16s.v4.maxOTUs.rooted-tree.qza \ --i-table /data/16s.v4.maxOTUs.feature.table.qza \ --p-sampling-depth 6017 \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --output-dir /data/core-metrics-results.maxOTUs qiime diversity alpha-group-significance \ --i-alpha-diversity /data/core-metrics-results.maxOTUs/faith_pd_vector.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/core-metrics-results.maxOTUs/faith-pd-group-significance.qzv qiime diversity alpha-group-significance \ --i-alpha-diversity /data/core-metrics-results.maxOTUs/evenness_vector.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/core-metrics-results.maxOTUs/evenness-group-significance.qzv qiime diversity beta-group-significance \ --i-distance-matrix /data/core-metrics-results.maxOTUs/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --m-metadata-column hive_day \ --o-visualization /data/core-metrics-results.maxOTUs/unweighted-unifrac-hive_day-significance.qzv \ --p-pairwise qiime diversity core-metrics-phylogenetic \ --i-phylogeny /data/16s.v4.maxcts.rooted-tree.qza \ --i-table /data/16s.v4.maxcts.feature.table.qza \ --p-sampling-depth 6045 \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --output-dir /data/core-metrics-results.maxcts qiime diversity alpha-group-significance \ --i-alpha-diversity /data/core-metrics-results.maxcts/faith_pd_vector.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/core-metrics-results.maxcts/faith-pd-group-significance.qzv qiime diversity alpha-group-significance \ --i-alpha-diversity /data/core-metrics-results.maxcts/evenness_vector.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/core-metrics-results.maxcts/evenness-group-significance.qzv qiime diversity beta-group-significance \ --i-distance-matrix /data/core-metrics-results.maxcts/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --m-metadata-column hive_day \ --o-visualization /data/core-metrics-results.maxcts/unweighted-unifrac-hive_day-significance.qzv \ --p-pairwise ``` Qiime2’s core diversity metrics include: **Alpha diversity** - Shannon’s diversity index (a quantitative measure of community richness) - Observed OTUs (a qualitative measure of community richness) - Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features) - Evenness (or Pielou’s Evenness; a measure of community evenness) **Beta diversity** - Jaccard distance (a qualitative measure of community dissimilarity) - Bray-Curtis distance (a quantitative measure of community dissimilarity) - unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features) - weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features) ## Feature classification Naive Bayes classifiers have been built from the [SILVA release 138](https://www.arb-silva.de/documentation/release-138/) dataset, which was released in December 2019 and is currently the most recent major release. One was [provided by the Qiime2 developers](https://docs.qiime2.org/2021.4/data-resources/). The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs. ```bash cd /export/groups/drangeli/marker.ref.DBs/silva138/ wget "https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza" cd /export/groups/drangeli/beeSeq.project/16s.v4v5/ cp ../../marker.ref.DBs/silva138/silva-138-99-nb-classifier.qza . qiime feature-classifier classify-sklearn \ --i-reads /data/16s.v4.maxOTUs.rep.seqs.qza \ --i-classifier /data/silva-138-99-nb-classifier.qza \ --o-classification /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \ --p-n-jobs -3 qiime metadata tabulate \ --m-input-file /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \ --o-visualization /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qzv qiime tools export \ --input-path /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \ --output-path /data/. mv taxonomy.tsv 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv cut -f 3 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk ``` The median posterior support for taxonomy assignments is 0.976. ```bash qiime taxa barplot \ --i-table /data/16s.v4.maxOTUs.feature.table.qza \ --i-taxonomy /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/16s.v4.maxOTUs.silva138.taxa.barplot.qzv ``` ```bash qiime feature-classifier classify-sklearn \ --i-reads /data/16s.v4.maxcts.rep.seqs.qza \ --i-classifier /data/silva-138-99-nb-classifier.qza \ --o-classification /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \ --p-n-jobs -3 qiime metadata tabulate \ --m-input-file /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \ --o-visualization /data/16s.v4.maxcts.silva138.taxonomy.calls.qzv qiime tools export \ --input-path /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \ --output-path /data/. mv taxonomy.tsv 16s.v4.maxcts.silva138.taxonomy.calls.tsv cut -f 3 16s.v4.maxcts.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk ``` The median posterior support for taxonomy assignments is 0.984. ```bash qiime taxa barplot \ --i-table /data/16s.v4.maxcts.feature.table.qza \ --i-taxonomy /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --o-visualization /data/16s.v4.maxcts.silva138.taxa.barplot.qzv ``` Inspecting the most common OTUs, the classifications look sensible. *Snodgrassella*, *Candidatus Schmidhempelia*, and *Gilliamella* perdominate. ## Differential abundance analysis using ANCOM Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples. ```bash qiime feature-table filter-samples \ --i-table /data/16s.v4.maxOTUs.feature.table.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --p-where "day='25' AND NOT treatment='F'" \ --o-filtered-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza ``` Calculate pseudocounts (counts +1) to avoid log-of-zero errors. ```bash qiime composition add-pseudocount \ --i-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza \ --o-composition-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza ``` [ANCOM](https://www.ncbi.nlm.nih.gov/pubmed/26028277) (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups. ```bash qiime composition ancom \ --i-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \ --m-metadata-column treatment \ --o-visualization /data/16s.v4.maxOTUs.d25AD.ancom.by.treatment.qzv ``` One feature is identified as significantly different: `aa8b914828a465c360b7a079334746d8`, classifed only to the genus level as *Saccharibacter* by our Bayesian classifier, but a [BLASTn search](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=CGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGTTTATACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGATACGTATTGACTAGAGTCCGAGAGAGGATTGCGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCAGTTGCGAAGGCGGCAATCTGGCTCGGAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTGTGCTGGATGTTGGGAAGCTTAGCTTTTCAGTGTCGGAGCTAACGCGTTAAGCACACCGCCTGGGGAGTA) of the sequence to NCBI has a 99% identity match to *[Saccharibacter floricola](https://en.wikipedia.org/wiki/Saccharibacter_floricola)*. ![](https://i.imgur.com/w3F3WXC.jpg) I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few. ## 16S v6-v8 I ran all the same steps as above using the `16s.v6v8` dataset too. ### dada2 denoise optimization ![](https://i.imgur.com/N58p8YQ.jpg) This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis. ### Rarefaction Again, I chose sampling depths to retain all the day-25 samples in treatments A-D. - max OTUs: 4965 - max counts: 6075 ### Classification - max OTUs: 0.9428 median posterior support for classifications - max counts: 0.969 **The more I think about it, 16s v6-v8 filtered to maximize reads (`16s.v6.maxcts`) seems like the best dataset to proceed with.** ## ANCOM by treatment One significantly different OTU: `699d78936fbf1cc18ab30cede7baad9e` classified as *Saccharibacter floricola* with 0.756 posterior support. ![](https://i.imgur.com/Fq8GQY1.jpg) ## Differential abundance analysis with gneiss Another way to examine differences among treatments is the use of balance trees as implemented by the `gneiss` algorithm [Morton et al. 2017](https://msystems.asm.org/content/2/1/e00162-16). Qiime has an instructive [tutorial](https://docs.qiime2.org/2021.4/tutorials/gneiss/) on its application of gneiss. I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples. ```bash grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv ``` The functinality of `qiime gneiss` appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway. Gneiss can run based on trees created by clustering features based on their own correlations (`correlation-clustering`), or in relation to some numerical factor (`gradient-clustering`). ```bash qiime gneiss correlation-clustering \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --o-clustering /data/16s.v6.maxcts.d25AD.hierarchy.qza qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.hierarchy.qza \ --output-path /data/. mv tree.nwk 16s.v6.maxcts.d25AD.hierarchy.tre qiime gneiss ilr-hierarchical \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \ --o-balances /data/16s.v6.maxcts.d25AD.balances.qza qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.balances.qza \ --output-path /data/. singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \ biom convert \ -i /data/feature-table.biom \ -o /data/16s.v6.maxcts.d25AD.gneiss.balances.tsv \ --to-tsv sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.balances.tsv qiime gneiss dendrogram-heatmap \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --m-metadata-column treatment \ --p-color-map seismic \ --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.by.treatment.qzv ``` The *Saccharibacter floricola* feature `699d78936fbf1cc18ab30cede7baad9e`, highlighted by ANCOM, appears with 4 other features on one side of the `y2` split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them. | feature UUID | BLAST result | | :----------------------------------------------------------- | :----------------------------------------------------------- | | [d3286dc2d6be57a96881c9f6a43b2859](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAATTTGGCAGAGATGCCTTAGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGATTCGGTCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Citrobacter werkmanii*; *Klebsiella oxytoca*; several uncultured bacteria | | [8fa34849d5dded61d203db8249907103](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAATTTGGCAGAGATGCCTTAGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Citrobacter freundii* strain RTC1; *Enterobacter* sp. TS4; Uncultured bacterium AMW_D6 | | [8d625c592171b66b877523cc46ac3963](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGATTCGGTCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Klebsiella* or *Citrobacter* | | [8026cfce27b83d70eb073a10947f3f94](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Klebsiella* or *Citrobacter* | ![](https://i.imgur.com/JFzZdfK.png) This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained *Saccharibacter* sp. and 6 unidentified Enterobacteriaceae. I also tried the gradient clustering method. ```bash qiime gneiss gradient-clustering \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --m-gradient-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --m-gradient-column dose_ppb \ --o-clustering /data/16s.v6.maxcts.d25AD.grad.clust.qza qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.grad.clust.qza \ --output-path /data/. mv tree.nwk 16s.v6.maxcts.d25AD.grad.clust.tre qiime gneiss ilr-hierarchical \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \ --o-balances /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza \ --output-path /data/. singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \ biom convert \ -i /data/feature-table.biom \ -o /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv \ --to-tsv sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv qiime gneiss dendrogram-heatmap \ --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \ --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \ --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --m-metadata-column treatment \ --p-color-map seismic \ --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.grad.by.dose.qzv ``` From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the *Saccharibacter floricola* feature `699d78936fbf1cc18ab30cede7baad9e`, highlighted by ANCOM, appears with several other features around the `y9` split. `8fa34849d5dded61d203db8249907103` from the correlation gneiss analysis also appears here. | feature UUID | Classification or BLAST result | | :----------------------------------------------------------- | :----------------------------- | | 1a196db6e5bdd828fec6dd3e5dcbef1c | *Lactobacillus bombicola* | | f1459b5093b227402d67fe3d593d4e44 | *Lactobacillus bombicola* | | [48a57dc7a4e32a52c0ed2aac00d97c14](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GGATCGCGGCAGAGATGTCGTTTCCCTTCGGGGCCGGTTCACAGGTGGTGCATGGTCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTGTGTTGCCAGCGGGTTATGCCGGGAACTCACAAGGGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAGATCATCATGCCCCTTACGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAACGGGATGCGACATGGCGACATGGAGCGGATCCCTTAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGGCGGAGTCGCTAGTAATCGCGGATCAGCAACGCCGCGGT) | *Bifidobacterium bohemicum* | | [595d46a49f01ca56f91b8947ff9cc272](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GGATCGCGGCAGAGATGTCGTTTCCCTTCGGGGCCGGTTCACAGGTGGTGCATGGTCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTGTGTTGCCAGCGGGTTATGCCGGGAACTCACAAGGGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAGATCATCATGCCCCTTACGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAACGGGATGCGACATGGTGACATGGAGCGGATCCCTTAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGGCGGAGTCGCTAGTAATCGCGGATCAGCAACGCCGCGGT) | *Bifidobacterium bohemicum* | | 8026cfce27b83d70eb073a10947f3f94 | *Klebsiella* or *Citrobacter* | | 0eb3327ed1fcfec6377770463795cd5f | *Asaia* | ![](https://i.imgur.com/sEpb4VJ.png) ## Songbird Songbird is an algorithm to rank features based on their log-fold differences in abundance with respect to various metadata factors ([Morton et al. 2019](https://www.nature.com/articles/s41467-019-10656-5)). It allows for complex multinomial regression [models](https://github.com/biocore/songbird#3-specifying-a-formula-) with continuous and categorical factors. The order of factors n the model does not matter, except that differentials will be reported for the first factor, controlling for influences of the later factors. (This is similar to the logic of a linear mixed model, where the first term is a fixed effect and other terms are random effects.) For reasons that remain unclear, Songbird is only implemented in Qiime2 versions 2019.7 through 2020.6. Randy Downer has set-up a separate container for version 2020.6. While this is an implementation of Songbird that accompanied Qiime2, it's actually the stand-alone version of Songbird I'll be calling. As before, I'll define an alias to call the singularity container with Songbird. ```bash alias songbird singularity run --bind "$PWD":/data /export/groups/singularity/qiime_songbird_202006.simg songbird ``` Because we'll be running this program outside the Qiime2 environment, we need to export the feature table file from Qiime2's `qza` format into the `biom` format expected by Songbird. ```bash qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \ --output-path /data mv feature-table.biom 16s.v6.maxcts.d25AD.feature.table.biom ``` I'll try the obvious analysis first: looking for dose effects, controlling for the influence of hive. ```bash songbird multinomial \ --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \ --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --formula "dose_ppb+hive" \ --epochs 10000 \ --differential-prior 1 \ --summary-interval 1 \ --summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose.hive ``` Compare this against a null model. ```bash songbird multinomial \ --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \ --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --formula "1" \ --epochs 10000 \ --differential-prior 1 \ --summary-interval 1 \ --summary-dir /data/songbird.16s.v6.maxcts.d25AD.null ``` These runs completed in just a few minutes. To interpret the fit of the models, it's necessary to use TensorBoard to view the machine learning logs, which are sadly kept in binary outfield that can't otherwise be accessed. TensorBoard is run via Jupyter notebooks. Randy did a lot of trouble-shooting to figure out how to do this. First log into https://jupyter.colby.edu/. In the upper right select `New`, then `terminal`. A terminal window will open in a new tab, within JupyterHub. Create a link to the directory containing the Songbird output. For example, `ln -s /research/drangeli research`. Go back to the main JupyterHub tab and refresh. An item called `research` (in this example) should now appear. Clicking on it will provide the contents of `/research/drangeli/` in JupyterHub. Click through the directory structure util you're in the Songbird output folder. (JupyterHub can't access `/export/groups/`. So copy any work from there to `/research/`) Next, create a new python3 notebook, and enter the following code. (The `%` characters are required.) ```python %load_ext tensorboard %env TENSORBOARD_BINARY=/usr/local/anaconda/bin/tensorboard ``` Run that and create a second code block. ```python %tensorboard --logdir . --bind_all --path_prefix /tensorboard ``` Running this will give a blank image. Run it again. This time the error message will include a port number. Open a new browser tab to `http://jupyter.colby.edu:<portnumber>/tensorboard` It should produce a page looking like this. ![](https://i.imgur.com/Hj8vrHh.png) Based on the [Songbird docs for interpreting model fits](https://github.com/biocore/songbird#interpreting-model-fitting) both the cross validation accuracy (`cv_error`) and loss plots should decay exponentially and plateau. Loss is the squared difference of true values from predicted ones (`square(y_true - y_pred)`) and should ideally approach zero. And a good model should outperform the null model! ![](https://i.imgur.com/sfpCB2X.png) In TensorBoard, it's possible to download these logged values as a CSV file. It's arguable whether the `hive` factor is necessary. Let's run another model without it. ```bash songbird multinomial \ --input-biom /data/16s.v6.maxcts.d25AD.feature.table.biom \ --metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \ --formula "dose_ppb" \ --epochs 10000 \ --differential-prior 1 \ --summary-interval 1 \ --summary-dir /data/songbird.16s.v6.maxcts.d25AD.dose ``` Our metadata has several other factors. Since songbird doesn't take too long to run, I'll try more elaborate models to see if I run into overfitting. ```bash head -n 1 metadata.beeSeq.microbiota.d25AD.tsv ``` ``` #SampleID treatment dose_ppb treatment_day hive_day hive collection_rep plate row column well day RNAconc RNAabsratio DNAconc DNAabsratio description ``` ![](https://i.imgur.com/h1cXPBD.png) In these images I've overlaid the stable-phase means, judged by eye. The `dose+hive` model appears to be the best. It's accuracy and loss look indistinguishable from a model adding `plate`, a batch effect referring to the 96-well plate a DNA samples was stored and shipped on. Adding `DNAabsratio`, the DNA quality metric of A<sub>260</sub>/A<sub>280</sub>, has clearly resulted in overfitting, as indicated by the rising values of cv_error over time and the greater range in values for loss. Just using `dose` provides a small but noticeable improvement over the null model, by about 5 units of cv_error. This implies that `hive` actually has a much larger effect on microbial composition than does imidacloprid dose. Here are some other models I tried. ![](https://i.imgur.com/3jD3PP9.png) The factor `hive` alone provides a very good explanation of the composition. Adding `dose` or `treatment` might improve the model just slightly, but of course it's what we're most interested in. There appears to be little difference between using `dose` or `treatment`, which is just a categorical version of the dose factor. ![](https://i.imgur.com/uAXvHzT.png) I started wondering about the fact that because hives are nested within dose treatments, much of the variance in composition may be captured by `hive` rather than a factor conveying dose. Songbird is not like a Type I ANOVA, where variance is partitioned to the first factor in the model, and later factors account for remaining variance. So, I tried two categorical, binary factors `dosed`, which separates the control treatment from all dosed treatments, and `high_dose` which distinguishes the highest-dose treatment from all others, including the zero-dose control. These models may be marginal improvements over the `dose_pbb` model, but clearly `hive` needs to be accounted for. ![](https://i.imgur.com/wR1bhOX.png) While the Songbird docs don't discuss complex models, I simply tried using `hive` as a nested factor and found it works. Running the nested model longer and with a lower differential prior did not change the values at stability. (We 4 hives per treatment, so there are 15 hive terms (plus one for dose and an intercept). There are 5 individual samples per hive, and that's not the 10-to-1 ratio of samples to model terms that is [recommended](https://github.com/biocore/songbird#okay-so-how-should-i-adjust-parameters-to-get-my-model-to-fit-properly). So while the metrics look okay, I am concerned about overfitting.) Next, we can examine the differentials. Here they are from the `dose + hive` model (10,000 epochs; differential prior of 1). ![](https://i.imgur.com/MkddD0F.jpg) The plot on the left shows differentials for 32 OTUs based on imidacloprid dose in a model accounting for hive-specific influences. Several *Lactobacillus* are negatively associated with dose. Several *Enterobacter* species are positively associated. *Saccharibacter floricola*, identified by ANCOM and Gneiss as correlated with imidacloprid dose, does not emerge as strongly affected in this analysis. Using only dose as the predictor of abundance identified the same OTUs at the extremes, but there's more variation in the intermediate OTU ranks. Next is the differential plot from the nested model (100,000 epochs; differential prior of 0.5). ![](https://i.imgur.com/66JUAn2.jpg) ### Conclusions It's clear that Songbird detects very little signal associated with treatment. It may be the case that the signal here just isn't strong enough to use Songbird for this purpose. It may be better to highlight the findings of ANCOM and Gneiss, both of which highlight *Saccharibacter floricola* as differently expressed. It may also be useful to take the feature table into R, and look at composition at a higher taxonomic level, such as order or phylum. > **Resuming work in August 2022** > > It's been a while, working on other projects. Reviewing everything here, I'm satisfied that this previous work is still valid. I will proceed as suggested above. ## Count sums at higher taxonomic levels In order to look at higher taxonomic levels, it will be necessary to sum OTU counts based on their taxonomic classifications. This will be easiest in R. First, the feature table and classifications can be exported from the Qiime data files. ```bash cd /export/groups/drangeli/beeSeq.project/16s.v6v8 screen -S beeSeq alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime qiime tools export \ --input-path /data/16s.v6.maxcts.d25AD.feature.table.qza \ --output-path /data singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \ biom convert \ -i /data/feature-table.biom \ -o /data/16s.v6.maxcts.d25AD.feature.table.tsv \ --to-tsv ``` The files `16s.v6.maxcts.d25AD.feature.table.tsv` and `16s.v6.maxcts.silva138.taxonomy.calls.tsv` were copied locally for import into R. The script for this analysis is [`counts.by.taxonomic.level.R`](https://github.com/aphanotus/BeeSeq/blob/main/counts.by.taxonomic.level.R). Counts were normalized by the depth (`colSums`) for each sample. I checked all taxonomic levels. There were no dramatic patterns at higher taxonomic levels that weren't driven strongly by a few obvious species-level OTUs, with the exception of Enterobacteriaceae, where several species were all negatively affected by imidacloprid. I check the trends for each of the OTUs highlighted by Songbird. That algorithm is largely influenced by non-zero values. So, many of those didn't really seem different enough. In addition to the OTUs highlighted by ANCOM, Gneiss and Songbird, I picked out some others that seemed to have trends. Most of them are negatively dose-dependent. *Saccharibacter floricola* definitely shows the most dramatic dose-dependent effect. ![](plots/species.of.interest.plot.jpg) I also looked for correlations among the normalized counts for each of these taxa, using [Kendall's rank correlation](https://www.statsdirect.com/help/nonparametric_methods/kendall_correlation.htm), which should be [robust to zero-inflated data](https://digitalcommons.wayne.edu/jmasm/vol6/iss2/17/) such as these. The highest correlations were among the *Fructobacillus* variants, suggesting that the split of these into separate taxa is not particularly useful. However, the two *Snodgrassella* strains have a very strong negative correlation, which is interesting, but perhaps beside the point of our experiment. *Saccharibacter floricola* was most strongly correlated with several of the unclassified Enterobacteriaceae, *Fructobacillis*, and *Pseudomonas*, and negatively correlated with *Asaia*, *Lactobacillis bombicola*, and the uncultured *Bifidobacterium*. ![](plots/otu.correlation.plots.png) ## Relative abundance of OTUs-of-interest over time and by treatment ![](plots/otus.oi.time.table.plot.jpg)