BeeSeq

Dave Angelini
25 August 2021

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

This page will document the workflow used to integrate gene expression and microbiome analysis from the Bombus impatiens experiments conducted by Dave Angelini, Will Simmons and Anne Averill.

The project has lain dormant for some time. I am taking it up again in August 2021, and will start from scratch in order to ensure the most current tools are used.

While any process is running on nscc, you can monitor it's progress indirectly by checking the CPU usage on node 26 at http://schupflab.colby.edu/mrtg/nscc/nscc-n26.load.html

Notes from previous versions of this project

  • March 2018: Microbiome analysis and inital RNA data processing
  • Fall 2019: transcriptome analysis

The workflow described here is meant to be a complete repeat or update to those steps.

Summary of the experiment

Our goal was to examine effects on gene expression and microbiome composition in the gut of the bumblebee Bombus impatiens after chronic sublethal exposure to the neonicotinoid imidacloprid. Colonies were isolated in semi-field enclosures and provided imidacloprid in pollen and sugar syrup for 25 days, which spans the developmental time of new workers. Treatments were chosen to bracket the concentrations of imidacloprid detected in the pollen and nectar of crop plants. Workers were then sampled from replicate colonies. The gut was removed and used for microbiome and gene expression analyses.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Analysis Workflow Plan

Microbiome

This workflow will be applied to three datasets: 16s v4-v5 MiSeq reads and reads from v6-v8. These should tell similar stories. In the end we might focus the manuscript on which ever has more reliable OTU classifications. We also have data from 18S, which in theory should reveal eukaryotic biota. However, databases for 18S classification are not as extensive and the data are confounded by overwhelming signaling from the bee itself.

Qiime2: import, demux

dada2 denoise

optimize trunc-len parameters

mafft, mask, FastTree

Silva classifier

rarefaction, diversity

add pseudocount

ANCOM (check Mark's workflow)

Songbird

MiSeq reads

demux seqs

feature table (inital)

feature table (final)

OTU tree

OTU classification

diversity metrics by group

comp table

differential abundance by group

Expression data

We have traditional full-length transcript RNAseq reads from 4 gut samples, and 3seq (3' end tag sequence) reads from 96 samples from the last time point in the experiment. Unfortunately, the current annotation of the B. impatiens gene set is not very extensive. In particular, our analysis would benefit from GO term annotation for each gene. Therefore, this workflow will include making those annotations with EnTAP.

3seq Data

Full-length RNAseq

Genome Annotation

EnTAP

bowtie2-build

makeblastdb

tBLASTn

fastq_screen

fastq_screen

trim

Trinity

TransDecoder

tBLASTn

BUSCO

EnTAP

kallisto index

gmap_build

trim

gmap_build

GMAP

kallisto

kallisto

ID matching

ID matching

ID matching

ID matching

gene models (FASTA)

gene models (GFF)

GO, KEGG, PAM, etc.

potential xenic mRNAs

indexed xenic mRNAs

xenic blastDB

contaminant info

filtered reads

raw reads

gut transcriptome

ORF predictions

coverage info

transcriptome index

GMAP index

raw reads

filtered reads

transcript-genome map

count matrix

annotated count matrix

The entire project is being run from nscc's node 26 at /export/groups/drangeli/beeSeq.project/

Microbiome

Data source

Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v4-v5.

  • 515F GTGYCAGCMGCCGCGGTAA
  • 926R CCGYCAATTYMTTTRAGTTT
  • The product is 410 bp, resulting in a 190 bp overlap.

Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. Link

Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. Link

And 16S v6-v8.

  • B969F ACGCGHNRAACCTTACC
  • BA1406R ACGGGCRGTGWGTRCAA
  • The product is 450 bp, resulting in a 150 bp overlap.

Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link

Using Singularity

Note that the singularity container doesn't initially have access to /export/groups, where I have all the data. Randy tells me this is for security reasons. In any case, the data folder must be mounted or "bound" within the container using the --bind arguement. This creates a folder at the root, /data, that contains the data in the working directoty (that is, the output of pwd, such as /export/groups/drangeli/beeSeq.project/16s.v4v5/, but it only exists within the container. So paths in the qiime call refer to files using the /data/ path.

Start by running screen. Then define an alias to run Qiime2 via singularity. The alias will stay active as long as the screen instance is maintained.

alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime

Get out of the screen (and leave it running) by typing the keys CTRL+A, then D. When you're finally done using the screen, enter the command exit.

More on Singularity and data mounting

There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument --bind "$PWD":/export/home/drangeli will replace the starting folder in the container with the contents of the current directory outside the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit \data\ binding because it seems safer.

Data processing

For the 16S v4-v5 reads, I'm working in nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/. All the steps will be mirrored for the v6-v8 data in ../16s.v6v8/.

I moved the older processing files into a sub-folder summer.2018.analysis/.

Raw reads are in the raw.reads folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion.

Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to my notes from 2018 and to the current Moving Pictures tutortial from the Qiime2 docs.

qiime tools import\
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /data/raw.reads \
  --output-path /data/16s.v4.demux.qza \
  --input-format 'CasavaOneEightSingleLanePerSampleDirFmt' 

qiime demux summarize \
  --i-data /data/16s.v4.demux.qza \
  --o-visualization /data/16s.v4.demux.qzv

Download the qzv visualization files to the local machine as follows.

scp nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/16s.v4.demux.qzv .

Drag and drop the qzv file into a web browser at https://view.qiime2.org/

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Denoise

The next step is dada2, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /data/16s.v4.demux.qza \
  --p-trunc-len-f 265 --p-trunc-len-r 160 \
  --p-trim-left-f 34  --p-trim-left-r 34 \
  --o-representative-sequences /data/16s.v4.rep.seqs.qza \
  --o-table /data/16s.v4.feature.table.qza \
  --o-denoising-stats /data/16s.v4.denoise.stats.qza \
  --p-n-threads 16
  
qiime metadata tabulate \
  --m-input-file /data/16s.v4.denoise.stats.qza \
  --o-visualization /data/16s.v4.denoise.stats.qzv  

Then export the feature table to a more flexible format (tab-seperated values).

qiime tools export  \
  --input-path /data/16s.v4.feature.table.qza \
  --output-path /data

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v4.feature.table.tsv \
    --to-tsv

I ran the dada2 denoise process under a range of truncation parameters using a bash script to produce the feature.table.qza files and a summary tsv file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters.

These results from dada2 are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the dada2 denoise algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good.

These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better.

I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads.

For now I'll move ahead with both files.

The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the taxonomy.tsv file which is generated later.

The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record.

qiime feature-table tabulate-seqs \
  --i-data /data/16s.v4.maxOTUs.rep.seqs.qza \
  --o-visualization /data/16s.v4.maxOTUs.rep.seqs.qzv

qiime feature-table tabulate-seqs \
  --i-data /data/16s.v4.maxcts.rep.seqs.qza \
  --o-visualization /data/16s.v4.maxcts.rep.seqs.qzv

The feature-table summarize command provides summary statistics and groups features on metadata category.

qiime feature-table summarize \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --o-visualization /data/16s.v4.maxOTUs.feature.table.qzv \
  --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv

qiime feature-table summarize \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --o-visualization /data/16s.v4.maxcts.feature.table.qzv \
  --m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv

Diversity analysis

Generate trees for phylogenetic diversity analyses.

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences /data/16s.v4.maxOTUs.rep.seqs.qza \
  --o-alignment /data/16s.v4.maxOTUs.aligned-rep-seqs.qza \
  --o-masked-alignment /data/16s.v4.maxOTUs.masked-aligned-rep-seqs.qza \
  --o-tree /data/16s.v4.maxOTUs.unrooted-tree.qza \
  --o-rooted-tree /data/16s.v4.maxOTUs.rooted-tree.qza

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences /data/16s.v4.maxcts.rep.seqs.qza \
  --o-alignment /data/16s.v4.maxcts.aligned-rep-seqs.qza \
  --o-masked-alignment /data/16s.v4.maxcts.masked-aligned-rep-seqs.qza \
  --o-tree /data/16s.v4.maxcts.unrooted-tree.qza \
  --o-rooted-tree /data/16s.v4.maxcts.rooted-tree.qza

Determine the optimal rarefaction level by looking at the feature.table.qzv visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny /data/16s.v4.maxOTUs.rooted-tree.qza \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --p-sampling-depth 6017 \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --output-dir /data/core-metrics-results.maxOTUs

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxOTUs/faith_pd_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxOTUs/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxOTUs/evenness_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxOTUs/evenness-group-significance.qzv
  
qiime diversity beta-group-significance \
  --i-distance-matrix /data/core-metrics-results.maxOTUs/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column hive_day \
  --o-visualization /data/core-metrics-results.maxOTUs/unweighted-unifrac-hive_day-significance.qzv \
  --p-pairwise
  
  
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny /data/16s.v4.maxcts.rooted-tree.qza \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --p-sampling-depth 6045 \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --output-dir /data/core-metrics-results.maxcts
  
qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxcts/faith_pd_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxcts/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity /data/core-metrics-results.maxcts/evenness_vector.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/core-metrics-results.maxcts/evenness-group-significance.qzv  
  
qiime diversity beta-group-significance \
  --i-distance-matrix /data/core-metrics-results.maxcts/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column hive_day \
  --o-visualization /data/core-metrics-results.maxcts/unweighted-unifrac-hive_day-significance.qzv \
  --p-pairwise  

Qiime2’s core diversity metrics include:

Alpha diversity

  • Shannon’s diversity index (a quantitative measure of community richness)
  • Observed OTUs (a qualitative measure of community richness)
  • Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)
  • Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity

  • Jaccard distance (a qualitative measure of community dissimilarity)
  • Bray-Curtis distance (a quantitative measure of community dissimilarity)
  • unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
  • weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

OTU Classification

Naive Bayes classifiers have been built from the SILVA release 138 dataset, which was released in December 2019 and is currently the most recent. One was provided by the Qiime2 developers.

The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs.

cd /export/groups/drangeli/marker.ref.DBs/silva138/
wget "https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza"
cd /export/groups/drangeli/beeSeq.project/16s.v4v5/
cp ../../marker.ref.DBs/silva138/silva-138-99-nb-classifier.qza .

qiime feature-classifier classify-sklearn \
  --i-reads /data/16s.v4.maxOTUs.rep.seqs.qza \
  --i-classifier /data/silva-138-99-nb-classifier.qza \
  --o-classification /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --p-n-jobs -3

qiime metadata tabulate \
  --m-input-file /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --o-visualization /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qzv

qiime tools export  \
  --input-path /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --output-path /data/. 
mv taxonomy.tsv 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv

cut -f 3 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.976.

qiime taxa barplot \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --i-taxonomy /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/16s.v4.maxOTUs.silva138.taxa.barplot.qzv
qiime feature-classifier classify-sklearn \
  --i-reads /data/16s.v4.maxcts.rep.seqs.qza \
  --i-classifier /data/silva-138-99-nb-classifier.qza \
  --o-classification /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --p-n-jobs -3

qiime metadata tabulate \
  --m-input-file /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --o-visualization /data/16s.v4.maxcts.silva138.taxonomy.calls.qzv

qiime tools export  \
  --input-path /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --output-path /data/. 
mv taxonomy.tsv 16s.v4.maxcts.silva138.taxonomy.calls.tsv

cut -f 3 16s.v4.maxcts.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.984.

qiime taxa barplot \
  --i-table /data/16s.v4.maxcts.feature.table.qza \
  --i-taxonomy /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --o-visualization /data/16s.v4.maxcts.silva138.taxa.barplot.qzv

Inspecting the most common OTUs, the classifications look sensible. Snodgrassella, Candidatus Schmidhempelia, and Gilliamella perdominate.

Differential abundance analysis using ANCOM

Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples.

qiime feature-table filter-samples \
  --i-table /data/16s.v4.maxOTUs.feature.table.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --p-where "day='25' AND NOT treatment='F'" \
  --o-filtered-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza

Calculate pseudocounts (counts +1) to avoid log-of-zero errors.

qiime composition add-pseudocount \
  --i-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza \
  --o-composition-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza

ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.

qiime composition ancom \
  --i-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization /data/16s.v4.maxOTUs.d25AD.ancom.by.treatment.qzv

One feature is identified as significantly different: aa8b914828a465c360b7a079334746d8, classifed only to the genus level as Saccharibacter by our Bayesian classifier, but a BLASTn search of the sequence to NCBI has a 99% identity match to Saccharibacter floricola.

I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few.

16S v6-v8

I ran all the same steps as above using the 16s.v6v8 dataset too.

dada2 denoise optimization

This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis.

Rarefaction

Again, I chose sampling depths to retain all the day-25 samples in treatments A-D.

  • max OTUs: 4965
  • max counts: 6075

Classification

  • max OTUs: 0.9428 median posterior support for classifications
  • max counts: 0.969

The more I think about it, 16s v6-v8 filtered to maximize reads (16s.v6.maxcts) seems like the best dataset to proceed with.

ANCOM by treatment

One significantly different OTU: 699d78936fbf1cc18ab30cede7baad9e classified as Saccharibacter floricola with 0.756 posterior support.

Gneiss

Differential abundance analysis with gneiss

Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.

I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples.

grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp
head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv
grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv

The functinality of qiime gneiss appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway.

Gneiss can run based on trees created by clustering features based on their own correlations (correlation-clustering), or in relation to some numerical factor (gradient-clustering).

qiime gneiss correlation-clustering \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --o-clustering /data/16s.v6.maxcts.d25AD.hierarchy.qza
  
qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --output-path /data/.

mv tree.nwk 16s.v6.maxcts.d25AD.hierarchy.tre

qiime gneiss ilr-hierarchical \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --o-balances /data/16s.v6.maxcts.d25AD.balances.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.balances.qza \
  --output-path /data/.

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v6.maxcts.d25AD.gneiss.balances.tsv \
    --to-tsv

sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.balances.tsv

qiime gneiss dendrogram-heatmap \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.by.treatment.qzv

The Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e, highlighted by ANCOM, appears with 4 other features on one side of the y2 split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them.

feature UUID BLAST result
d3286dc2d6be57a96881c9f6a43b2859 Citrobacter werkmanii; Klebsiella oxytoca; several uncultured bacteria
8fa34849d5dded61d203db8249907103 Citrobacter freundii strain RTC1; Enterobacter sp. TS4; Uncultured bacterium AMW_D6
8d625c592171b66b877523cc46ac3963 Klebsiella or Citrobacter
8026cfce27b83d70eb073a10947f3f94 Klebsiella or Citrobacter

This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained Saccharibacter sp. and 6 unidentified Enterobacteriaceae.

I also tried the gradient clustering method.

qiime gneiss gradient-clustering \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --m-gradient-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-gradient-column dose_ppb \
  --o-clustering /data/16s.v6.maxcts.d25AD.grad.clust.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --output-path /data/.

mv tree.nwk 16s.v6.maxcts.d25AD.grad.clust.tre
  
qiime gneiss ilr-hierarchical \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --o-balances /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza

qiime tools export \
  --input-path /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza \
  --output-path /data/.

singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
  biom convert \
    -i /data/feature-table.biom \
    -o /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv \
    --to-tsv

sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv

qiime gneiss dendrogram-heatmap \
  --i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
  --i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
  --m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization /data/16s.v6.maxcts.d25AD.gneiss.grad.by.dose.qzv

From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e, highlighted by ANCOM, appears with several other features around the y9 split. 8fa34849d5dded61d203db8249907103 from the correlation gneiss analysis also appears here.

feature UUID Classification or BLAST result
1a196db6e5bdd828fec6dd3e5dcbef1c Lactobacillus bombicola
f1459b5093b227402d67fe3d593d4e44 Lactobacillus bombicola
48a57dc7a4e32a52c0ed2aac00d97c14 Bifidobacterium bohemicum
595d46a49f01ca56f91b8947ff9cc272 Bifidobacterium bohemicum
8026cfce27b83d70eb073a10947f3f94 Klebsiella or Citrobacter
0eb3327ed1fcfec6377770463795cd5f Asaia

Songbird

BOOKMARK: Working from here

https://github.com/biocore/songbird#2-using-songbird-through-qiime-2

pip install songbird

qiime dev refresh-cache

qiime songbird multinomial \
	--i-table redsea.biom.qza \
	--m-metadata-file data/redsea/redsea_metadata.txt \
	--p-formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
	--p-epochs 10000 \
	--p-differential-prior 0.5 \
	--p-training-column Testing \
	--p-summary-interval 1 \
	--o-differentials differentials.qza \
	--o-regression-stats regression-stats.qza \
	--o-regression-biplot regression-biplot.qza

qiime songbird summarize-single \
	--i-regression-stats regression-stats.qza \
	--o-visualization regression-summary.qzv

see this section on interpreting model fitting for details on how to understand these plots.

Gene expression

Genome Annotation

The current gene set annotation fot B. impatiens does not include GO terms, which will be very helpful in making sense of the gene expression results.

EnTAP combines annotations from protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. And it weighs evidence from these different sources to make decisions about applying annotations to gene models. You can list the target organism in EnTAP and it will leverage NCBI taxonomy. To implement EnTAP, I'll be relying on notes from a workshop on functional genome annotation run by UConn and available on GitHub.

The entap_config.txt file should look like

diamond_exe_path=diamond
rsem_exe_path=/isg/shared/apps/EnTAP/0.9.0-beta/libs/RSEM-1.3.0
genemarkst_exe_path=perl /isg/shared/apps/EnTAP/0.9.0-beta/libs/gmst_linux_64/gmst.pl
eggnog_sql_database=/isg/shared/apps/EnTAP/0.9.0-beta/databases/databases/eggnog.db
eggnog_dmnd_database=/isg/shared/apps/EnTAP/0.9.0-beta/databases/bin/eggnog_proteins.dmnd
interpro_exe_path=interproscan.sh
entap_database_sql_path=
entap_database_bin_path=/isg/shared/apps/EnTAP/0.9.0-beta/databases/bin/entap_database.bin
entap_graphing_script=
EnTAP --runP \
-i ../../16_gfacts/mono_o/genes.fasta.faa \
-d /isg/shared/databases/Diamond/RefSeq/complete.protein.faa.205.dmnd \
-d /isg/shared/databases/Diamond/Uniprot/uniprot_sprot.dmnd \
--ontology 0 \
--threads 16

BOOKMARK: Working from here

The EnTAP log file has useful info on e.g. species from which annotations were drawn, % sequences without alignment or annotations

Full-length transcripts

Quality control was done previously using FastQC and trimming was done with Trimmomatic. There were 358,759,635 reads passing the filter criteria (HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50). I also did a pretty throrough examination of xenic or contaminant sequences in the pool of full-length reads. I'm satisied with those results.

Trinity assembly

Transcript assembly is a critical step. Since the 3seq reads will be mapped to transcripts in order to calculate counts for gene expression, the quality of the transcriptome assembly is important. Trinity is the go-to transcriptome assembler. Our 2019 analysis was done with an assenbly by Trinity-v2.4.0. The current version is v2.13.1. I don't know if the new version will produce a meaningfully different assembly. But I'll try it, and see if the number of predicted transcripts and genes differs much. If so (either up or down), the new version is likely to be more reliable.

Trinity --seqType fq --max_memory 64G --CPU 12 \
--single filtered_reads/*.fastq \
--output trinity.out.dir.211004 > Bi.RNAseq.Trinity.run.211004.log



BOOKMARK: Working from here