# BeeSeq
Dave Angelini
25 August 2021

This page will document the workflow used to integrate gene expression and microbiome analysis from the *Bombus impatiens* experiments conducted by Dave Angelini, Will Simmons and Anne Averill.
The project has lain dormant for some time. I am taking it up again in August 2021, and will start from scratch in order to ensure the most current tools are used.
:::info
While any process is running on nscc, you can monitor it's progress indirectly by checking the CPU usage on node 26 at http://schupflab.colby.edu/mrtg/nscc/nscc-n26.load.html
:::
## Notes from previous versions of this project
- [March 2018](https://hackmd.io/@aphanotus/BeeSeqProject): Microbiome analysis and inital RNA data processing
- [Fall 2019](https://hackmd.io/@aphanotus/BeeSeq20): transcriptome analysis
The workflow described here is meant to be a complete repeat or update to those steps.
## Summary of the experiment
Our goal was to examine effects on gene expression and microbiome composition in the gut of the bumblebee *Bombus impatiens* after chronic sublethal exposure to the neonicotinoid imidacloprid. Colonies were isolated in semi-field enclosures and provided imidacloprid in pollen and sugar syrup for 25 days, which spans the developmental time of new workers. Treatments were chosen to bracket the concentrations of imidacloprid detected in the pollen and nectar of crop plants. Workers were then sampled from replicate colonies. The gut was removed and used for microbiome and gene expression analyses.

## Analysis Workflow Plan
### Microbiome
This workflow will be applied to three datasets: 16s v4-v5 MiSeq reads and reads from v6-v8. These should tell similar stories. In the end we might focus the manuscript on which ever has more reliable OTU classifications. We also have data from 18S, which in theory should reveal eukaryotic biota. However, databases for 18S classification are not as extensive and the data are confounded by overwhelming signaling from the bee itself.
```mermaid
graph TD
A("MiSeq reads") ==>|Qiime2: import, demux| B(demux seqs)
B ==>|dada2 denoise| C("feature table (inital)")
C ==>|optimize trunc-len parameters| D("feature table (final)")
D ==>|mafft, mask, FastTree| tree(OTU tree)
D ==>|Silva classifier| E(OTU classification)
D ==>|rarefaction, diversity| div(diversity metrics by group)
D ==>|add pseudocount| F(comp table)
F ==>|"ANCOM (check Mark's workflow)"| G(differential abundance by group)
F ==>|"Songbird"| G
```
### Expression data
We have traditional full-length transcript RNAseq reads from 4 gut samples, and 3seq (3' end tag sequence) reads from 96 samples from the last time point in the experiment. Unfortunately, the [current annotation of the *B. impatiens* gene set](https://www.ncbi.nlm.nih.gov/genome/?term=txid132113[orgn]) is not very extensive. In particular, our analysis would benefit from [GO term annotation](http://geneontology.org/) for each gene. Therefore, this workflow will include making those annotations with [EnTAP](https://entap.readthedocs.io/en/v0.10.8-beta/Getting_Started/introduction.html).
```mermaid
graph LR
subgraph Genome Annotation
predictedgenes("gene models (FASTA)")
ncbi("gene models (GFF)") ==>|EnTAP| entap_output(GO, KEGG, PAM, etc.)
end
subgraph Full-length RNAseq
congen("potential xenic mRNAs") ==>|bowtie2-build| xcongen("indexed xenic mRNAs")
congen ==> |makeblastdb| blastcongen(xenic blastDB)
blastcongen ==>|tBLASTn| contam(contaminant info)
xcongen ==>|fastq_screen| contam
rna2 ==>|fastq_screen| contam(contaminant info)
rna(raw reads) ==>|trim| rna2(filtered reads)
rna2 ==>|Trinity| txome(gut transcriptome)
txome ==>|TransDecoder| cds(ORF predictions)
cds ==>|tBLASTn| contam
cds ==>|BUSCO| coverage(coverage info)
cds ==>|EnTAP| entap_output
txome ==>|kallisto index| txomeindex(transcriptome index)
predictedgenes ==>|gmap_build| geneindex(GMAP index)
end
subgraph 3seq Data
seqreads(raw reads) ==>|trim| seqreads2(filtered reads)
end
cds ==>|gmap_build| geneindex
geneindex ==>|GMAP| geneids(transcript-genome map)
txomeindex ==> |kallisto| counts(count matrix)
seqreads2 ==> |kallisto| counts
geneids ==>|ID matching| genects(annotated count matrix)
counts ==>|ID matching| genects
entap_output ==>|ID matching| genects
contam ==>|ID matching| genects
```
The entire project is being run from **`nscc`**'s node 26 at `/export/groups/drangeli/beeSeq.project/`
## Microbiome
### Data source
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the [Dalhousie University IMR](http://cgeb-imr.ca/index.html), from amplicons targetting 16S v4-v5.
* 515F `GTGYCAGCMGCCGCGGTAA`
* 926R `CCGYCAATTYMTTTRAGTTT`
* The product is **410 bp**, resulting in a 190 bp overlap.
:::info
Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. *Environ Microbiol* 18(5):1403-14. [Link](http://dx.doi.org/10.1111/1462-2920.13023)
:::
:::info
Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. *mSystems* 1(1):e00009-15. [Link](https://msystems.asm.org/content/1/1/e00009-15)
:::
And 16S v6-v8.
* B969F `ACGCGHNRAACCTTACC`
* BA1406R `ACGGGCRGTGWGTRCAA`
* The product is **450 bp**, resulting in a 150 bp overlap.
:::info
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. *PLoS ONE* 6(11): e27492. [Link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0027492)
:::
### Using Singularity
Note that the singularity container doesn't initially have access to `/export/groups`, where I have all the data. Randy tells me this is for security reasons. In any case, the data folder must be mounted or "bound" within the container using the `--bind` arguement. This creates a folder at the root, `/data`, that contains the data in the working directoty (that is, the output of `pwd`, such as `/export/groups/drangeli/beeSeq.project/16s.v4v5/`, but it only exists within the container. So paths in the `qiime` call refer to files using the `/data/` path.
Start by running `screen`. Then define an alias to run [Qiime2](https://docs.qiime2.org/) via [singularity](https://sylabs.io/guides/3.5/user-guide/introduction.html). The alias will stay active as long as the screen instance is maintained.
```bash
alias qiime singularity run --bind "$PWD":/data /export/groups/singularity/qiime_202104.simg qiime
```
Get out of the screen (and leave it running) by typing the keys `CTRL`+`A`, then `D`. When you're finally done using the screen, enter the command `exit`.
:::spoiler **More on Singularity and data mounting**
:::info
There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument `--bind "$PWD":/export/home/drangeli` will replace the starting folder in the container with the contents of the current directory *outside* the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit `\data\` binding because it seems safer.
:::
### Data processing
For the 16S v4-v5 reads, I'm working in `nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/`. All the steps will be mirrored for the v6-v8 data in `../16s.v6v8/`.
I moved the older processing files into a sub-folder `summer.2018.analysis/`.
Raw reads are in the `raw.reads` folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion.
Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to [my notes from 2018](https://hackmd.io/ws-zHwGKSYicgUyU9zl-Qg?both#Data-processing) and to [the current *Moving Pictures* tutortial from the Qiime2 docs](https://docs.qiime2.org/2021.4/tutorials/moving-pictures/#obtaining-and-importing-data).
```bash
qiime tools import\
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /data/raw.reads \
--output-path /data/16s.v4.demux.qza \
--input-format 'CasavaOneEightSingleLanePerSampleDirFmt'
qiime demux summarize \
--i-data /data/16s.v4.demux.qza \
--o-visualization /data/16s.v4.demux.qzv
```
Download the `qzv` visualization files to the local machine as follows.
```bash
scp nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/16s.v4.demux.qzv .
```
Drag and drop the `qzv` file into a web browser at https://view.qiime2.org/

### Denoise
The next step is `dada2`, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call.
```bash
qiime dada2 denoise-paired \
--i-demultiplexed-seqs /data/16s.v4.demux.qza \
--p-trunc-len-f 265 --p-trunc-len-r 160 \
--p-trim-left-f 34 --p-trim-left-r 34 \
--o-representative-sequences /data/16s.v4.rep.seqs.qza \
--o-table /data/16s.v4.feature.table.qza \
--o-denoising-stats /data/16s.v4.denoise.stats.qza \
--p-n-threads 16
qiime metadata tabulate \
--m-input-file /data/16s.v4.denoise.stats.qza \
--o-visualization /data/16s.v4.denoise.stats.qzv
```
Then export the feature table to a more flexible format (tab-seperated values).
```bash
qiime tools export \
--input-path /data/16s.v4.feature.table.qza \
--output-path /data
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v4.feature.table.tsv \
--to-tsv
```
I ran the `dada2 denoise` process under a range of truncation parameters using a [bash script](https://github.com/aphanotus/BeeSeq/blob/main/dada2.filter.optimization.sh) to produce the `feature.table.qza` files and a summary `tsv` file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters.
:::spoiler
:::info
- [the bash script](https://github.com/aphanotus/BeeSeq/blob/main/dada2.filter.optimization.sh)
- [useful hints for bash scripting](https://devhints.io/bash)
:::

These results from `dada2` are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the `dada2 denoise` algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good.
These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better.
:::warning
I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads.
For now I'll move ahead with both files.
:::
The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the `taxonomy.tsv` file which is generated later.
The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record.
```bash
qiime feature-table tabulate-seqs \
--i-data /data/16s.v4.maxOTUs.rep.seqs.qza \
--o-visualization /data/16s.v4.maxOTUs.rep.seqs.qzv
qiime feature-table tabulate-seqs \
--i-data /data/16s.v4.maxcts.rep.seqs.qza \
--o-visualization /data/16s.v4.maxcts.rep.seqs.qzv
```
The `feature-table summarize` command provides summary statistics and groups features on metadata category.
```bash
qiime feature-table summarize \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--o-visualization /data/16s.v4.maxOTUs.feature.table.qzv \
--m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv
qiime feature-table summarize \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--o-visualization /data/16s.v4.maxcts.feature.table.qzv \
--m-sample-metadata-file /data/metadata.beeSeq.microbiota.tsv
```
### Diversity analysis
Generate trees for phylogenetic diversity analyses.
```bash
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences /data/16s.v4.maxOTUs.rep.seqs.qza \
--o-alignment /data/16s.v4.maxOTUs.aligned-rep-seqs.qza \
--o-masked-alignment /data/16s.v4.maxOTUs.masked-aligned-rep-seqs.qza \
--o-tree /data/16s.v4.maxOTUs.unrooted-tree.qza \
--o-rooted-tree /data/16s.v4.maxOTUs.rooted-tree.qza
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences /data/16s.v4.maxcts.rep.seqs.qza \
--o-alignment /data/16s.v4.maxcts.aligned-rep-seqs.qza \
--o-masked-alignment /data/16s.v4.maxcts.masked-aligned-rep-seqs.qza \
--o-tree /data/16s.v4.maxcts.unrooted-tree.qza \
--o-rooted-tree /data/16s.v4.maxcts.rooted-tree.qza
```
Determine the optimal rarefaction level by looking at the `feature.table.qzv` visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D.
```bash
qiime diversity core-metrics-phylogenetic \
--i-phylogeny /data/16s.v4.maxOTUs.rooted-tree.qza \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--p-sampling-depth 6017 \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--output-dir /data/core-metrics-results.maxOTUs
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxOTUs/faith_pd_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxOTUs/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxOTUs/evenness_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxOTUs/evenness-group-significance.qzv
qiime diversity beta-group-significance \
--i-distance-matrix /data/core-metrics-results.maxOTUs/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column hive_day \
--o-visualization /data/core-metrics-results.maxOTUs/unweighted-unifrac-hive_day-significance.qzv \
--p-pairwise
qiime diversity core-metrics-phylogenetic \
--i-phylogeny /data/16s.v4.maxcts.rooted-tree.qza \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--p-sampling-depth 6045 \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--output-dir /data/core-metrics-results.maxcts
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxcts/faith_pd_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxcts/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity /data/core-metrics-results.maxcts/evenness_vector.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/core-metrics-results.maxcts/evenness-group-significance.qzv
qiime diversity beta-group-significance \
--i-distance-matrix /data/core-metrics-results.maxcts/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column hive_day \
--o-visualization /data/core-metrics-results.maxcts/unweighted-unifrac-hive_day-significance.qzv \
--p-pairwise
```
Qiime2’s core diversity metrics include:
**Alpha diversity**
- Shannon’s diversity index (a quantitative measure of community richness)
- Observed OTUs (a qualitative measure of community richness)
- Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)
- Evenness (or Pielou’s Evenness; a measure of community evenness)
**Beta diversity**
- Jaccard distance (a qualitative measure of community dissimilarity)
- Bray-Curtis distance (a quantitative measure of community dissimilarity)
- unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
- weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
### OTU Classification
Naive Bayes classifiers have been built from the [SILVA release 138](https://www.arb-silva.de/documentation/release-138/) dataset, which was released in December 2019 and is currently the most recent. One was [provided by the Qiime2 developers](https://docs.qiime2.org/2021.4/data-resources/).
The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs.
```bash
cd /export/groups/drangeli/marker.ref.DBs/silva138/
wget "https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza"
cd /export/groups/drangeli/beeSeq.project/16s.v4v5/
cp ../../marker.ref.DBs/silva138/silva-138-99-nb-classifier.qza .
qiime feature-classifier classify-sklearn \
--i-reads /data/16s.v4.maxOTUs.rep.seqs.qza \
--i-classifier /data/silva-138-99-nb-classifier.qza \
--o-classification /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--p-n-jobs -3
qiime metadata tabulate \
--m-input-file /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--o-visualization /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qzv
qiime tools export \
--input-path /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--output-path /data/.
mv taxonomy.tsv 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv
cut -f 3 16s.v4.maxOTUs.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk
```
The median posterior support for taxonomy assignments is 0.976.
```bash
qiime taxa barplot \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--i-taxonomy /data/16s.v4.maxOTUs.silva138.taxonomy.calls.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/16s.v4.maxOTUs.silva138.taxa.barplot.qzv
```
```bash
qiime feature-classifier classify-sklearn \
--i-reads /data/16s.v4.maxcts.rep.seqs.qza \
--i-classifier /data/silva-138-99-nb-classifier.qza \
--o-classification /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--p-n-jobs -3
qiime metadata tabulate \
--m-input-file /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--o-visualization /data/16s.v4.maxcts.silva138.taxonomy.calls.qzv
qiime tools export \
--input-path /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--output-path /data/.
mv taxonomy.tsv 16s.v4.maxcts.silva138.taxonomy.calls.tsv
cut -f 3 16s.v4.maxcts.silva138.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk
```
The median posterior support for taxonomy assignments is 0.984.
```bash
qiime taxa barplot \
--i-table /data/16s.v4.maxcts.feature.table.qza \
--i-taxonomy /data/16s.v4.maxcts.silva138.taxonomy.calls.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--o-visualization /data/16s.v4.maxcts.silva138.taxa.barplot.qzv
```
Inspecting the most common OTUs, the classifications look sensible. *Snodgrassella*, *Candidatus Schmidhempelia*, and *Gilliamella* perdominate.
### Differential abundance analysis using ANCOM
Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples.
```bash
qiime feature-table filter-samples \
--i-table /data/16s.v4.maxOTUs.feature.table.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--p-where "day='25' AND NOT treatment='F'" \
--o-filtered-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza
```
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
```bash
qiime composition add-pseudocount \
--i-table /data/16s.v4.maxOTUs.d25AD.feature.table.qza \
--o-composition-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza
```
[ANCOM](https://www.ncbi.nlm.nih.gov/pubmed/26028277) (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.
```bash
qiime composition ancom \
--i-table /data/16s.v4.maxOTUs.d25AD.comp.table.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.tsv \
--m-metadata-column treatment \
--o-visualization /data/16s.v4.maxOTUs.d25AD.ancom.by.treatment.qzv
```
One feature is identified as significantly different: `aa8b914828a465c360b7a079334746d8`, classifed only to the genus level as *Saccharibacter* by our Bayesian classifier, but a [BLASTn search](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=CGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGTTTATACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGATACGTATTGACTAGAGTCCGAGAGAGGATTGCGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCAGTTGCGAAGGCGGCAATCTGGCTCGGAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTGTGCTGGATGTTGGGAAGCTTAGCTTTTCAGTGTCGGAGCTAACGCGTTAAGCACACCGCCTGGGGAGTA) of the sequence to NCBI has a 99% identity match to *[Saccharibacter floricola](https://en.wikipedia.org/wiki/Saccharibacter_floricola)*.

I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few.
### 16S v6-v8
I ran all the same steps as above using the `16s.v6v8` dataset too.
#### dada2 denoise optimization

This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis.
#### Rarefaction
Again, I chose sampling depths to retain all the day-25 samples in treatments A-D.
- max OTUs: 4965
- max counts: 6075
#### Classification
- max OTUs: 0.9428 median posterior support for classifications
- max counts: 0.969
The more I think about it, 16s v6-v8 filtered to maximize reads (`16s.v6.maxcts`) seems like the best dataset to proceed with.
### ANCOM by treatment
One significantly different OTU: `699d78936fbf1cc18ab30cede7baad9e` classified as *Saccharibacter floricola* with 0.756 posterior support.

### Gneiss
### Differential abundance analysis with gneiss
Another way to examine differences among treatments is the use of balance trees as implemented by the `gneiss` algorithm [Morton et al. 2017](https://msystems.asm.org/content/2/1/e00162-16). Qiime has an instructive [tutorial](https://docs.qiime2.org/2021.4/tutorials/gneiss/) on its application of gneiss.
I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples.
```bash
grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp
head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv
grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv
```
The functinality of `qiime gneiss` appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway.
Gneiss can run based on trees created by clustering features based on their own correlations (`correlation-clustering`), or in relation to some numerical factor (`gradient-clustering`).
```bash
qiime gneiss correlation-clustering \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--o-clustering /data/16s.v6.maxcts.d25AD.hierarchy.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--output-path /data/.
mv tree.nwk 16s.v6.maxcts.d25AD.hierarchy.tre
qiime gneiss ilr-hierarchical \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--o-balances /data/16s.v6.maxcts.d25AD.balances.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.balances.qza \
--output-path /data/.
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v6.maxcts.d25AD.gneiss.balances.tsv \
--to-tsv
sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.balances.tsv
qiime gneiss dendrogram-heatmap \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.hierarchy.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-metadata-column treatment \
--p-color-map seismic \
--o-visualization /data/16s.v6.maxcts.d25AD.gneiss.by.treatment.qzv
```
The *Saccharibacter floricola* feature `699d78936fbf1cc18ab30cede7baad9e`, highlighted by ANCOM, appears with 4 other features on one side of the `y2` split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them.
| feature UUID | BLAST result |
|:-------------------------------- |:--- |
| [d3286dc2d6be57a96881c9f6a43b2859](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAATTTGGCAGAGATGCCTTAGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGATTCGGTCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Citrobacter werkmanii*; *Klebsiella oxytoca*; several uncultured bacteria |
| [8fa34849d5dded61d203db8249907103](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAATTTGGCAGAGATGCCTTAGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Citrobacter freundii* strain RTC1; *Enterobacter* sp. TS4; Uncultured bacterium AMW_D6 |
| [8d625c592171b66b877523cc46ac3963](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGATTCGGTCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Klebsiella* or *Citrobacter* |
| [8026cfce27b83d70eb073a10947f3f94](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGCATATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGT) | *Klebsiella* or *Citrobacter* |

This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained *Saccharibacter* sp. and 6 unidentified Enterobacteriaceae.
I also tried the gradient clustering method.
```bash
qiime gneiss gradient-clustering \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--m-gradient-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-gradient-column dose_ppb \
--o-clustering /data/16s.v6.maxcts.d25AD.grad.clust.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--output-path /data/.
mv tree.nwk 16s.v6.maxcts.d25AD.grad.clust.tre
qiime gneiss ilr-hierarchical \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--o-balances /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza
qiime tools export \
--input-path /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.qza \
--output-path /data/.
singularity run --bind /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data /export/groups/singularity/qiime_202104.simg \
biom convert \
-i /data/feature-table.biom \
-o /data/16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv \
--to-tsv
sed -i '1d' 16s.v6.maxcts.d25AD.gneiss.grad.balances.tsv
qiime gneiss dendrogram-heatmap \
--i-table /data/16s.v6.maxcts.d25AD.feature.table.qza \
--i-tree /data/16s.v6.maxcts.d25AD.grad.clust.qza \
--m-metadata-file /data/metadata.beeSeq.microbiota.d25AD.tsv \
--m-metadata-column treatment \
--p-color-map seismic \
--o-visualization /data/16s.v6.maxcts.d25AD.gneiss.grad.by.dose.qzv
```
From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the *Saccharibacter floricola* feature `699d78936fbf1cc18ab30cede7baad9e`, highlighted by ANCOM, appears with several other features around the `y9` split. `8fa34849d5dded61d203db8249907103` from the correlation gneiss analysis also appears here.
| feature UUID | Classification or BLAST result |
|:-------------------------------- |:--- |
| 1a196db6e5bdd828fec6dd3e5dcbef1c | *Lactobacillus bombicola* |
| f1459b5093b227402d67fe3d593d4e44 | *Lactobacillus bombicola* |
| [48a57dc7a4e32a52c0ed2aac00d97c14](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GGATCGCGGCAGAGATGTCGTTTCCCTTCGGGGCCGGTTCACAGGTGGTGCATGGTCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTGTGTTGCCAGCGGGTTATGCCGGGAACTCACAAGGGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAGATCATCATGCCCCTTACGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAACGGGATGCGACATGGCGACATGGAGCGGATCCCTTAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGGCGGAGTCGCTAGTAATCGCGGATCAGCAACGCCGCGGT) | *Bifidobacterium bohemicum* |
| [595d46a49f01ca56f91b8947ff9cc272](http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GGATCGCGGCAGAGATGTCGTTTCCCTTCGGGGCCGGTTCACAGGTGGTGCATGGTCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTGTGTTGCCAGCGGGTTATGCCGGGAACTCACAAGGGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAGATCATCATGCCCCTTACGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAACGGGATGCGACATGGTGACATGGAGCGGATCCCTTAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGGCGGAGTCGCTAGTAATCGCGGATCAGCAACGCCGCGGT) | *Bifidobacterium bohemicum* |
| 8026cfce27b83d70eb073a10947f3f94 | *Klebsiella* or *Citrobacter* |
| 0eb3327ed1fcfec6377770463795cd5f | *Asaia* |

### Songbird
:::danger
BOOKMARK: Working from here
:::
https://github.com/biocore/songbird#2-using-songbird-through-qiime-2
```bash
pip install songbird
qiime dev refresh-cache
qiime songbird multinomial \
--i-table redsea.biom.qza \
--m-metadata-file data/redsea/redsea_metadata.txt \
--p-formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
--p-epochs 10000 \
--p-differential-prior 0.5 \
--p-training-column Testing \
--p-summary-interval 1 \
--o-differentials differentials.qza \
--o-regression-stats regression-stats.qza \
--o-regression-biplot regression-biplot.qza
qiime songbird summarize-single \
--i-regression-stats regression-stats.qza \
--o-visualization regression-summary.qzv
```
see [this section on interpreting model fitting](https://github.com/biocore/songbird#interpreting-model-fitting) for details on how to understand these plots.
## Gene expression
### Genome Annotation
The [current gene set annotation fot *B. impatiens*](https://www.ncbi.nlm.nih.gov/genome/?term=txid132113[orgn]) does not include GO terms, which will be very helpful in making sense of the gene expression results.
[EnTAP](https://entap.readthedocs.io/en/v0.10.8-beta/Getting_Started/introduction.html) combines annotations from protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. And it weighs evidence from these different sources to make decisions about applying annotations to gene models. You can list the target organism in EnTAP and it will leverage NCBI taxonomy. To implement EnTAP, I'll be relying on [notes from a workshop on functional genome annotation run by UConn and available on GitHub](https://github.com/CBC-UCONN/Structural-Annotation#18-functional-annotation-using-entap).
The entap_config.txt file should look like...
```
diamond_exe_path=diamond
rsem_exe_path=/isg/shared/apps/EnTAP/0.9.0-beta/libs/RSEM-1.3.0
genemarkst_exe_path=perl /isg/shared/apps/EnTAP/0.9.0-beta/libs/gmst_linux_64/gmst.pl
eggnog_sql_database=/isg/shared/apps/EnTAP/0.9.0-beta/databases/databases/eggnog.db
eggnog_dmnd_database=/isg/shared/apps/EnTAP/0.9.0-beta/databases/bin/eggnog_proteins.dmnd
interpro_exe_path=interproscan.sh
entap_database_sql_path=
entap_database_bin_path=/isg/shared/apps/EnTAP/0.9.0-beta/databases/bin/entap_database.bin
entap_graphing_script=
```
```
EnTAP --runP \
-i ../../16_gfacts/mono_o/genes.fasta.faa \
-d /isg/shared/databases/Diamond/RefSeq/complete.protein.faa.205.dmnd \
-d /isg/shared/databases/Diamond/Uniprot/uniprot_sprot.dmnd \
--ontology 0 \
--threads 16
```
:::danger
BOOKMARK: Working from here
:::
The EnTAP log file has useful info on e.g. species from which annotations were drawn, % sequences without alignment or annotations
### Full-length transcripts
Quality control was done previously using FastQC and trimming was done with [Trimmomatic](https://hackmd.io/ws-zHwGKSYicgUyU9zl-Qg?view#Trimming). There were 358,759,635 reads passing the filter criteria (`HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50`). I also did a pretty throrough [examination of xenic or contaminant sequences](https://hackmd.io/ws-zHwGKSYicgUyU9zl-Qg?view#fastq_screen) in the pool of full-length reads. I'm satisied with those results.
#### Trinity assembly
Transcript assembly is a critical step. Since the 3seq reads will be mapped to transcripts in order to calculate counts for gene expression, the quality of the transcriptome assembly is important. [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki) is the go-to transcriptome assembler. Our 2019 analysis was done with an assenbly by Trinity-v2.4.0. The current version is v2.13.1. I don't know if the new version will produce a meaningfully different assembly. But I'll try it, and see if the number of predicted transcripts and genes differs much. If so (either up or down), the new version is likely to be more reliable.
```
Trinity --seqType fq --max_memory 64G --CPU 12 \
--single filtered_reads/*.fastq \
--output trinity.out.dir.211004 > Bi.RNAseq.Trinity.run.211004.log
```
:::danger
BOOKMARK: Working from here
:::