Dave Angelini
25 August 2021
This page will document the workflow used to integrate gene expression and microbiome analysis from the Bombus impatiens experiments conducted by Dave Angelini, Will Simmons and Anne Averill.
The project has lain dormant for some time. I am taking it up again in August 2021, and will start from scratch in order to ensure the most current tools are used.
While any process is running on nscc, you can monitor it's progress indirectly by checking the CPU usage on node 26 at http://schupflab.colby.edu/mrtg/nscc/nscc-n26.load.html
The workflow described here is meant to be a complete repeat or update to those steps.
Our goal was to examine effects on gene expression and microbiome composition in the gut of the bumblebee Bombus impatiens after chronic sublethal exposure to the neonicotinoid imidacloprid. Colonies were isolated in semi-field enclosures and provided imidacloprid in pollen and sugar syrup for 25 days, which spans the developmental time of new workers. Treatments were chosen to bracket the concentrations of imidacloprid detected in the pollen and nectar of crop plants. Workers were then sampled from replicate colonies. The gut was removed and used for microbiome and gene expression analyses.
This workflow will be applied to three datasets: 16s v4-v5 MiSeq reads and reads from v6-v8. These should tell similar stories. In the end we might focus the manuscript on which ever has more reliable OTU classifications. We also have data from 18S, which in theory should reveal eukaryotic biota. However, databases for 18S classification are not as extensive and the data are confounded by overwhelming signaling from the bee itself.
We have traditional full-length transcript RNAseq reads from 4 gut samples, and 3seq (3' end tag sequence) reads from 96 samples from the last time point in the experiment. Unfortunately, the current annotation of the B. impatiens gene set is not very extensive. In particular, our analysis would benefit from GO term annotation for each gene. Therefore, this workflow will include making those annotations with EnTAP.
The entire project is being run from nscc
's node 26 at /export/groups/drangeli/beeSeq.project/
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v4-v5.
GTGYCAGCMGCCGCGGTAA
CCGYCAATTYMTTTRAGTTT
Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. Link
Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. Link
And 16S v6-v8.
ACGCGHNRAACCTTACC
ACGGGCRGTGWGTRCAA
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link
Note that the singularity container doesn't initially have access to /export/groups
, where I have all the data. Randy tells me this is for security reasons. In any case, the data folder must be mounted or "bound" within the container using the --bind
arguement. This creates a folder at the root, /data
, that contains the data in the working directoty (that is, the output of pwd
, such as /export/groups/drangeli/beeSeq.project/16s.v4v5/
, but it only exists within the container. So paths in the qiime
call refer to files using the /data/
path.
Start by running screen
. Then define an alias to run Qiime2 via singularity. The alias will stay active as long as the screen instance is maintained.
Get out of the screen (and leave it running) by typing the keys CTRL
+A
, then D
. When you're finally done using the screen, enter the command exit
.
There's another way to mount the data folder in Singularity that doesn't require it to be a named folder within the container. The argument --bind "$PWD":/export/home/drangeli
will replace the starting folder in the container with the contents of the current directory outside the container. This might be convenient, because commands wouldn't need to explicitly refer to that folder. However, I've stuck with the explicit \data\
binding because it seems safer.
For the 16S v4-v5 reads, I'm working in nscc:/export/groups/drangeli/beeSeq.project/16s.v4v5/
. All the steps will be mirrored for the v6-v8 data in ../16s.v6v8/
.
I moved the older processing files into a sub-folder summer.2018.analysis/
.
Raw reads are in the raw.reads
folder. I've previously run FastQC on all the raw data files. Nothing merited exclusion.
Import the data into Qiime2 data structures and summarize. In updating this workflow, I'm refering to my notes from 2018 and to the current Moving Pictures tutortial from the Qiime2 docs.
Download the qzv
visualization files to the local machine as follows.
Drag and drop the qzv
file into a web browser at https://view.qiime2.org/
The next step is dada2
, a tool for detecting and correcting Illumina sequence data. It also filters out reads from phiX and chimeric sequences. Here's an example call.
Then export the feature table to a more flexible format (tab-seperated values).
I ran the dada2 denoise
process under a range of truncation parameters using a bash script to produce the feature.table.qza
files and a summary tsv
file. I used the total number of features (presumptive OTUs) as the metric to choose the best parameters.
These results from dada2
are slightly different than I found in 2018. The optimal parameters identified then produce slightly different numbers of OTUs now (385 features versus 375 in the past). The input data is exactly the same, and the denoise process is determinative (not stochastic), so the discrepency must be due to updates in the dada2 denoise
algorithm. The parameters identified as optimal this time are slightly higher for the forward truncation and slightly lower for the reverse read truncation. Therefore the program is using relatively more data from the forward reads. The forward reads are higher quality (which is typical for paired-end Illumina sequences), so that seems good.
These parameters (maximizing feature and read numbers) correspond to at least 75% of reads with PHRED quality scores of 34 or better.
I'm not entirely sure whether it's best to proceed with a feature table based on maximizing OTUs or total counts. If you look at the parameter space, there's a knife-edge drop-off in both metrics. That corresponds to the truncation values that start to erode the amplicon (410 bp for the 16s v4-v5 sequence). It's interesting that OTUs are maximized right at that drop off and are favored by more forward read sequence, but more reads (counts) pass filter just inside that boundary and with slightly more balanced contribution of forward and reverse reads.
For now I'll move ahead with both files.
The "feature table" is a matrix of counts for each feature (OTU) in each sample. This can be imported into R for plotting and further analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to those IDs using the taxonomy.tsv
file which is generated later.
The commands below generate a list of the features, with their UUID, DNA sequence, and a link to BLAST search. This is useful as a record.
The feature-table summarize
command provides summary statistics and groups features on metadata category.
Generate trees for phylogenetic diversity analyses.
Determine the optimal rarefaction level by looking at the feature.table.qzv
visualizations. I chose the maximum sampling depth that would retain all day-25 samples from treatments A-D.
Qiime2’s core diversity metrics include:
Alpha diversity
Beta diversity
Naive Bayes classifiers have been built from the SILVA release 138 dataset, which was released in December 2019 and is currently the most recent. One was provided by the Qiime2 developers.
The Qiime2 docs reccommend building your classifier with sequences aligned to your amplicon. Despite many attempts I was unable to get the SILVA dataset into a format that would be accepted by Qiime. The classifier they provide works quite well however, and I think it's adequate to our needs.
The median posterior support for taxonomy assignments is 0.976.
The median posterior support for taxonomy assignments is 0.984.
Inspecting the most common OTUs, the classifications look sensible. Snodgrassella, Candidatus Schmidhempelia, and Gilliamella perdominate.
Start by filtering out the free-foraging (treatment F) samples, and just going ahead with the day 25 samples.
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.
One feature is identified as significantly different: aa8b914828a465c360b7a079334746d8
, classifed only to the genus level as Saccharibacter by our Bayesian classifier, but a BLASTn search of the sequence to NCBI has a 99% identity match to Saccharibacter floricola.
I repeated the ANCOM analysis using the "maxcts" dataset and got almost precisely the same results: This OTU is the only one highlighted by ANCOM and the counts differ by only a few.
I ran all the same steps as above using the 16s.v6v8
dataset too.
This region yeilds far more features/OTUs and reads passing filter. The v6-v8 region was also the focus of the 2019 manuscript. It may make more sense to focus on these data in the final analysis.
Again, I chose sampling depths to retain all the day-25 samples in treatments A-D.
The more I think about it, 16s v6-v8 filtered to maximize reads (16s.v6.maxcts
) seems like the best dataset to proceed with.
One significantly different OTU: 699d78936fbf1cc18ab30cede7baad9e
classified as Saccharibacter floricola with 0.756 posterior support.
Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss
algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.
I'll restrict this analysis to the day-25 data. So the metadata file will need to be filtered down to those samples.
The functinality of qiime gneiss
appers to have been reduced since earlier versions. It's no longer possible to have the plug-in output the taxa on either side of a particular "balance" or node in the tree. Although the tree and balance data can be exported. The OLS regression function also seems to have disappeared. The Qiime2 forum says this was done to make it compatible with Songbird, although Songbird has also been eliminated from the most recent versions of Qiime2. I'll try it all anyway.
Gneiss can run based on trees created by clustering features based on their own correlations (correlation-clustering
), or in relation to some numerical factor (gradient-clustering
).
The Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e
, highlighted by ANCOM, appears with 4 other features on one side of the y2
split. Most of the other features in the dataset are on the other side of that split. The additional 4 features are all Enterobacteriaceae, which the SILVA classifier failed to resolve to a lower taxonomic level. BLAST has multiple perfect matches to each of them.
feature UUID | BLAST result |
---|---|
d3286dc2d6be57a96881c9f6a43b2859 | Citrobacter werkmanii; Klebsiella oxytoca; several uncultured bacteria |
8fa34849d5dded61d203db8249907103 | Citrobacter freundii strain RTC1; Enterobacter sp. TS4; Uncultured bacterium AMW_D6 |
8d625c592171b66b877523cc46ac3963 | Klebsiella or Citrobacter |
8026cfce27b83d70eb073a10947f3f94 | Klebsiella or Citrobacter |
This result matches the group of OTUs identified by Gneiss in the previous analysis, which contained Saccharibacter sp. and 6 unidentified Enterobacteriaceae.
I also tried the gradient clustering method.
From looking at the visualization, there doesn't seem to be an obvious split, where a large block of features are associated with any of the treatment groups in a consistent way. However, the Saccharibacter floricola feature 699d78936fbf1cc18ab30cede7baad9e
, highlighted by ANCOM, appears with several other features around the y9
split. 8fa34849d5dded61d203db8249907103
from the correlation gneiss analysis also appears here.
feature UUID | Classification or BLAST result |
---|---|
1a196db6e5bdd828fec6dd3e5dcbef1c | Lactobacillus bombicola |
f1459b5093b227402d67fe3d593d4e44 | Lactobacillus bombicola |
48a57dc7a4e32a52c0ed2aac00d97c14 | Bifidobacterium bohemicum |
595d46a49f01ca56f91b8947ff9cc272 | Bifidobacterium bohemicum |
8026cfce27b83d70eb073a10947f3f94 | Klebsiella or Citrobacter |
0eb3327ed1fcfec6377770463795cd5f | Asaia |
BOOKMARK: Working from here
https://github.com/biocore/songbird#2-using-songbird-through-qiime-2
see this section on interpreting model fitting for details on how to understand these plots.
The current gene set annotation fot B. impatiens does not include GO terms, which will be very helpful in making sense of the gene expression results.
EnTAP combines annotations from protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. And it weighs evidence from these different sources to make decisions about applying annotations to gene models. You can list the target organism in EnTAP and it will leverage NCBI taxonomy. To implement EnTAP, I'll be relying on notes from a workshop on functional genome annotation run by UConn and available on GitHub.
The entap_config.txt file should look like…
BOOKMARK: Working from here
The EnTAP log file has useful info on e.g. species from which annotations were drawn, % sequences without alignment or annotations
Quality control was done previously using FastQC and trimming was done with Trimmomatic. There were 358,759,635 reads passing the filter criteria (HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50
). I also did a pretty throrough examination of xenic or contaminant sequences in the pool of full-length reads. I'm satisied with those results.
Transcript assembly is a critical step. Since the 3seq reads will be mapped to transcripts in order to calculate counts for gene expression, the quality of the transcriptome assembly is important. Trinity is the go-to transcriptome assembler. Our 2019 analysis was done with an assenbly by Trinity-v2.4.0. The current version is v2.13.1. I don't know if the new version will produce a meaningfully different assembly. But I'll try it, and see if the number of predicted transcripts and genes differs much. If so (either up or down), the new version is likely to be more reliable.
BOOKMARK: Working from here