Main page

BeeSeq

Dave Angelini
March 29, 2018

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
Bombus impatiens (photo: Dave Angelini)

Microbiome analysis

16S v4-v5

Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v4-v5.

  • 515F GTGYCAGCMGCCGCGGTAA
  • 926R CCGYCAATTYMTTTRAGTTT
  • The product is 410 bp, resulting in a 190 bp overlap.

Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. Link
Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. Link

Data processing

In order to make Qiime work more smoothly, I'll rename the raw data files to have names without any special characters like dashes or underscores.

This can be done with a few variations on

rename 1-d 1d ?1-d*
rename d0- d00n ??d0-*

Set the Qiime2 docker alias. Note that I updated to the June 2018 Qiime2 release.

alias qiime 'docker run -t -i -v /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data qiime2/core:2018.6 qiime'

Import the data and summarize.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw.reads \
  --source-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path 16s.v4.demux.qza

qiime demux summarize \
  --i-data 16s.v4.demux.qza \
  --o-visualization 16s.v4.demux.qzv

I ran the dada2 denoise process under a range of truncation parameters using a bash script to run the denoise command and produce a feature.table.qza and tsv file. I used the total number of features (presumptive OTUs) as the metric to choose.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

The parameters that produce the maximum number of features are --p-trunc-len-f 260 --p-trunc-len-r 175, corresponding to a quality score cutoff of roughly 34 as seen in the quality plot in the 16s.v4.demux.qzv file.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 16s.v4.demux.qza \
  --p-trunc-len-f 260 --p-trunc-len-r 175 \
  --p-trim-left-f 34  --p-trim-left-r 34 \
  --o-representative-sequences 16s.v4.rep.seqs.qza \
  --o-table 16s.v4.feature.table.qza \
  --o-denoising-stats 16s.v4.denoise.stats.qza \
  --p-n-threads 16

Export the feature table to a format that can be read by R.

qiime tools export  \
  16s.v4.feature.table.qza \
  --output-dir .

docker run -t -i -v /export/groups/drangeli/beeSeq.project/16s.v4v5/:/data qiime2/core:2018.6 \
  biom convert \
    -i feature-table.biom \
    -o 16s.v4.feature.table.tsv \
    --to-tsv

The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 16s.v4.taxonomy.tsv, which is generated later in this workflow.

qiime feature-table tabulate-seqs \
  --i-data 16s.v4.rep.seqs.qza \
  --o-visualization 16s.v4.rep.seqs.qzv

A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.

qiime feature-table summarize \
  --i-table 16s.v4.feature.table.qza \
  --o-visualization 16s.v4.feature.table.qzv \
  --m-sample-metadata-file metadata.beeSeq.microbiota.tsv

Diversity analysis

Generate a tree for phylogenetic diversity analyses.

qiime alignment mafft \
  --i-sequences 16s.v4.rep.seqs.qza \
  --o-alignment 16s.v4.aligned.rep.seqs.qza
qiime alignment mask \
  --i-alignment 16s.v4.aligned.rep.seqs.qza \
  --o-masked-alignment 16s.v4.masked.aligned.rep.seqs.qza
qiime phylogeny fasttree \
  --i-alignment 16s.v4.masked.aligned.rep.seqs.qza \
  --o-tree 16s.v4.unrooted.tree.qza  
qiime phylogeny midpoint-root \
  --i-tree 16s.v4.unrooted.tree.qza \
  --o-rooted-tree 16s.v4.rooted.tree.qza

Determine the optimal rarefaction level.

qiime diversity alpha-rarefaction \
  --i-table 16s.v4.feature.table.qza \
  --i-phylogeny 16s.v4.rooted.tree.qza \
  --p-max-depth 5000\
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v4.rarefaction.qzv

Optimal rarefaction looks to be about 2000.

Calculate diversity metrics. Qiime2’s core diversity metrics include:

Alpha diversity

  • Shannon’s diversity index (a quantitative measure of community richness)
  • Observed OTUs (a qualitative measure of community richness)
  • Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)
  • Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity

  • Jaccard distance (a qualitative measure of community dissimilarity)
  • Bray-Curtis distance (a quantitative measure of community dissimilarity)
  • unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
  • weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny 16s.v4.rooted.tree.qza \
  --i-table 16s.v4.feature.table.qza \
  --p-sampling-depth 2000 \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --output-dir core.diversity.metrics

Test for associations between metadata categories and values and the alpha diversity metrics.

qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/observed_otus_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/observed-otus-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/shannon-group-significance.qzv
qiime diversity alpha-correlation \
  --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/shannon-correlation.qzv
qiime diversity alpha-correlation \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-correlation.qzv

Comparisons of beta diversity by metadata categories can be made via PermANOVA.

qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment.qzv 
  
qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment_day \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment_day.qzv  

PCA with a covariate.

qiime emperor plot \
  --i-pcoa core.diversity.metrics/unweighted_unifrac_pcoa_results.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-custom-axes day \
  --o-visualization core.diversity.metrics/emperor.uu.day.qzv

OTU Classification

Naive Bayes classifiers have been built from the SILVA release 132 dataset. One was provided by the Qiime2 developers. I also experimented with building my own classifier. I had no success creating a better classifier by narrowing the sequences in the training set to the region used in our own sequencing. This approach is said to improve results, but I had abysmal classifications using that approach.

Much more successful was the construction of a classifier using the full Silva dataset. On the cownose ray project, this classifier worked as well or better than the one provided by the Qiime developers. The one possible limitation is that it uses rank-free taxonomy, and while most entries have a 7-level system, some include intermediary taxon names.

Below, I'll use both classifiers on our dataset.

Qiime's provided Silva full-length classifier

cp ../../marker.ref.DBs/silva132/silva-132-99-nb-classifier.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 16s.v4.rep.seqs.qza \
  --i-classifier silva-132-99-nb-classifier.qza \
  --o-classification 16s.v4.silva132.nb.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 16s.v4.silva132.nb.taxonomy.calls.qza \
  --o-visualization 16s.v4.silva132.nb.taxonomy.calls.qzv

qiime tools export  \
  16s.v4.silva132.nb.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 16s.v4.silva132.nb.taxonomy.calls.tsv

cut -f 3 16s.v4.silva132.nb.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.9843.

qiime taxa barplot \
  --i-table 16s.v4.feature.table.qza \
  --i-taxonomy 16s.v4.silva132.nb.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v4.silva132.nb.taxa.barplots.qzv

Our in-house-built Silva full-length classifier

cp ../../marker.ref.DBs/silva132/classifier.silva132.in.house.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 16s.v4.rep.seqs.qza \
  --i-classifier classifier.silva132.in.house.qza \
  --o-classification 16s.v4.silva132.in.house.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 16s.v4.silva132.in.house.taxonomy.calls.qza \
  --o-visualization 16s.v4.silva132.in.house.taxonomy.calls.qzv

qiime tools export  \
  16s.v4.silva132.in.house.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 16s.v4.silva132.in.house.taxonomy.calls.tsv

cut -f 3 16s.v4.silva132.in.house.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The mean posterior support for taxonomy assignments is 0.9921.

qiime taxa barplot \
  --i-table 16s.v4.feature.table.qza \
  --i-taxonomy 16s.v4.silva132.in.house.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v4.silva132.in.house.taxa.barplots.qzv

Differential abundance analysis using ANCOM

Calculate pseudocounts (counts +1) to avoid log-of-zero errors.

qiime composition add-pseudocount \
  --i-table 16s.v4.feature.table.qza \
  --o-composition-table 16s.v4.comp.table.qza

ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.

qiime composition ancom \
  --i-table 16s.v4.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.16s.v4.by.treatment.qzv

There are 23 OTUs that differ significantly in abundance among the treatment groups. I created a generic R function to plot Qiime2's ANCOM output as a series of ggplot facets.

This is interesting, but most of the significant differences appear to involve the free-foraging treatment, compared to the enclosed colonies. What might be a more useful analysis is to focus on the enclosed treatments at the last time point.

qiime feature-table filter-samples \
  --i-table 16s.v4.feature.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-where "day='25' AND NOT treatment='F'" \
  --o-filtered-table 16s.v4.d25AD.feature.table.qza

qiime composition add-pseudocount \
  --i-table 16s.v4.d25AD.feature.table.qza \
  --o-composition-table 16s.v4.d25AD.comp.table.qza

qiime composition ancom \
  --i-table 16s.v4.d25AD.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.16s.v4.d25AD.by.treatment.qzv

Data from both regions of 16S identified the same, single OTU as differentially expressed among treatments. Both classifers identify it as "uncultured Acetobacteraceae bacterium" with greater than 90% probability. It appears to show a dose-responsive decrease in abdundance.

16S v6-v8

Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v6-v8.

  • B969F ACGCGHNRAACCTTACC
  • BA1406R ACGGGCRGTGWGTRCAA
  • The product is 450 bp, resulting in a 150 bp overlap.

Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link

Data processing

Set the Qiime2 docker alias.

alias qiime 'docker run -t -i -v /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data qiime2/core:2018.6 qiime'

Import the data and summarize.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw.reads \
  --source-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path 16s.v6.demux.qza

qiime demux summarize \
  --i-data 16s.v6.demux.qza \
  --o-visualization 16s.v6.demux.qzv

Again, I tested a range of --p-trunc-len-f and --p-trunc-len-r values, using a bash script.

The optimum values to maximize feature/OTU number is --p-trunc-len-f 275 --p-trunc-len-r 185. These reads were very high quality, and these truncation values are roughly a quality cutoff of 35.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 16s.v6.demux.qza \
  --p-trunc-len-f 275 --p-trunc-len-r 185 \
  --p-trim-left-f 34  --p-trim-left-r 34 \
  --o-representative-sequences 16s.v6.rep.seqs.qza \
  --o-table 16s.v6.feature.table.qza \
  --o-denoising-stats 16s.v6.denoise.stats.qza \
  --p-n-threads 16

The feature table is worth exporting to a format that can be read by R.

qiime tools export  \
  16s.v6.feature.table.qza \
  --output-dir .

docker run -t -i -v /export/groups/drangeli/beeSeq.project/16s.v6v8/:/data qiime2/core:2018.6 \
  biom convert \
    -i feature-table.biom \
    -o 16s.v6.feature.table.tsv \
    --to-tsv

The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 16s.v6.taxonomy.tsv, which is generated later in this workflow.

qiime feature-table tabulate-seqs \
  --i-data 16s.v6.rep.seqs.qza \
  --o-visualization 16s.v6.rep.seqs.qzv

A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.

qiime feature-table summarize \
  --i-table 16s.v6.feature.table.qza \
  --o-visualization 16s.v6.feature.table.qzv \
  --m-sample-metadata-file metadata.beeSeq.microbiota.tsv

Diversity analysis

Generate a tree for phylogenetic diversity analyses.

qiime alignment mafft \
  --i-sequences 16s.v6.rep.seqs.qza \
  --o-alignment 16s.v6.aligned.rep.seqs.qza
qiime alignment mask \
  --i-alignment 16s.v6.aligned.rep.seqs.qza \
  --o-masked-alignment 16s.v6.masked.aligned.rep.seqs.qza
qiime phylogeny fasttree \
  --i-alignment 16s.v6.masked.aligned.rep.seqs.qza \
  --o-tree 16s.v6.unrooted.tree.qza  
qiime phylogeny midpoint-root \
  --i-tree 16s.v6.unrooted.tree.qza \
  --o-rooted-tree 16s.v6.rooted.tree.qza

Determine the optimal rarefaction level.

qiime diversity alpha-rarefaction \
  --i-table 16s.v6.feature.table.qza \
  --i-phylogeny 16s.v6.rooted.tree.qza \
  --p-max-depth 2000 \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v6.rarefaction.qzv

Optimal rarefaction looks to be about 1000.

Calculate diversity metrics.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny 16s.v6.rooted.tree.qza \
  --i-table 16s.v6.feature.table.qza \
  --p-sampling-depth 1000 \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --output-dir core.diversity.metrics

Test for associations between metadata categories and values and the alpha diversity metrics.

qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/observed_otus_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/observed-otus-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/shannon-group-significance.qzv
qiime diversity alpha-correlation \
      --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
      --m-metadata-file metadata.beeSeq.microbiota.tsv \
      --o-visualization core.diversity.metrics/shannon-correlation.qzv
qiime diversity alpha-correlation \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-correlation.qzv

Comparisons of beta diversity by metadata categories can be made via PermANOVA.

qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment.qzv 
  
qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment_day \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment_day.qzv 

PCA with a covariate.

qiime emperor plot \
  --i-pcoa core.diversity.metrics/unweighted_unifrac_pcoa_results.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-custom-axes day \
  --o-visualization core.diversity.metrics/emperor.uu.day.qzv

OTU Classification

Qiime's provided Silva full-length classifier

cp ../../marker.ref.DBs/silva132/silva-132-99-nb-classifier.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 16s.v6.rep.seqs.qza \
  --i-classifier silva-132-99-nb-classifier.qza \
  --o-classification 16s.v6.silva132.nb.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 16s.v6.silva132.nb.taxonomy.calls.qza \
  --o-visualization 16s.v6.silva132.nb.taxonomy.calls.qzv

qiime tools export  \
  16s.v6.silva132.nb.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 16s.v6.silva132.nb.taxonomy.calls.tsv

cut -f 3 16s.v6.silva132.nb.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.9479.

qiime taxa barplot \
  --i-table 16s.v6.feature.table.qza \
  --i-taxonomy 16s.v6.silva132.nb.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v6.silva132.nb.taxa.barplots.qzv

Our in-house-built Silva full-length classifier

cp ../../marker.ref.DBs/silva132/classifier.silva132.in.house.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 16s.v6.rep.seqs.qza \
  --i-classifier classifier.silva132.in.house.qza \
  --o-classification 16s.v6.silva132.in.house.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 16s.v6.silva132.in.house.taxonomy.calls.qza \
  --o-visualization 16s.v6.silva132.in.house.taxonomy.calls.qzv

qiime tools export  \
  16s.v6.silva132.in.house.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 16s.v6.silva132.in.house.taxonomy.calls.tsv

cut -f 3 16s.v6.silva132.in.house.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.9620.

qiime taxa barplot \
  --i-table 16s.v6.feature.table.qza \
  --i-taxonomy 16s.v6.silva132.in.house.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 16s.v6.silva132.in.house.taxa.barplots.qzv

Differential abundance analysis using ANCOM

Calculate pseudocounts (counts +1) to avoid log-of-zero errors.

qiime composition add-pseudocount \
  --i-table 16s.v6.feature.table.qza \
  --o-composition-table 16s.v6.comp.table.qza

ANCOM

qiime composition ancom \
  --i-table 16s.v6.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.16s.v6.by.treatment.qzv

ANCOM identified 30 OTUs with differential abundance among the treatments. As above, I'll filter down to just the day 25 data for the enclosed treatments.

qiime feature-table filter-samples \
  --i-table 16s.v6.feature.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-where "day='25' AND NOT treatment='F'" \
  --o-filtered-table 16s.v6.d25AD.feature.table.qza

qiime composition add-pseudocount \
  --i-table 16s.v6.d25AD.feature.table.qza \
  --o-composition-table 16s.v6.d25AD.comp.table.qza

qiime composition ancom \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.16s.v6.d25AD.by.treatment.qzv

Differential abundance analysis with gneiss

Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.

qiime gneiss correlation-clustering \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --o-clustering hierarchy.qza
  
qiime gneiss ilr-transform \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree hierarchy.qza \
  --o-balances balances.qza  

grep '^[ABCD]' metadata.beeSeq.microbiota.tsv > tmp
head -n 1 metadata.beeSeq.microbiota.tsv > metadata.beeSeq.microbiota.d25AD.tsv
grep '_day25_' tmp >> metadata.beeSeq.microbiota.d25AD.tsv

qiime gneiss ols-regression \
  --p-formula "treatment" \
  --i-table balances.qza \
  --i-tree hierarchy.qza \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --o-visualization gneiss.ols.regression.to.treatment.qzv

Based on the OLS R2, the regression using treatment as the only factor can explain about 11.2% of all community variation.

qiime gneiss dendrogram-heatmap \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree hierarchy.qza \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization gneiss.heatmap.qzv

From inspection of the qzv file, it looks like split y2 distinguishes the high dose treatment from the others.

qiime gneiss balance-taxonomy \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree hierarchy.qza \
  --i-taxonomy 16s.v6.silva132.in.house.taxonomy.calls.qza \
  --p-taxa-level 5 \
  --p-balance-name 'y2' \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --o-visualization gneiss.d25AD.y2.taxa.lvl5.qzv

The log ratio for these taxa is almost completely non-overlapping for treatments A and D, and the treatment medians show a progressive dose effect. The taxa that are reduced in the high dose treatment here are Saccharibacter sp. and 6 unidentified Enterobacteriaceae.

The gradient clustering variation tells a similar story.

qiime gneiss gradient-clustering \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --m-gradient-file metadata.beeSeq.microbiota.d25AD.tsv \
  --m-gradient-column dose_ppb \
  --o-clustering gradient-hierarchy.qza

qiime gneiss ilr-transform \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree gradient-hierarchy.qza \
  --o-balances gradient-balances.qza

qiime gneiss ols-regression \
  --p-formula "treatment" \
  --i-table gradient-balances.qza \
  --i-tree gradient-hierarchy.qza \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --o-visualization gneiss.ols.regression.gradient.to.treatment.qzv

qiime gneiss dendrogram-heatmap \
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree gradient-hierarchy.qza \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --p-color-map seismic \
  --o-visualization gneiss.gradient.heatmap.qzv

qiime gneiss balance-taxonomy \ 
  --i-table 16s.v6.d25AD.comp.table.qza \
  --i-tree gradient-hierarchy.qza \
  --i-taxonomy 16s.v6.silva132.in.house.taxonomy.calls.qza \
  --p-taxa-level 5 \
  --p-balance-name 'y9' \
  --m-metadata-file metadata.beeSeq.microbiota.d25AD.tsv \
  --m-metadata-column treatment \
  --o-visualization gneiss.d25AD.gradient.y9.taxa.lvl5.qzv

The OLS regression to treatment explains about 11.2% of variation. This analysis suggests a larger group of taxa that are reduced in the treated groups, although the effect is not quite as strong. The taxa affected include Saccharibacter, Lactobacillus, several Acetobacteraceae, Pantoea, Asaia, Curtobacterium, Sphingomonas, and others OTUs without good taxonomic resolution.

18S

Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 18S v4.

  • E572F CYGCGGTAATTCCAGCTC
  • E1009R CRAAGAYGATYAGATACCRT
  • The product is 438 bp, resulting in a 162 bp overlap.

Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link

Data processing

Set the Qiime2 docker alias.

alias qiime 'docker run -t -i -v /export/groups/drangeli/beeSeq.project/18s/:/data qiime2/core:2018.6 qiime'

Import the data and summarize.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw.reads \
  --source-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path 18s.demux.qza

qiime demux summarize \
  --i-data 18s.demux.qza \
  --o-visualization 18s.demux.qzv

The quality plot in the qzv file is used to estimate an appropriate point for trimming reads.

My initial 2017 analysis (using Qiime2 release 2017.2) classified one of the 18S OTUs as Apicystis bombi. I was concerned that I wasn't seeing that again. Several repeats with more recent releases failed to recover that taxon, and I became concerned that the filtering stage might introduce a fair be of stochasticity into the the analysis.

Therefore, I ran the dada2 denoise process under a range of truncation parameters using a bash script to run the denoise command and produce feature.table.qza and tsv files. I used the total number of features (presumptive OTUs) as the metric to choose.

The maximum number of features/OTUs appears at --p-trunc-len-f 290 --p-trunc-len-r 175. Gratifyingly, Apicystis bombi is among the taxa identified from these sequences.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 18s.demux.qza \
  --p-trunc-len-f 290 --p-trunc-len-r 175 \
  --p-trim-left-f 34  --p-trim-left-r 34 \
  --o-representative-sequences 18s.rep.seqs.qza \
  --o-table 18s.feature.table.qza \
  --o-denoising-stats 18s.denoise.stats.qza \
  --p-n-threads 16

The feature table is exported as a tsv file.

qiime tools export  \
  18s.feature.table.qza \
  --output-dir .

docker run -t -i -v /export/groups/drangeli/beeSeq.project/18s/:/data qiime2/core:2018.6 \
  biom convert \
    -i feature-table.biom \
    -o 18s.feature.table.tsv \
    --to-tsv

The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 18s.taxonomy.tsv, which is generated later in this workflow.

qiime feature-table tabulate-seqs \
  --i-data 18s.rep.seqs.qza \
  --o-visualization 18s.rep.seqs.qzv

A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.

qiime feature-table summarize \
  --i-table 18s.feature.table.qza \
  --o-visualization 18s.feature.table.qzv \
  --m-sample-metadata-file metadata.beeSeq.microbiota.tsv

Diversity analysis

Generate a tree for phylogenetic diversity analyses.

qiime alignment mafft \
  --i-sequences 18s.rep.seqs.qza \
  --o-alignment 18s.aligned.rep.seqs.qza
qiime alignment mask \
  --i-alignment 18s.aligned.rep.seqs.qza \
  --o-masked-alignment 18s.masked.aligned.rep.seqs.qza
qiime phylogeny fasttree \
  --i-alignment 18s.masked.aligned.rep.seqs.qza \
  --o-tree 18s.unrooted.tree.qza  
qiime phylogeny midpoint-root \
  --i-tree 18s.unrooted.tree.qza \
  --o-rooted-tree 18s.rooted.tree.qza

Determine the optimal rarefaction level.

qiime diversity alpha-rarefaction \
  --i-table 18s.feature.table.qza \
  --i-phylogeny 18s.rooted.tree.qza \
  --p-max-depth 750 \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 18s.rarefaction.qzv

Optimal rarefaction looks to be about 300. Calculate diversity metrics.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny 18s.rooted.tree.qza \
  --i-table 18s.feature.table.qza \
  --p-sampling-depth 300 \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --output-dir core.diversity.metrics

Test for associations between metadata categories and values and the alpha diversity metrics.

qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/observed_otus_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/observed-otus-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
  --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/shannon-group-significance.qzv
qiime diversity alpha-correlation \
      --i-alpha-diversity core.diversity.metrics/shannon_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/shannon-correlation.qzv
qiime diversity alpha-correlation \
  --i-alpha-diversity core.diversity.metrics/faith_pd_vector.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization core.diversity.metrics/faith-pd-correlation.qzv

Comparisons of beta diversity by metadata categories can be made via PermANOVA.

qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment.qzv 
  
qiime diversity beta-group-significance \
  --i-distance-matrix core.diversity.metrics/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment_day \
  --p-method permanova --p-pairwise \
  --p-permutations 9999 \
  --o-visualization core.diversity.metrics/pAOV.uu.by.treatment_day.qzv 

PCA with a covariate.

qiime emperor plot \
  --i-pcoa core.diversity.metrics/unweighted_unifrac_pcoa_results.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-custom-axes day \
  --o-visualization core.diversity.metrics/emperor.uu.day.qzv

OTU Classification

Qiime's provided Silva full-length classifier

cp ../../marker.ref.DBs/silva132/silva-132-99-nb-classifier.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 18s.rep.seqs.qza \
  --i-classifier silva-132-99-nb-classifier.qza \
  --o-classification 18s.silva132.nb.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 18s.silva132.nb.taxonomy.calls.qza \
  --o-visualization 18s.silva132.nb.taxonomy.calls.qzv

qiime tools export  \
  18s.silva132.nb.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 18s.silva132.nb.taxonomy.calls.tsv

cut -f 3 18s.silva132.nb.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.9431.

qiime taxa barplot \
  --i-table 18s.feature.table.qza \
  --i-taxonomy 18s.silva132.nb.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 18s.silva132.nb.taxa.barplots.qzv

Our in-house-built Silva full-length classifier

cp ../../marker.ref.DBs/silva132/classifier.silva132.in.house.qza .

qiime feature-classifier classify-sklearn \
  --i-reads 18s.rep.seqs.qza \
  --i-classifier classifier.silva132.in.house.qza \
  --o-classification 18s.silva132.in.house.taxonomy.calls.qza  

qiime metadata tabulate \
  --m-input-file 18s.silva132.in.house.taxonomy.calls.qza \
  --o-visualization 18s.silva132.in.house.taxonomy.calls.qzv

qiime tools export  \
  18s.silva132.in.house.taxonomy.calls.qza \
  --output-dir .
mv taxonomy.tsv 18s.silva132.in.house.taxonomy.calls.tsv

cut -f 3 18s.silva132.in.house.taxonomy.calls.tsv | sed '1d' | sort -n | awk -f ../../median.awk

The median posterior support for taxonomy assignments is 0.9854.

qiime taxa barplot \
  --i-table 18s.feature.table.qza \
  --i-taxonomy 18s.silva132.in.house.taxonomy.calls.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --o-visualization 18s.silva132.in.house.taxa.barplots.qzv

Differential abundance analysis using ANCOM

Calculate pseudocounts (counts +1) to avoid log-of-zero errors.

qiime composition add-pseudocount \
  --i-table 18s.feature.table.qza \
  --o-composition-table 18s.comp.table.qza

ANCOM

qiime composition ancom \
  --i-table 18s.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.18s.by.treatment.qzv

Nine features were significantly different in abundance among treatments. However, all of them appear to be either over-represented in the free-foraging treatment (8, mostly chloroplast sequences) or not (1 yeast).

qiime feature-table filter-samples \
  --i-table 18s.feature.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --p-where "day='25' AND NOT treatment='F'" \
  --o-filtered-table 18s.d25AD.feature.table.qza

qiime composition add-pseudocount \
  --i-table 18s.d25AD.feature.table.qza \
  --o-composition-table 18s.d25AD.comp.table.qza

qiime composition ancom \
  --i-table 18s.d25AD.comp.table.qza \
  --m-metadata-file metadata.beeSeq.microbiota.tsv \
  --m-metadata-column treatment \
  --o-visualization ancom.18s.d25AD.by.treatment.qzv

After filtering down to just the day 25 enclosed colony data, ANCOM failed to identify any 18S OTUs with significant differences in expression.

Other analyses

  • Total feature counts are comparable across treatments (shown below for 16S v4-5). For the two 16S regions, the counts are normally distributed. For 18S, the distribution is skewed, but log-normal. For most treatment groups, total reads are higher on day 0 and decline at the later two time points.

  • I filtered samples to day 0 only, with and without the free foragers. We did leave hives in place for about one week to allow them to acclimate. When the free-foragers were included 4-6 OTUs in each dataset were different in abundance, and all but one were present/absent in group F. The exceptions also showed up with group F was excluded. Treatment C appears to have had an "uncultured gamma proteobacterium" and treatment A has an "uncultered bacterium". The 16S v6 dataset identified no day-0 differences among treatments A-D.
  • Checked for time differences among treatments A, D and F, separately. A small number of OTUs increased or decreased in abundance over the span of the experiment. The two 16s regions agreed well here.
  • Also checked for differences among hives within treatments A and D at days 0 and 25, separately. One or two OTUs (across all regions) were different among hives at day 0. However, on day 25, 12-15 OTUs were different in abundance among hives within these treatments, reflecting idiosynratic trajectories of microbial community dynamics.

Conclusions from the microbiome analysis

BLAH.

Gene expression analysis

Sequenced by the UC Davis DNA Technology Core from RNA isolated in the Angelini Lab at Colby College. Bombus impatiens samples originated from commercial hives raised at the UMass Cranberry Research Station. All samples were poly-A selected.

Classic RNAseq - 90-bp single-end reads on one Illumina HiSeq lane with 1,487,663,804 reads from 4 samples:

  • A1-d25-2_S161_L006_R1_001.fastq
  • B1-d25-4_S162_L006_R1_001.fastq
  • B2-d25-5_S163_L006_R1_001.fastq
  • C4-d25-3_S164_L006_R1_001.fastq

3'-end seq - 90-bp single-end reads from the 3' UTR.

  • 96 samples

These data were processed on Colby's high-preformance computing cluster nscc.

Quality control

FastQC

Check out one or more of the 3seq files.

fastqc A2-d25-1_S142_L005_R1_001.fastq --outdir=fastqc_raw &

Here are the classic RNAseq runs.

fastqc A1-d25-2_S161_L006_R1_001.fastq --outdir=fastqc_raw &
fastqc B1-d25-4_S162_L006_R1_001.fastq --outdir=fastqc_raw &
fastqc B2-d25-5_S163_L006_R1_001.fastq --outdir=fastqc_raw &
fastqc C4-d25-3_S164_L006_R1_001.fastq --outdir=fastqc_raw &

Trimming

Trimmomatic will trim sequences based on the read quality. I'll start with the classic RNAseq read files.

java -jar /export/local/src/Trimmomatic-0.33/trimmomatic-0.33.jar SE  /export/groups/drangeli/beeSeq.project/raw.data/rna.reads/A1-d25-2_S161_L006_R1_001.fastq /export/groups/drangeli/beeSeq.project/filtered.reads/rnaseq/A1-d25-2_S161_L006_R1_001.trimmed.fastq ILLUMINACLIP:/export/local/src/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50 &
java -jar /export/local/src/Trimmomatic-0.33/trimmomatic-0.33.jar SE /export/groups/drangeli/beeSeq.project/raw.data/rna.reads/B1-d25-4_S162_L006_R1_001.fastq /export/groups/drangeli/beeSeq.project/filtered.reads/rnaseq/B1-d25-4_S162_L006_R1_001.trimmed.fastq ILLUMINACLIP:/export/local/src/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50 &
java -jar /export/local/src/Trimmomatic-0.33/trimmomatic-0.33.jar SE /export/groups/drangeli/beeSeq.project/raw.data/rna.reads/B2-d25-5_S163_L006_R1_001.fastq /export/groups/drangeli/beeSeq.project/filtered.reads/rnaseq/B2-d25-5_S163_L006_R1_001.trimmed.fastq ILLUMINACLIP:/export/local/src/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50 &
java -jar /export/local/src/Trimmomatic-0.33/trimmomatic-0.33.jar SE /export/groups/drangeli/beeSeq.project/raw.data/rna.reads/C4-d25-3_S164_L006_R1_001.fastq /export/groups/drangeli/beeSeq.project/filtered.reads/rnaseq/C4-d25-3_S164_L006_R1_001.trimmed.fastq ILLUMINACLIP:/export/local/src/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:15 MINLEN:50 &

This will get old fast. So I created a shell script trimmomatic.sh to run Trimmomatic on all the remaining read files, using the same parameters.

chmod a+x trimmomatic.sh
./trimmomatic.sh

Unzip the files to get line counts.

wc -l * > linecounts.txt &

Dividing the line count total by 4 gives the total number of reads passing the filter criteria: 358,759,635.

Next, I moved the filtered read files to a separate folder.

fastq_screen

Another concern for RNAseq data is contamination during sample prep or sequencing or from parasites or commensal organisms living in and on the bee. fastq_screen is a tool to quickly compare short reads to several reference genomes. It relies on previously indexed genomes, using an aligner such as bowtie2. Once a genome is downloaded from NCBI in fasta format, an index can be build as follows.

bowtie2-build --threads 30 -f GCF_000188095.2_BIMP_2.1_genomic.fna Bimp_2.1_genomic

The fastq_screen.conf file then directs the program to the aligner (bowtie2 in this case) and the databases used for comparisons. I included the following genomes.

Once the paths to these bowtie2 indices are in the config file, a call to fastq_screen is easy.

~/fastq_screen/fastq_screen *.fastq

The mean percentage of reads that could not be mapped to the B. impatiens genome was 18%.

grep 'Bimp' *.txt | cut -d '\t' -f 4 | awk '{sum+=$1} END {print sum/NR}'

sample fastq_screen results

The results show no evidence of contamination from humans, other lab organisms or molecular reagents. Among the commensals and pathogens screened, only sequences aligning to the mite are present. Varroa destructor is the only mite genome is our dataset, but it's defintely not this species, which is large and obvious externally. It's also unlikely to be evidence of the bumblebee mite, Parasitellus fucorum, also an obvious external mite. There's no genome currently available for tracheal mites, such as Locustacarus buchneri or Acarapis woodi. These species are a possibility, although the bees seemed to be in good health when collected. Other small commensal mites may explain these results. The number of reads mapping to the mite dataset ranges from 0.5% to 10.4% of all processed reads for samples and has a mean of 4.1% across all 3-seq libraries.

frequency of mite reads in 3seq samples

Overall, contamination doesn't seem to be a serious problem. However, it may be worth using BLAST to check for mite sequences after assembly.

Transcriptome assembly

Trinity

Before using Trinity to assemble the classic RNA seq reads, combine them into one file. (Later reconsideration: this isn't necessary with the current version of Trinity. It can take multiple input files.)

cat A1-d25-2_S161_L006_R1_001.trimmed.fastq > Bi.rna.se.combined.fq; \
cat B1-d25-4_S162_L006_R1_001.trimmed.fastq >> Bi.rna.se.combined.fq; \
cat B2-d25-5_S163_L006_R1_001.trimmed.fastq >> Bi.rna.se.combined.fq; \
cat C4-d25-3_S164_L006_R1_001.trimmed.fastq >> Bi.rna.se.combined.fq &

Trinity was run on nscc via a screen on node 26. I gave the run 30 processors and 50Gb of memory. The output was captured as a log file.

Trinity --seqType fq --max_memory 50G --single Bi.rna.se.combined.fq --CPU 30 > Bi.RNAseq.Trinity.run.log

Trinity includes a perl script that calculates some basic stats from the assembly.

$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta > trinity.stats.out
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  97398
Total trinity transcripts:      158600
Percent GC: 36.80

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 9490
        Contig N20: 7190
        Contig N30: 5802
        Contig N40: 4743
        Contig N50: 3867

        Median contig length: 505
        Average contig: 1520.04
        Total assembled bases: 241077604


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 6839
        Contig N20: 4579
        Contig N30: 3173
        Contig N40: 2144
        Contig N50: 1345

        Median contig length: 333
        Average contig: 712.98
        Total assembled bases: 69443215

Rename the default output fasta file to something more descriptive.

mv Trinity.fasta Bi.trinity.fa

TransDecoder

TransDecoder will identity long open reading frames from putative protein-coding mRNAs.

From trinity_out_dir/

TransDecoder.LongOrfs -t Bi.trinity.fa
ls -lh 

This creates a folder Bi.trinity.fa.transdecoder_dir/ with several output files. The most useful of which is longest_orfs.cds, which contains the nucleotide sequences from the coding regions (CDS) of each transcript, and longest_orfs.pep, which has the amino acid translation of those CDS sequences.

Creating a BLAST database from the transcriptome

One useful part of annotation is using the transcriptome itself as a BLAST database. The makeblastdb command could be a little finicky, so using complete paths may be helpful.

mkdir /export/groups/drangeli/beeSeq.project/Bimp_gut_CDS_BLASTdb
cd /export/groups/drangeli/beeSeq.project/Bimp_gut_CDS_BLASTdb
/export/local/bin/makeblastdb \
  -in /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.fa.transdecoder_dir/longest_orfs.cds \
  -out Bimp_gut_CDS_BLASTdb \
  -parse_seqids -dbtype nucl \
  -title "Bombus impatiens gut transcriptome CDS"

Indexing the transcriptome

For what it's worth, I'll also make a hisat2 index based on the Trinity assembly. This will allow the 3seq reads to be mapped to the transcriptome.

cd /export/groups/drangeli/beeSeq.project/rna.classic/hisat2.index/
hisat2-build -p 30 -f /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.fa Bimp_gut

BUSCO analysis

BUSCO is a useful, fast utility to compare a transcriptome to a curated reference of taxonomically restricted sequences. The accronym stands for "Benchmarking Universal Single-Copy Orthologs."

First, download the reference databases

cd /research/drangeli/DB/busco.lineages
curl -L -O "http://busco.ezlab.org/datasets/endopterygota_odb9.tar.gz"
curl -L -O "http://busco.ezlab.org/datasets/hymenoptera_odb9.tar.gz"

Uncompress the archives.

I chose to run the BUSCO python script with the Hymenoptera database, but similar results were obtained with the broader endopterygote dataset.

python /export/local/src/busco-2.0.1/BUSCO.py \
  -i /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.fa.transdecoder_dir/longest_orfs.cds \
  -o BUSCO.report \
  -l /research/drangeli/DB/busco.lineages/hymenoptera_odb9 \
  -m tran -c 16

This ran in 2.8 hours on node 26 using 16 CPUs.

C:84.9%[S:24.8%,D:60.1%],F:8.4%,M:6.7%,n:4415
    3751    Complete BUSCOs (C)
    1096    Complete and single-copy BUSCOs (S)
    2655    Complete and duplicated BUSCOs (D)
     372     Fragmented BUSCOs (F)
     292     Missing BUSCOs (M)
    4415    Total BUSCO groups searched

A note on transcript isoforms

Trinity's assembly algorithm allows for the identification of isoforms created through alternative splicing. Since we're not interested in transcript isoforms, it's tempting to pull the longest isoform for each component (loosely this can be thought of as a "gene"). This can be done with a python script provided with Trinity.

$TRINITY_HOME/util/misc/get_longest_isoform_seq_per_trinity_gene.pl Bi.trinity.fa > Bi.trinity.longest.isoforms.fa

However, as the script warns "the longest isoform isn't always best". I tried this, re-ran Transdecoder to define coding regions and then ran BUSCO on the resulting file. The number of conserved orthologs that could be identified fell drammatically, from 85% in the initial Trinity output to just 58% for longest isoforms. I'm not sure of the reason for this difference, but it suggests that a better approach will be to map reads to the full Trinity assembly and then combine counts by component/gene later.

Mapping transcripts to the genome

Download the genome

It's great to a reference genome! The files for B. impatiens can be downloaded directly from NCBI. The calls below are to version 2.1 of the B. impatiens genome, released on March 20, 2018. (My initial analyses used version 2.0, where the number of mapped transcripts was almost exactly the same just 4 more.)

curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/README.txt"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_assembly_report.txt"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_assembly_stats.txt"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_feature_table.txt.gz"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_genomic.fna.gz"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_genomic.gff.gz"
curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.2_BIMP_2.1/GCF_000188095.2_BIMP_2.1_rna.fna.gz"
gunzip *.gz

GMAP

Read mapping can be done using GMAP, which is available on nscc.

First, create a GMAP index.

cd /export/groups/drangeli/beeSeq.project/genome
gmap_build -d gmap_BIMP_2.1 -D . -k 13 GCF_000188095.2_BIMP_2.1_genomic.fna

Next, align the assembled CDS's to the genome, outputting in SAM format. This runs GMAP using 8 threads.

cd /export/groups/drangeli/beeSeq.project/rna.classic/
/usr/local/src/gmap-2018-01-26/src/gmap \
  -n 0 -D . -d ../genome/gmap_BIMP_2.1 \
  trinity_out_dir/Bi.trinity.fa.transdecoder_dir/longest_orfs.cds \
  -f samse -t 8 > trinity.gmap_BIMP_2.1.sam

After the lengthy @SQ header, the SAM file format consists of tab-separated lines listing the mapped location of Trinity CDS's. The important columns include

column description
1 Trinity transcript ID
2 strand (0 = forward, 16 = reverse, 4 = unmapped)
3 genomic scaffold ID the transcript was mapped to
4 position (left-side) of the transcript's alignment
10 DNA sequence

For example:

TRINITY_DN59_c0_g1_i1|m.3       16      NT_176837.1     3522516 40      300M    *       0       0       TCAATACAAAAAATAATC...

So, how many transcripts are mapped?

grep 'TRINITY_DN' trinity.gmap_BIMP_2.1.sam | wc -l
131550

How many genes does this likely reflect?

grep 'TRINITY_DN' trinity.gmap_BIMP_2.1.sam | sed 's/_i/X/' | cut -d "X" -f 1 | sort -u | wc -l
16853

Convert the SAM file to BAM (binary sam) format using SAMtools.

samtools view -Sb trinity.gmap_BIMP_2.1.sam > trinity.gmap_BIMP_2.1.bam

Most downstream applications will require that the BAM file be sorted by genomic coordinates.

samtools sort trinity.gmap_BIMP_2.1.bam trinity.gmap_BIMP_2.1

Now index the BAM file to enable rapid navigation.

samtools index trinity.gmap_BIMP_2.1.bam

Munging GMAP results for R

Next, I'll use an R script to associate the mapped transcripts with the genome's annotations. I'll start by extracting just the positional information and sequence from the SAM file.

grep "TRINITY" trinity.gmap_BIMP_2.1.sam | cut -f-1,2,3,4,10 > trinity.gmap_BIMP_2.1.positions.tsv

Mapping 3seq reads

The 3seq reads can be mapped in two phases: First to the genome, usng HISAT2 and StringTie. Second to the de novo gut transcriptome assembly produced in this project, using Kallisto.

Mapping to the reference genome

HISAT2

HISAT2 is optimized for this sort of short read mapping to a reference genome. It requires one or more fasta files as input. These should include the entire reference genome (or transcriptome).

cd /export/groups/drangeli/beeSeq.project/genome/hisat2.index/
hisat2-build -p 30 -f /export/groups/drangeli/beeSeq.project/genome/GCF_000188095.2_BIMP_2.1_genomic.fna Bimp2_1

Next, you map reads for each filtered 3seq fastq file. I created a shell script to run hisat2 for each file.

chmod a+x hisat2.for.3seq.sh
./hisat2.for.3seq.sh

Here's an example line:

/export/local/src/hisat2-2.0.5/hisat2 \
  -p 30 --dta -q -t \
  --un A1-d25-1_S141.unmapped.fq \
  --rg-id A1-d25-1_S141 \
  -x /export/groups/drangeli/beeSeq.project/genome/hisat2.index/Bimp2_1 \
  -U ../filtered_reads/A1-d25-1_S141.trimmed.fastq \
  -S A1-d25-1_S141.mapped.sam

Move the unmapped read files into a separate folder.

mkdir unmapped
mv *.unmapped.fq unmapped

Another script converts to the BAM format and sorts each file using 30 processors.

chmod a+x sam.sort.sh
./sam.sort.sh

Example line:

samtools view -Su A1-d25-1_S141.mapped.sam | \
  samtools sort - -@30 A1-d25-1_S141.sorted

To confirm that the BAM file is in fact sorted, look for SO:coordinate in the first line. Then, remove the SAM files to save space. They take up about 8 times more drive space.

samtools view -H A1-d25-1_S141.sorted.bam | head
rm *.mapped.sam

StringTie

StringTie is used to calculate abundances for mapped reads. If needed download the complied Linux x86_64 binary.

curl -L -O "http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.3b.Linux_x86_64.tar.gz"
tar xvfz stringtie-1.3.3b.Linux_x86_64.tar.gz

StringTie views reference features that aren't completely covered by reads as novel, and therefore it will not apply the annotation information. This is a problem for 3seq data, which by its nature will never cover an entire mRNA annotation. For this reason, it's probably best to run it initially without the -e option (which excludes the possibility of transcripts not present in the reference), then take the resulting GTF file and merge it with the reference. Then re-run StringTie with the merged GTF reference to obtain relative abundances. This workflow is recommended by the StringTie documentation. Again, since these data are all from 3' ends, it's likely that none of them will provide complete transcripts, but it's worth doing this to produce a single consensus GTF file.

I have a script to run this for each sample.

chmod a+x stringtie.initial.sh
./stringtie.initial.sh

Example line:

/export/groups/drangeli/bin/stringtie-1.3.3b.Linux_x86_64/stringtie \
  A1-d25-1_S141.sorted.bam \
  -o A1-d25-1_S141.gtf \
  -p 30 \
  -G /export/groups/drangeli/beeSeq.project/genome/GCF_000188095.2_BIMP_2.1_genomic.gff

Now merge the individual GTF files. First, create text file listing all the .gtf file names.

ls -1 *.gtf > gtf.filenames.txt
/export/groups/drangeli/bin/stringtie-1.3.3b.Linux_x86_64/stringtie --merge \
  -G /export/groups/drangeli/beeSeq.project/genome/GCF_000188095.2_BIMP_2.1_genomic.gff \
  -p 30 -o merged.3seq.gtf gtf.filenames.txt

Next repeat the StringTie run, using the merged GTF file as the reference. This will also output tab-delimited abundance files and input files for Ballgown - which we won't use.

chmod a+x stringtie.abundance.sh
./stringtie.abundance.sh

Example line:

/export/groups/drangeli/bin/stringtie-1.3.3b.Linux_x86_64/stringtie \
  A1-d25-1_S141.sorted.bam \
  -o A1-d25-1_S141.gtf \
  -p 30 -G merged.3seq.gtf \
  -A abund.A1-d25-1_S141.tab -b . -e

The authors of StringTie provide a python script for conversion of the Ballgown data into a format for use with R's DEseq2 package.

curl -L -O "http://ccb.jhu.edu/software/stringtie/dl/prepDE.py"

Create a sample list file.

sed 's/\.gtf//' gtf.filenames.txt > tmp1
sed 's/^/\.\//' gtf.filenames.txt > tmp2
paste tmp1 tmp2 > sample_lst.txt
rm tmp?

Run the python script.

python ./prepDE.py -i sample_lst.txt -l 78

That will create two count tables for genes and transcripts (gene_count_matrix.csv and transcript_count_matrix.csv). These files can be imported into R for analysis and associated with genomic features at their indicated positions (see the next section).

Associating ID numbers

What's needed at this point is the annotation of the gene/transcript counts. StringTie is conservative. Without reads spanning all of a feature, the software won't use the annotation information from the reference genome. However, the file merged.3seq.gtf includes the mapped genome locations for each MSTRG identification number. The problem is to locate these positions in the genome annotation file ../genome/GCF_000188095.2_BIMP_2.1_genomic.gff and associate the MSTRG numbers to their annotations. This is best accomplished using R. But first, bash can be used to make the read-map and genomic annotation data more tidy.

Start by extracting just transcript lines from the merged read-map GTF file.

grep 'transcript\s' merged.3seq.gtf \
  > merged.3seq.gtf.transcript.lines.tsv

Next, extract just mRNA lines from the scaffold gff3 file.

grep "mRNA\s" ../../genome/GCF_000188095.2_BIMP_2.1_genomic.gff \
  > BIMP_2.1.gff.mRNA.lines.tsv

The read maps and genomic annotations are associated using the R script collating.MSTRG.to.Bimp.gene.annotations.R. The product of this script is a tab-delimited file annotated.srtingtie.MSTRG.genes.tsv which lists genomic scaffold positions for predicted mRNAs with the StringTie gene IDs, as well as applying the GO terms and UniProt IDs based on GenBank IDs resulting from BLAST searches with the assembled transcript sequences.

head -n 1 annotated.srtingtie.MSTRG.genes.tsv
wc -l annotated.srtingtie.MSTRG.genes.tsv
cut -d "\t" -f 7 annotated.srtingtie.MSTRG.genes.tsv | grep -v -c 'NA'
cut -d "\t" -f 13 annotated.srtingtie.MSTRG.genes.tsv | grep -v -c 'NA'

The resulting table has 30,186 entries for putative mRNAs in the reference genome that are associated with our transcriptome. Of these, 22,551 are annotated with a GenBank ID and 8,844 have GO annotations.

Mapping to the gut transcriptome

Kallisto

Kallisto will quantify the abundance of transcripts from RNA-Seq data using a pseudoalignment method. First, create a Kallisto index from the Trinity assembled transcripts. I intentionally do this using the complete assembly, not the CDS files, because of course if there are fragmentary or non-coding RNAs that were sequenced, then some reads will correspond to those RNAs too.

cd /export/groups/drangeli/beeSeq.project/rna.3seq/
alias kallisto '~/bin/kallisto_linux-v0.43.1/kallisto'
kallisto index --index=Bi.gut.tx.kallisto.index /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.fa

In order to run Kallisto on single-end sequence, it wants to know the mean and standard deviation of read lengths. First, use bash to find the total number of filtered reads.

cd /export/groups/drangeli/beeSeq.project/rna.3seq/filtered_reads/
cat *.trimmed.fastq | sed -n -e '2~4p' | \
  awk '{ print length($0); }' > read.lengths.txt
wc -l read.lengths.txt

That's 350,566,162. Next, calculate the mean and SD.

cat read.lengths.txt | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} END {for (i=1;i<=NF;i++) { printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)} }'
  • mean = 77.495209 bp
  • SD = 3.039139 bp

Next, I wrote a bash script to run kallisto quant for each sample and to rename the output files with the sample names.

chmod a+x Bi.3seq.kallisto.mapping.to.Trin.gut.tx.sh
./Bi.3seq.kallisto.mapping.to.Trin.gut.tx.sh

Example lines:

~/bin/kallisto_linux-v0.43.1/kallisto quant --index=/export/groups/drangeli/beeSeq.project/rna.3seq/Bi.gut.tx.kallisto.index --output-dir=/export/groups/drangeli/beeSeq.project/rna.3seq/kallisto.filtered.reads.vs.gut.tx/ --plaintext --single --single-overhang -l 77.495209 -s 3.039139 --threads=16 A1-d25-1_S141.trimmed.fastq
mv /export/groups/drangeli/beeSeq.project/rna.3seq/kallisto.filtered.reads.vs.gut.tx/abundance.tsv /export/groups/drangeli/beeSeq.project/rna.3seq/kallisto.filtered.reads.vs.gut.tx/A1-d25-1_S141.abundance.tsv
mv /export/groups/drangeli/beeSeq.project/rna.3seq/kallisto.filtered.reads.vs.gut.tx/run_info.json /export/groups/drangeli/beeSeq.project/rna.3seq/kallisto.filtered.reads.vs.gut.tx/A1-d25-1_S141.run_info.json

The result is a folder kallisto.filtered.reads.vs.gut.tx that contains .abundance.tsv files for each sample, listing the counts for each transcript assembled by Trinity. The easiest way to combine these data into a single matrix and to merge isoforms into counts for each gene is by using R. The R script merging.kallisto.results.R accomplishes this, and also removes counts from contaminant transcript sequences and saves them separately. The script's output is two new tables:

  • Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv
  • contaminant.Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv

One potential problem with these files, during later use in R, is the presence of pipe characters in the Trinity gene IDs. R can misinterpret | as the OR logical operator. The pipes can be removed using bash.

sed 's/|/_/' Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv > tmp
cat tmp > Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv; rm tmp

Annotation

BLAST to a custom SwissProt dataset

I pulled curated SwissProt entries from Drosophila melanogaster and any Hymenoptera. Direct link here.

Next, I created a modified version of the FASTA file, with sequence labels that will behave well in BLAST, with pipe delimiters between the IDs and replacing space characters with underscores in human-readable gene names.

sed 's/>sp|/>/' uniprot.Dmel.Hymenoptera.fa | sed 's/>tr|/>/' | sed 's/ OS=/|/' | sed 's/ GN=.*//' | sed 's/ PE=.*//' | sed -e 's/\s/|/' | sed -e 's/\s/_/g' | sed -e 's/\,//g' > uniprot.Dmel.Hymenoptera.custom.labels.fa

Make the BLAST database

/export/local/bin/makeblastdb \
  -in /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.fa.transdecoder_dir/longest_orfs.cds \
  -out Bimp_gut_CDS_BLASTdb -parse_seqids -dbtype nucl \
  -title "Bombus impatiens gut transcriptome CDS"

tBLASTn

Now use the FASTA file for multiple local blast searches to our B. impatiens gut transcriptome.

cd /export/groups/drangeli/beeSeq.project/rna.classic/
tblastn -query /research/drangeli/BLAST_DBs/SwissProt_batch_queries/uniprot.Dmel.Hymenoptera.custom.labels.fa \
  -db /export/groups/drangeli/beeSeq.project/Bimp_gut_CDS_BLASTdb/Bimp_gut_CDS_BLASTdb \
  -evalue 1e-10 -max_target_seqs 10 -max_hsps 1 -num_threads 16 \
  -outfmt "6 qseqid sseqid pident qlen length mismatch evalue bitscore" \
  > sp.tblastn.Bimp.gut.tx.tsv

cut -d '\t' -f 5 sp.tblastn.Bimp.gut.tx.tsv | sed 's/_i/X/' | cut -d 'X' -f 1 | sort -u | wc -l

This provided annotations for 3250 genes.

Contaminant filtering

It's entirely possible that the mRNA sequences in the transcriptome or 3seq data originate with xenobiotic sources, such as bacteria, fungi or parasites, ingested pollen, or contaminating organisms from the bee's surface, the lab, or us humans. To identify and remove any potential contaminant sequences I'll BLAST the transcripts to a database created from mRNA sequences of these taxa in GenBank. I also included higher taxa for any organisms that appeared in the preliminary 18S microbiome identifications. The list of taxa included the following:

Bradyrhizobium, Homo sapiens, Pinus, Actinobacteria, Bacteroidetes, Burkholderiales, Burkholderiales, Clostridiales, Comamonadaceae, Conoidasida, Conopidae, Crenarchaeota, Dinoflagellata, Enterobacteriaceae, Enterobacteriales, Enterococcaceae, Escherichia, Euryarchaeota, Firmicutes, Flavobacteriales, Lactobacillales, Magnoliophyte (from pollen, plastid and mitochondrion only), Mesostigmata, Microsporidia, Nematoda, Proteobacteria, Rhodospirillales, Saccharomycetes, Taphrinomycetes, Trombidiformes, Trypanosoma, Trypanosomatidae, Zygomycota

NCBI doesn't have a great bash API for GenBank downloads of this sort. So I had to manually download these FASTA files, then upload them to nscc:/export/groups/drangeli/beeSeq.project/rna.classic/potential.contaminants

The taxon-specific files may be useful for other projects, but I'll combine these as a single file for use now.

cd  /export/groups/drangeli/beeSeq.project/rna.classic/potential.contaminants
cat *.mRNAs.fa > potential.Bombus.mRNA.contaminants.fa

I'll also add in a version of the B. impatiens reference genome RNA dataset. However, I'll mark the GenBank IDs numbers to make them easy to identify and remove from the output.

sed 's/^>/>BOMBUSIMPATIENS/' ../../genome/GCF_000188095.2_BIMP_2.1_rna.fna > Bimp.mRNAs.fa
cat Bimp.mRNAs.fa >> potential.Bombus.mRNA.contaminants.fa

Remove duplicate sequences

First I'll use FASTX-ToolKit to convert the FASTA file to tabular format. (Beware that this removes the "greater than" sign, >, that begins an ID line in the standard FASTA format.

fasta_formatter -i potential.Bombus.mRNA.contaminants.fa -o tmp1 -w 0 -t

Remove non-alphanumeric characters from the descriptions, and take this opportunity to exclude entries where the ID + sequence is less than 100 characters in length.

cat tmp1 | sed 's/[(),:;]//g' | sed -r '/^.{,100}$/d' > tmp2

Delimit the ID from the rest of the description.

sed 's/\s/\t/' tmp2 > tmp3

Sort the entries by the ID column (the first) and retain only those with unique IDs.

sort -t\t -k 1 -u tmp3 > tmp4

For some reason, there are some sequences that makeblastdb considers to be duplicates. These are complete and partial mRNAs from Neospora caninum Liverpool. Note the line numbers and delete them using sed.

cat tmp4 | grep -n 'Neospora caninum Liverpool' | grep 'partial mRNA'
sed -e '1374608d' -e '1374668d' -e '1376607d' -e '1377671d' -e '1378198d' -e '1380081d' tmp4 > tmp5

Next, I'll use bash commands to convert it back into standard FASTA format. Then clean-up the tmp files.

cat tmp5 | sed -e 's/\t/\ /' -e 's/^/>/' -e 's/\t/\n/' > potential.Bombus.mRNA.contaminants.fa
rm tmp?

This yeilds 1,921,753 unique sequences, based on ID, of which 21,241 are from B. impatiens.

Make the contaminant database

/export/local/bin/makeblastdb \
  -in /export/groups/drangeli/beeSeq.project/rna.classic/potential.contaminants/potential.Bombus.mRNA.contaminants.fa \
  -out potential_contaminant_mRNA_BLASTdb -parse_seqids -dbtype nucl \
  -title "Potential Bombus gut mRNA contaminants"

Contaminant BLASTn

Search the database of potential contaminant mRNA sequences using the B. impatiens gut CDS dataset. The command below uses 16 processors and outputs in the same format as the annotation tBLASTn, except that it keeps only the single best hit. The e-value filter is not set at 1e-10, no more stringently than for normal annotations.

blastn -query /export/groups/drangeli/beeSeq.project/rna.classic/trinity_out_dir/Bi.trinity.longest_orfs.cds \
  -db /export/groups/drangeli/beeSeq.project/rna.classic/potential.contaminants/potential_contaminant_mRNA_BLASTdb \
  -evalue 1e-10 -max_target_seqs 1 -max_hsps 1 -num_threads 16 \
  -outfmt "6 qseqid sseqid pident qlen length mismatch evalue bitscore" \
  > Bimp.gut.cds.blastn.potential.contaminants.tsv

Remove lines with hits to the B. impatiens genome.

sed -i '/BOMBUSIMPATIENS/d' Bimp.gut.cds.blastn.potential.contaminants.tsv

Extract the GenBank IDs from this file.

cat Bimp.gut.cds.blastn.potential.contaminants.tsv | \
  cut -f 2 | sed -e 's/^ref//' -e 's/^emb//' -e 's/^gb//' -e 's/^dbj//' -e 's/|//g' \
  > contaminant.GenBank.ids.txt

Reduce this to a list of unique entries.

sort -u contaminant.GenBank.ids.txt > contaminant.unique.GenBank.ids.txt

Only 355 of the 130,267 transcript CDS's (0.2725%) have significant hits to the contaminant dataset. I used GenBank's batch download feature to get full GenBank entries as contaminants.gb . For convenience, extract the gene descriptions and organism names.

cat contaminants.gb | grep 'DEFINITION\|ORGANISM' | \
  tr -d "\n\r" | \
  sed 's/DEFINITION  /\n/g' | \
  sed 's/  ORGANISM  /\t/g' | \
  sed 's/[().,:;]//g' | \
  sed -e '1d' > contaminant.short.list.tsv
paste -d "\t" contaminant.unique.GenBank.ids.txt contaminant.short.list.tsv > tmp
mv tmp contaminant.short.list.tsv

I then manually collated Bimp.gut.cds.blastn.potential.contaminants.tsv with the gene and organism data from contaminant.short.list.tsv into a file called Bimp.gut.cds.contaminant.info.tsv. I made reciprocal searchs of the best hit isoform for each gene, online using BLASTx against the non-redundant (NR) database. This revealed 157 transcripts that actually did have a strong hit to B. impatiens. These were removed from the contaminant list. What remained was a list of 184 likely contaminant transcripts, Bimp.gut.cds.contaminant.curated.info.tsv.

A list of the species sources of the contaminant sequences

sed '1d' Bimp.gut.cds.contaminant.curated.info.tsv | cut -d '\t' -f 4 | sed -e 's/\[//' -e 's/\]//' -e 's/TSA //' -e 's/PREDICTED //' -e '/Sequence/d' | cut -d ' ' -f 1,2 | sort -u
  • Ascaris suum
  • Caenorhabditis remanei
  • Candida albicans
  • Candida auris
  • Candida tanzawaensis
  • Clavispora lusitaniae
  • Deladenus siricidicola
  • Dirofilaria immitis
  • Kluyveromyces marxianus
  • Lactobacillus viridescens
  • Lingulodinium polyedrum
  • Metschnikowia bicuspidata
  • Meyerozyma guilliermondii
  • Pratylenchus thornei
  • Torenia fournieri
  • Varroa destructor

Most of the positive hits are to yeast and nematodes. It is likely that these are not the actual source species, but the closest matches available in this dataset.

cut -d '\t' -f 1 Bimp.gut.cds.contaminant.curated.info.tsv | grep 'TRIN' | sed 's/_i[123456790]//' | sort -u > Bimp.gut.cds.contaminant.curated.ids.txt
wc -l Bimp.gut.cds.contaminant.curated.ids.txt

These CDS come from 96 separate genes that can be removed prior to differential expression analysis. Amazingly, this small number of genes accounts for 4.09% of mapped 3seq reads. Most are from genes known to be highly expressed, such as ribosomal proteins and enzymes of cellular respiration.

Munging BLAST annotations for R

R's string handling functions don't deal well with the pipe character. So, we can replace it with the tab character via sed. First, create a back-up copy.

cp sp.tblastn.Bimp.gut.tx.tsv backup.sp.tblastn.Bimp.gut.tx.tsv
sed 's/|/\t/g' sp.tblastn.Bimp.gut.tx.tsv | \
  sed -r 's/\tc(\w*)_g(\w*)_i/_c\1_g\2\ti/' > tmp.tsv
cat tmp.tsv > sp.tblastn.Bimp.gut.tx.tsv; rm tmp.tsv

Next, I move the resulting file sp.tblastn.Bimp.gut.tx.tsv to my laptop for final processing in R using the script
SwissProt.annotations.for.BeeSeq.R

GO terms

SwissProt provides highly curated protein sequences that include gene ontology (GO) terms. Start by extracting the UniProt ID numbers from the file I downloaded with curated SwissProt entries from fruit flies and Hymenoptera.

grep '^>sp' uniprot.Dmel.Hymenoptera.fa | cut -d '|' -f 2 \
  > uniprot.Dmel.Hymenoptera.SwissProt.IDs.txt

For sequences that don't already have an association with a UniProt ID, I'll BLAST them to a database created from all insect SwissProt entries. (There's currently 9118 such sequences.) I downloaded that dataset and named it SwissProt.Insecta.fa.

cd /research/drangeli/BLAST_DBs/SwissProt_batch_queries/
sed 's/>sp|/>/' SwissProt.Insecta.fa | sed 's/>tr|/>/' | sed 's/ OS=/|/' | sed 's/ GN=.*//' | sed 's/ PE=.*//' | sed -e 's/\s/|/' | sed -e 's/\s/_/g' | sed -e 's/\,//g' > SwissProt.Insecta.custom.labels.fa

Create the BLAST database.

/export/local/bin/makeblastdb \
  -in /research/drangeli/BLAST_DBs/SwissProt_batch_queries/SwissProt.Insecta.custom.labels.fa \
  -out /research/drangeli/BLAST_DBs/SwissProt_Insecta_BLASTdb/SwissProt_Insecta_BLASTdb -parse_seqids -dbtype prot \
  -title "Reviewed Insect SwissProt Entries"

Make the BLASTx search.

cd /export/groups/drangeli/beeSeq.project/rna.classic/
blastx -query trinity_out_dir/Bi.trinity.fa \
  -db /research/drangeli/BLAST_DBs/SwissProt_Insecta_BLASTdb/SwissProt_Insecta_BLASTdb \
  -evalue 1e-10 -max_target_seqs 1 -max_hsps 1 -num_threads 16 \
  -outfmt "6 qseqid sseqid pident qlen length mismatch evalue bitscore" \
  > Bimp.gut.tx.BLASTx.sp.Insecta.tsv

Now, extract the UniProt IDs.

cat Bimp.gut.tx.BLASTx.sp.Insecta.tsv | cut -d '|' -f 1 | cut -d$'\t' -f 2 \
  > Bimp.gut.tx.BLASTx.sp.Insecta.SwissProt.IDs.txt

Use the UniProt website at https://www.uniprot.org/uploadlists/ to get the GO term from this list of IDs. Save the resuling file as Bimp.gut.tx.BLASTx.sp.Insecta.SwissProt.IDs.GO.tsv

Combine these into one file with all the relevant UniProt IDs, entry names, and GO terms.

cp uniprot.Dmel.Hymenoptera.SwissProt.IDs.GO.tsv GO.terms.tsv
cat Bimp.gut.tx.BLASTx.sp.Insecta.SwissProt.IDs.GO.tsv >> GO.terms.tsv

These results will be used in R scripts for analysis of GO enrichment.

Next Steps

From this point, the analysis of RNA data is best done in R. The files listed below have been produced up to this point in bash and will be used in the subsequent analysis.

  • BIMP_2.1.gff.mRNA.lines.tsv - A table of mRNA feature IDs and their positions in the reference B. impatiens genome (v2.1)

  • trinity.gmap_BIMP_2.1.positions.tsv - A table of mapped genomic positions for Trinity transcripts

  • merged.3seq.gtf.transcript.lines.tsv - A table of mapped genomic positions for the 3seq reads

  • gene_count_matrix.csv - A matrix of count data for the 3seq samples at the gene level, based on mapping to the genome

  • transcript_count_matrix.csv - Similar to above, but for transcript isoforms

  • *.abundance.tsv - A folder of 96 files of counts for each 3seq sample, based on mapping to the gut transcriptome

These files are first processed in R to associate annotations with the count data.

Misc

Some other useful tricks I learned along the way.

Extract a single sequence from a multi-fasta file

awk 'BEGIN {RS=">"} /NT_176555/ {print ">"$0}' bim_ref_BIMP_2.0_chrUn.fa > scaffolds/NT_176555.fa

awk 'BEGIN {RS=">"} /TRINITY_DN35142_c71_g1_i1/ {print ">"$0}' trinity_out_dir/Bi.trinity.fa > transcripts/TRINITY_DN35142_c71_g1_i1.fa

scp nscc:/export/groups/drangeli/beeSeq.project/rna.classic/transcripts/TRINITY_DN35142_c71_g1_i1.fa .




Return to the top

Dave Angelini, 2018 - bugsinourbackyard.org