Dave Angelini
March 29, 2018
Bombus impatiens (photo: Dave Angelini) |
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v4-v5.
GTGYCAGCMGCCGCGGTAA
CCGYCAATTYMTTTRAGTTT
Parada AE, Needham DM, Fuhrman JA. 2016. Every base matters: assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ Microbiol 18(5):1403-14. Link
Walters W, Hyde ER, Berg-Lyons D, Ackermann G, Humphrey G, Parada A, Gilbert JA, Jansson JK, Caporaso JG, Fuhrman JA, Apprill A, Knight B. 2015. Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. mSystems 1(1):e00009-15. Link
In order to make Qiime work more smoothly, I'll rename the raw data files to have names without any special characters like dashes or underscores.
This can be done with a few variations on…
Set the Qiime2 docker alias. Note that I updated to the June 2018 Qiime2 release.
Import the data and summarize.
I ran the dada2 denoise process under a range of truncation parameters using a bash script to run the denoise command and produce a feature.table.qza
and tsv
file. I used the total number of features (presumptive OTUs) as the metric to choose.
The parameters that produce the maximum number of features are --p-trunc-len-f 260 --p-trunc-len-r 175
, corresponding to a quality score cutoff of roughly 34 as seen in the quality plot in the 16s.v4.demux.qzv
file.
Export the feature table to a format that can be read by R.
The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 16s.v4.taxonomy.tsv
, which is generated later in this workflow.
A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.
Generate a tree for phylogenetic diversity analyses.
Determine the optimal rarefaction level.
Optimal rarefaction looks to be about 2000.
Calculate diversity metrics. Qiime2’s core diversity metrics include:
Alpha diversity
Beta diversity
Test for associations between metadata categories and values and the alpha diversity metrics.
Comparisons of beta diversity by metadata categories can be made via PermANOVA.
PCA with a covariate.
Naive Bayes classifiers have been built from the SILVA release 132 dataset. One was provided by the Qiime2 developers. I also experimented with building my own classifier. I had no success creating a better classifier by narrowing the sequences in the training set to the region used in our own sequencing. This approach is said to improve results, but I had abysmal classifications using that approach.
Much more successful was the construction of a classifier using the full Silva dataset. On the cownose ray project, this classifier worked as well or better than the one provided by the Qiime developers. The one possible limitation is that it uses rank-free taxonomy, and while most entries have a 7-level system, some include intermediary taxon names.
Below, I'll use both classifiers on our dataset.
Qiime's provided Silva full-length classifier
The median posterior support for taxonomy assignments is 0.9843.
Our in-house-built Silva full-length classifier
The mean posterior support for taxonomy assignments is 0.9921.
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
ANCOM (ANalysis of COMposition) is a statistical method for the comparisons of feature abundance in sample groups. It should be robust to distributional assumptions and scaling to large sample sizes. ANCOM also assumes that relatively few (< ~ 25%) of the features are changing between groups.
There are 23 OTUs that differ significantly in abundance among the treatment groups. I created a generic R function to plot Qiime2's ANCOM output as a series of ggplot facets.
This is interesting, but most of the significant differences appear to involve the free-foraging treatment, compared to the enclosed colonies. What might be a more useful analysis is to focus on the enclosed treatments at the last time point.
Data from both regions of 16S identified the same, single OTU as differentially expressed among treatments. Both classifers identify it as "uncultured Acetobacteraceae bacterium" with greater than 90% probability. It appears to show a dose-responsive decrease in abdundance.
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 16S v6-v8.
ACGCGHNRAACCTTACC
ACGGGCRGTGWGTRCAA
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link
Set the Qiime2 docker alias.
Import the data and summarize.
Again, I tested a range of --p-trunc-len-f
and --p-trunc-len-r
values, using a bash script.
The optimum values to maximize feature/OTU number is --p-trunc-len-f 275 --p-trunc-len-r 185
. These reads were very high quality, and these truncation values are roughly a quality cutoff of 35.
The feature table is worth exporting to a format that can be read by R.
The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 16s.v6.taxonomy.tsv
, which is generated later in this workflow.
A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.
Generate a tree for phylogenetic diversity analyses.
Determine the optimal rarefaction level.
Optimal rarefaction looks to be about 1000.
Calculate diversity metrics.
Test for associations between metadata categories and values and the alpha diversity metrics.
Comparisons of beta diversity by metadata categories can be made via PermANOVA.
PCA with a covariate.
Qiime's provided Silva full-length classifier
The median posterior support for taxonomy assignments is 0.9479.
Our in-house-built Silva full-length classifier
The median posterior support for taxonomy assignments is 0.9620.
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
ANCOM
ANCOM identified 30 OTUs with differential abundance among the treatments. As above, I'll filter down to just the day 25 data for the enclosed treatments.
Another way to examine differences among treatments is the use of balance trees as implemented by the gneiss
algorithm Morton et al. 2017. Qiime has an instructive tutorial on its application of gneiss.
Based on the OLS R2, the regression using treatment as the only factor can explain about 11.2% of all community variation.
From inspection of the qzv
file, it looks like split y2 distinguishes the high dose treatment from the others.
The log ratio for these taxa is almost completely non-overlapping for treatments A and D, and the treatment medians show a progressive dose effect. The taxa that are reduced in the high dose treatment here are Saccharibacter sp. and 6 unidentified Enterobacteriaceae.
The gradient clustering variation tells a similar story.
The OLS regression to treatment explains about 11.2% of variation. This analysis suggests a larger group of taxa that are reduced in the treated groups, although the effect is not quite as strong. The taxa affected include Saccharibacter, Lactobacillus, several Acetobacteraceae, Pantoea, Asaia, Curtobacterium, Sphingomonas, and others OTUs without good taxonomic resolution.
Paired-end 2x300bp MiSeq libraries were made by Andre Comeau at the Dalhousie University IMR, from amplicons targetting 18S v4.
CYGCGGTAATTCCAGCTC
CRAAGAYGATYAGATACCRT
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C (2011) Arctic Ocean Microbial Community Structure before and after the 2007 Record Sea Ice Minimum. PLoS ONE 6(11): e27492. Link
Set the Qiime2 docker alias.
Import the data and summarize.
The quality plot in the qzv
file is used to estimate an appropriate point for trimming reads.
My initial 2017 analysis (using Qiime2 release 2017.2) classified one of the 18S OTUs as Apicystis bombi. I was concerned that I wasn't seeing that again. Several repeats with more recent releases failed to recover that taxon, and I became concerned that the filtering stage might introduce a fair be of stochasticity into the the analysis.
Therefore, I ran the dada2 denoise process under a range of truncation parameters using a bash script to run the denoise command and produce feature.table.qza
and tsv
files. I used the total number of features (presumptive OTUs) as the metric to choose.
The maximum number of features/OTUs appears at --p-trunc-len-f 290 --p-trunc-len-r 175
. Gratifyingly, Apicystis bombi is among the taxa identified from these sequences.
The feature table is exported as a tsv
file.
The resulting tab-separated values (tsv) file is a table of counts for each feature in each sample. This can be imported into R for custom statistical tests, plots, or other analyses. The features in this table are named by their UUIDs, but their taxonomy assignments can be matched up to the ID using 18s.taxonomy.tsv
, which is generated later in this workflow.
A list of features with links to BLAST search. This isn’t so immediately useful, it can be used as a tool to follow up on the individual identity of particular features/taxa later.
Generate a tree for phylogenetic diversity analyses.
Determine the optimal rarefaction level.
Optimal rarefaction looks to be about 300. Calculate diversity metrics.
Test for associations between metadata categories and values and the alpha diversity metrics.
Comparisons of beta diversity by metadata categories can be made via PermANOVA.
PCA with a covariate.
Qiime's provided Silva full-length classifier
The median posterior support for taxonomy assignments is 0.9431.
Our in-house-built Silva full-length classifier
The median posterior support for taxonomy assignments is 0.9854.
Calculate pseudocounts (counts +1) to avoid log-of-zero errors.
ANCOM
Nine features were significantly different in abundance among treatments. However, all of them appear to be either over-represented in the free-foraging treatment (8, mostly chloroplast sequences) or not (1 yeast).
After filtering down to just the day 25 enclosed colony data, ANCOM failed to identify any 18S OTUs with significant differences in expression.
BLAH.
Sequenced by the UC Davis DNA Technology Core from RNA isolated in the Angelini Lab at Colby College. Bombus impatiens samples originated from commercial hives raised at the UMass Cranberry Research Station. All samples were poly-A selected.
Classic RNAseq –- 90-bp single-end reads on one Illumina HiSeq lane with 1,487,663,804 reads from 4 samples:
3'-end seq –- 90-bp single-end reads from the 3' UTR.
These data were processed on Colby's high-preformance computing cluster nscc.
Check out one or more of the 3seq files.
Here are the classic RNAseq runs.
Trimmomatic will trim sequences based on the read quality. I'll start with the classic RNAseq read files.
This will get old fast. So I created a shell script trimmomatic.sh
to run Trimmomatic on all the remaining read files, using the same parameters.
Unzip the files to get line counts.
Dividing the line count total by 4 gives the total number of reads passing the filter criteria: 358,759,635.
Next, I moved the filtered read files to a separate folder.
Another concern for RNAseq data is contamination during sample prep or sequencing or from parasites or commensal organisms living in and on the bee. fastq_screen is a tool to quickly compare short reads to several reference genomes. It relies on previously indexed genomes, using an aligner such as bowtie2. Once a genome is downloaded from NCBI in fasta format, an index can be build as follows.
The fastq_screen.conf
file then directs the program to the aligner (bowtie2 in this case) and the databases used for comparisons. I included the following genomes.
Once the paths to these bowtie2 indices are in the config file, a call to fastq_screen is easy.
The mean percentage of reads that could not be mapped to the B. impatiens genome was 18%.
The results show no evidence of contamination from humans, other lab organisms or molecular reagents. Among the commensals and pathogens screened, only sequences aligning to the mite are present. Varroa destructor is the only mite genome is our dataset, but it's defintely not this species, which is large and obvious externally. It's also unlikely to be evidence of the bumblebee mite, Parasitellus fucorum, also an obvious external mite. There's no genome currently available for tracheal mites, such as Locustacarus buchneri or Acarapis woodi. These species are a possibility, although the bees seemed to be in good health when collected. Other small commensal mites may explain these results. The number of reads mapping to the mite dataset ranges from 0.5% to 10.4% of all processed reads for samples and has a mean of 4.1% across all 3-seq libraries.
Overall, contamination doesn't seem to be a serious problem. However, it may be worth using BLAST to check for mite sequences after assembly.
Before using Trinity to assemble the classic RNA seq reads, combine them into one file. (Later reconsideration: this isn't necessary with the current version of Trinity. It can take multiple input files.)
Trinity was run on nscc via a screen
on node 26. I gave the run 30 processors and 50Gb of memory. The output was captured as a log file.
Trinity includes a perl script that calculates some basic stats from the assembly.
Rename the default output fasta file to something more descriptive.
TransDecoder will identity long open reading frames from putative protein-coding mRNAs.
From trinity_out_dir/
This creates a folder Bi.trinity.fa.transdecoder_dir/
with several output files. The most useful of which is longest_orfs.cds
, which contains the nucleotide sequences from the coding regions (CDS) of each transcript, and longest_orfs.pep
, which has the amino acid translation of those CDS sequences.
One useful part of annotation is using the transcriptome itself as a BLAST database. The makeblastdb
command could be a little finicky, so using complete paths may be helpful.
For what it's worth, I'll also make a hisat2
index based on the Trinity assembly. This will allow the 3seq reads to be mapped to the transcriptome.
BUSCO is a useful, fast utility to compare a transcriptome to a curated reference of taxonomically restricted sequences. The accronym stands for "Benchmarking Universal Single-Copy Orthologs."
First, download the reference databases
Uncompress the archives.
I chose to run the BUSCO python script with the Hymenoptera database, but similar results were obtained with the broader endopterygote dataset.
This ran in 2.8 hours on node 26 using 16 CPUs.
Trinity's assembly algorithm allows for the identification of isoforms created through alternative splicing. Since we're not interested in transcript isoforms, it's tempting to pull the longest isoform for each component (loosely this can be thought of as a "gene"). This can be done with a python script provided with Trinity.
However, as the script warns "the longest isoform isn't always best". I tried this, re-ran Transdecoder to define coding regions and then ran BUSCO on the resulting file. The number of conserved orthologs that could be identified fell drammatically, from 85% in the initial Trinity output to just 58% for longest isoforms. I'm not sure of the reason for this difference, but it suggests that a better approach will be to map reads to the full Trinity assembly and then combine counts by component/gene later.
It's great to a reference genome! The files for B. impatiens can be downloaded directly from NCBI. The calls below are to version 2.1 of the B. impatiens genome, released on March 20, 2018. (My initial analyses used version 2.0, where the number of mapped transcripts was almost exactly the same – just 4 more.)
Read mapping can be done using GMAP, which is available on nscc.
First, create a GMAP index.
Next, align the assembled CDS's to the genome, outputting in SAM format. This runs GMAP using 8 threads.
After the lengthy @SQ
header, the SAM file format consists of tab-separated lines listing the mapped location of Trinity CDS's. The important columns include…
column | description |
---|---|
1 | Trinity transcript ID |
2 | strand (0 = forward, 16 = reverse, 4 = unmapped) |
3 | genomic scaffold ID the transcript was mapped to |
4 | position (left-side) of the transcript's alignment |
10 | DNA sequence |
For example:
So, how many transcripts are mapped?
How many genes does this likely reflect?
Convert the SAM file to BAM (binary sam) format using SAMtools.
Most downstream applications will require that the BAM file be sorted by genomic coordinates.
Now index the BAM file to enable rapid navigation.
Next, I'll use an R script to associate the mapped transcripts with the genome's annotations. I'll start by extracting just the positional information and sequence from the SAM file.
The 3seq reads can be mapped in two phases: First to the genome, usng HISAT2 and StringTie. Second to the de novo gut transcriptome assembly produced in this project, using Kallisto.
HISAT2 is optimized for this sort of short read mapping to a reference genome. It requires one or more fasta files as input. These should include the entire reference genome (or transcriptome).
Next, you map reads for each filtered 3seq fastq file. I created a shell script to run hisat2 for each file.
Here's an example line:
Move the unmapped read files into a separate folder.
Another script converts to the BAM format and sorts each file using 30 processors.
Example line:
To confirm that the BAM file is in fact sorted, look for SO:coordinate
in the first line. Then, remove the SAM files to save space. They take up about 8 times more drive space.
StringTie is used to calculate abundances for mapped reads. If needed download the complied Linux x86_64 binary.
StringTie views reference features that aren't completely covered by reads as novel, and therefore it will not apply the annotation information. This is a problem for 3seq data, which by its nature will never cover an entire mRNA annotation. For this reason, it's probably best to run it initially without the -e
option (which excludes the possibility of transcripts not present in the reference), then take the resulting GTF file and merge it with the reference. Then re-run StringTie with the merged GTF reference to obtain relative abundances. This workflow is recommended by the StringTie documentation. Again, since these data are all from 3' ends, it's likely that none of them will provide complete transcripts, but it's worth doing this to produce a single consensus GTF file.
I have a script to run this for each sample.
Example line:
Now merge the individual GTF files. First, create text file listing all the .gtf
file names.
Next repeat the StringTie run, using the merged GTF file as the reference. This will also output tab-delimited abundance files and input files for Ballgown –- which we won't use.
Example line:
The authors of StringTie provide a python script for conversion of the Ballgown data into a format for use with R's DEseq2 package.
Create a sample list file.
Run the python script.
That will create two count tables for genes and transcripts (gene_count_matrix.csv
and transcript_count_matrix.csv
). These files can be imported into R for analysis and associated with genomic features at their indicated positions (see the next section).
What's needed at this point is the annotation of the gene/transcript counts. StringTie is conservative. Without reads spanning all of a feature, the software won't use the annotation information from the reference genome. However, the file merged.3seq.gtf
includes the mapped genome locations for each MSTRG
identification number. The problem is to locate these positions in the genome annotation file ../genome/GCF_000188095.2_BIMP_2.1_genomic.gff
and associate the MSTRG numbers to their annotations. This is best accomplished using R. But first, bash can be used to make the read-map and genomic annotation data more tidy.
Start by extracting just transcript lines from the merged read-map GTF file.
Next, extract just mRNA lines from the scaffold gff3 file.
The read maps and genomic annotations are associated using the R script collating.MSTRG.to.Bimp.gene.annotations.R
. The product of this script is a tab-delimited file annotated.srtingtie.MSTRG.genes.tsv
which lists genomic scaffold positions for predicted mRNAs with the StringTie gene IDs, as well as applying the GO terms and UniProt IDs based on GenBank IDs resulting from BLAST searches with the assembled transcript sequences.
The resulting table has 30,186 entries for putative mRNAs in the reference genome that are associated with our transcriptome. Of these, 22,551 are annotated with a GenBank ID and 8,844 have GO annotations.
Kallisto will quantify the abundance of transcripts from RNA-Seq data using a pseudoalignment method. First, create a Kallisto index from the Trinity assembled transcripts. I intentionally do this using the complete assembly, not the CDS files, because of course if there are fragmentary or non-coding RNAs that were sequenced, then some reads will correspond to those RNAs too.
In order to run Kallisto on single-end sequence, it wants to know the mean and standard deviation of read lengths. First, use bash to find the total number of filtered reads.
That's 350,566,162. Next, calculate the mean and SD.
Next, I wrote a bash script to run kallisto quant
for each sample and to rename the output files with the sample names.
Example lines:
The result is a folder kallisto.filtered.reads.vs.gut.tx
that contains .abundance.tsv
files for each sample, listing the counts for each transcript assembled by Trinity. The easiest way to combine these data into a single matrix and to merge isoforms into counts for each gene is by using R. The R script merging.kallisto.results.R
accomplishes this, and also removes counts from contaminant transcript sequences and saves them separately. The script's output is two new tables:
Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv
contaminant.Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv
One potential problem with these files, during later use in R, is the presence of pipe characters in the Trinity gene IDs. R can misinterpret |
as the OR logical operator. The pipes can be removed using bash.
I pulled curated SwissProt entries from Drosophila melanogaster and any Hymenoptera. Direct link here.
Next, I created a modified version of the FASTA file, with sequence labels that will behave well in BLAST, with pipe delimiters between the IDs and replacing space characters with underscores in human-readable gene names.
Now use the FASTA file for multiple local blast searches to our B. impatiens gut transcriptome.
This provided annotations for 3250 genes.
It's entirely possible that the mRNA sequences in the transcriptome or 3seq data originate with xenobiotic sources, such as bacteria, fungi or parasites, ingested pollen, or contaminating organisms from the bee's surface, the lab, or us humans. To identify and remove any potential contaminant sequences I'll BLAST the transcripts to a database created from mRNA sequences of these taxa in GenBank. I also included higher taxa for any organisms that appeared in the preliminary 18S microbiome identifications. The list of taxa included the following:
Bradyrhizobium, Homo sapiens, Pinus, Actinobacteria, Bacteroidetes, Burkholderiales, Burkholderiales, Clostridiales, Comamonadaceae, Conoidasida, Conopidae, Crenarchaeota, Dinoflagellata, Enterobacteriaceae, Enterobacteriales, Enterococcaceae, Escherichia, Euryarchaeota, Firmicutes, Flavobacteriales, Lactobacillales, Magnoliophyte (from pollen, plastid and mitochondrion only), Mesostigmata, Microsporidia, Nematoda, Proteobacteria, Rhodospirillales, Saccharomycetes, Taphrinomycetes, Trombidiformes, Trypanosoma, Trypanosomatidae, Zygomycota
NCBI doesn't have a great bash API for GenBank downloads of this sort. So I had to manually download these FASTA files, then upload them to nscc:/export/groups/drangeli/beeSeq.project/rna.classic/potential.contaminants
The taxon-specific files may be useful for other projects, but I'll combine these as a single file for use now.
I'll also add in a version of the B. impatiens reference genome RNA dataset. However, I'll mark the GenBank IDs numbers to make them easy to identify and remove from the output.
First I'll use FASTX-ToolKit to convert the FASTA file to tabular format. (Beware that this removes the "greater than" sign, >
, that begins an ID line in the standard FASTA format.
Remove non-alphanumeric characters from the descriptions, and take this opportunity to exclude entries where the ID + sequence is less than 100 characters in length.
Delimit the ID from the rest of the description.
Sort the entries by the ID column (the first) and retain only those with unique IDs.
For some reason, there are some sequences that makeblastdb
considers to be duplicates. These are complete and partial mRNAs from Neospora caninum Liverpool. Note the line numbers and delete them using sed.
Next, I'll use bash commands to convert it back into standard FASTA format. Then clean-up the tmp
files.
This yeilds 1,921,753 unique sequences, based on ID, of which 21,241 are from B. impatiens.
Search the database of potential contaminant mRNA sequences using the B. impatiens gut CDS dataset. The command below uses 16 processors and outputs in the same format as the annotation tBLASTn, except that it keeps only the single best hit. The e-value filter is not set at 1e-10, no more stringently than for normal annotations.
Remove lines with hits to the B. impatiens genome.
Extract the GenBank IDs from this file.
Reduce this to a list of unique entries.
Only 355 of the 130,267 transcript CDS's (0.2725%) have significant hits to the contaminant dataset. I used GenBank's batch download feature to get full GenBank entries as contaminants.gb
. For convenience, extract the gene descriptions and organism names.
I then manually collated Bimp.gut.cds.blastn.potential.contaminants.tsv
with the gene and organism data from contaminant.short.list.tsv
into a file called Bimp.gut.cds.contaminant.info.tsv
. I made reciprocal searchs of the best hit isoform for each gene, online using BLASTx against the non-redundant (NR) database. This revealed 157 transcripts that actually did have a strong hit to B. impatiens. These were removed from the contaminant list. What remained was a list of 184 likely contaminant transcripts, Bimp.gut.cds.contaminant.curated.info.tsv
.
A list of the species sources of the contaminant sequences…
Most of the positive hits are to yeast and nematodes. It is likely that these are not the actual source species, but the closest matches available in this dataset.
These CDS come from 96 separate genes that can be removed prior to differential expression analysis. Amazingly, this small number of genes accounts for 4.09% of mapped 3seq reads. Most are from genes known to be highly expressed, such as ribosomal proteins and enzymes of cellular respiration.
R's string handling functions don't deal well with the pipe character. So, we can replace it with the tab character via sed
. First, create a back-up copy.
Next, I move the resulting file sp.tblastn.Bimp.gut.tx.tsv
to my laptop for final processing in R using the script
SwissProt.annotations.for.BeeSeq.R
SwissProt provides highly curated protein sequences that include gene ontology (GO) terms. Start by extracting the UniProt ID numbers from the file I downloaded with curated SwissProt entries from fruit flies and Hymenoptera.
For sequences that don't already have an association with a UniProt ID, I'll BLAST them to a database created from all insect SwissProt entries. (There's currently 9118 such sequences.) I downloaded that dataset and named it SwissProt.Insecta.fa
.
Create the BLAST database.
Make the BLASTx search.
Now, extract the UniProt IDs.
Use the UniProt website at https://www.uniprot.org/uploadlists/ to get the GO term from this list of IDs. Save the resuling file as Bimp.gut.tx.BLASTx.sp.Insecta.SwissProt.IDs.GO.tsv
Combine these into one file with all the relevant UniProt IDs, entry names, and GO terms.
These results will be used in R scripts for analysis of GO enrichment.
From this point, the analysis of RNA data is best done in R. The files listed below have been produced up to this point in bash and will be used in the subsequent analysis.
BIMP_2.1.gff.mRNA.lines.tsv
–- A table of mRNA feature IDs and their positions in the reference B. impatiens genome (v2.1)
trinity.gmap_BIMP_2.1.positions.tsv
–- A table of mapped genomic positions for Trinity transcripts
merged.3seq.gtf.transcript.lines.tsv
–- A table of mapped genomic positions for the 3seq reads
gene_count_matrix.csv
–- A matrix of count data for the 3seq samples at the gene level, based on mapping to the genome
transcript_count_matrix.csv
–- Similar to above, but for transcript isoforms
*.abundance.tsv
–- A folder of 96 files of counts for each 3seq sample, based on mapping to the gut transcriptome
These files are first processed in R to associate annotations with the count data.
Some other useful tricks I learned along the way.
Dave Angelini, 2018 –- bugsinourbackyard.org