BeeSeq 2.0 - HackMD

# BeeSeq 2.0 Following rejection from *Science*, I want to update the RNA analysis for the BeeSeq project. (The original HackMD workflow is [here](https://hackmd.io/@aphanotus/BeeSeqProject).) At a minimum, I've learned we should include a tool, CD-HIT-EST, to collapse transcript isoforms into a single "gene" for transcriptomes. This will pick up from the previous analysis after running [Trinity](https://hackmd.io/@dts8RULgQqi0n0PPDKh7JQ/ByvobUtbQ#Trinity). ### Compressing transcript variants This will be done using [CD-HIT-EST](https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST). ```bash cd /export/groups/drangeli/beeSeq.project/rna.classic cd-hit-est -i trinity_out_dir/Bi.trinity.fa -o Bi.trinity.c90.fa -c 0.90 -n 10 -d 0 -M 0 -T 30 grep '>' -c Bi.trinity.c90.fa TransDecoder.LongOrfs -t Bi.trinity.c95n8.fa grep '>' -c Bi.trinity.c90.fa.transdecoder_dir/longest_orfs.cds ``` I tried a numnber of variations on the `-c` threshold parameter. Unfortunately, the resulting number of sequenes appears to scale linearly. ![](https://i.imgur.com/kVACKeN.png) I'll usethe 0.90 threshold. Here's what the `.clstr` file ouput looks like. ``` >Cluster 0 0 35100nt, >TRINITY_DN38988_c2_g1_i1... * >Cluster 1 0 25778nt, >TRINITY_DN38979_c0_g1_i1... at +/98.56% 1 25598nt, >TRINITY_DN38979_c0_g1_i2... at +/98.10% 2 25694nt, >TRINITY_DN38979_c0_g1_i3... at +/98.17% 3 25682nt, >TRINITY_DN38979_c0_g1_i4... at +/98.50% 4 999nt, >TRINITY_DN38979_c0_g1_i5... at +/94.89% 5 25893nt, >TRINITY_DN38979_c0_g1_i6... * 6 24632nt, >TRINITY_DN38979_c0_g1_i7... at +/99.90% 7 25713nt, >TRINITY_DN38979_c0_g1_i8... at +/97.85% 8 22853nt, >TRINITY_DN38979_c0_g1_i10... at +/99.89% ``` Later, it will be helpful to have the cd-hit-est clusters in an easily R-readable format. I'll parse the file now to have clustered transcript IDs appear all on the same line. ```bash cat Bi.trinity.c90.fa.clstr | sed -n -e 'H;${x;s/\n/\t/g;p;}' \ | sed -e 's/>Cluster /\n>Cluster_/g' -e 's/\t[0-9]*\t/\t/g' \ -e 's/[0-9]*nt\, >//g' -e 's/[.*/+]//g' -e 's/ at [0-9]*\%//g' \ -e 's/ //g' -e 's/at-[0-9]*\%//g' \ > Bi.trinity.c90.fa.clstr.tsv ``` ### BUSCO analysis BUSCO is a useful, fast utility to compare a transcriptome to a curated reference of taxonomically restricted sequences. ```bash python /export/local/src/busco-2.0.1/BUSCO.py \ -i /export/groups/drangeli/beeSeq.project/rna.classic/Bi.trinity.c90.fa.transdecoder_dir/longest_orfs.cds \ -o BUSCO.report \ -l /research/drangeli/DB/busco.lineages/hymenoptera_odb9 \ -m tran -c 46 ``` ``` C:77.2%[S:47.5%,D:29.7%],F:12.5%,M:10.3%,n:4415 3409 Complete BUSCOs (C) 2096 Complete and single-copy BUSCOs (S) 1313 Complete and duplicated BUSCOs (D) 552 Fragmented BUSCOs (F) 454 Missing BUSCOs (M) 4415 Total BUSCO groups searched ``` ## Annotation of the gut transcriptome ### Mapping transcripts to the genome #### Download the genome There's a new release of for [*B. impatiens*](https://www.ncbi.nlm.nih.gov/genome/3415?genome_assembly_id=365287) genome, version 2.2. ```bash cd /export/groups/drangeli/beeSeq.project/genome curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/README.txt" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_assembly_report.txt" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_assembly_stats.txt" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_feature_table.txt.gz" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_genomic.fna.gz" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_genomic.gff.gz" curl -L -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/188/095/GCF_000188095.3_BIMP_2.2/GCF_000188095.3_BIMP_2.2_rna.fna.gz" gunzip *.gz ``` How many genes are annotated in the genome? ```bash grep 'gbkey=CDS\;gene=' GCF_000188095.3_BIMP_2.2_genomic.gff | sed 's/gene=LOC/z/' | sed 's/;product=/z/' | cut -d "z" -f 2 | sort -u | wc -l ``` 10558 ### GMAP Read mapping can be done using [GMAP](https://academic.oup.com/bioinformatics/article/21/9/1859/409207), which is available on **nscc**. First, create a GMAP index. ```bash gmap_build -d gmap_BIMP_2.2 -D . -k 13 GCF_000188095.3_BIMP_2.2_genomic.fna ``` Next, align the assembled CDS's to the genome, outputting in SAM format. This runs GMAP using 46 threads. ```bash cd /export/groups/drangeli/beeSeq.project/rna.classic/ /usr/local/src/gmap-2018-01-26/src/gmap \ -n 0 -D . -d ../genome/gmap_BIMP_2.2 \ Bi.trinity.c90.fa.transdecoder_dir/longest_orfs.cds \ -f samse -t 46 > Bi.trinity.c90.gmap_BIMP_2.2.sam ``` After the lengthy `@SQ` header, the SAM file format consists of tab-separated lines listing the mapped location of Trinity CDS's. The important columns include... | column | description |:--:|:------------- | 1 | Trinity transcript ID | 2 | strand (0 = forward, 16 = reverse, 4 = unmapped) | 3 | genomic scaffold ID the transcript was mapped to | 4 | position (left-side) of the transcript's alignment | 10 | DNA sequence For example: ``` TRINITY_DN59_c0_g1_i1|m.3 16 NT_176837.1 3522516 40 300M * 0 0 TCAATACAAAAAATAATC... ``` So, how many transcripts are mapped? ```bash grep 'TRINITY_DN' Bi.trinity.c90.gmap_BIMP_2.2.sam | wc -l ``` 58929 -- Since there were only 58358 transcripts in this transcriptome, there's a little redundancy. How many Trinity genes does this reflect? ```bash grep 'TRINITY_DN' Bi.trinity.c90.gmap_BIMP_2.2.sam | sed 's/_i/X/' | cut -d "X" -f 1 | sort -u | wc -l ``` 16608 -- Which is more than the 10558 genes in the genome annotation. Convert the SAM file to BAM (binary sam) format using [SAMtools](http://www.htslib.org/). Most downstream applications will require that the BAM file be sorted by genomic coordinates and indexed. ```bash samtools view -Sb Bi.trinity.c90.gmap_BIMP_2.2.sam > Bi.trinity.c90.gmap_BIMP_2.2.bam samtools sort Bi.trinity.c90.gmap_BIMP_2.2.bam Bi.trinity.c90.gmap_BIMP_2.2 samtools index Bi.trinity.c90.gmap_BIMP_2.2.bam ``` ### Munging GMAP results for R Next, I'll use an R script to associate the mapped transcripts with the genome's annotations. I'll start by extracting just the positional information and sequence from the SAM file. ```bash grep "TRINITY" Bi.trinity.c90.gmap_BIMP_2.2.sam | cut -f-1,2,3,4,10 > Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv ``` Looking again at those lines that don't have the column 2 "4" unmapped flag. ```bash cat Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv | cut -f 2 | grep '^4' -v -c ``` 58582 -- Or 347 transcripts that are in here as "unmapped" I'll modify this file to only contain those that are mapped. ```bash cat Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv | grep '\t4\t' | cut -f 1-4 > Bi.trinity.c90.gmap_BIMP_2.2.unmapped.tsv cat Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv | grep '\t4\t' -v | cut -f 1-4 > tmp mv tmp Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv ``` ### A note on mapping In the final analysis, I actually only used counts of the 3seq reads mapped to the de novo gut transcriptome. But this used the original Trinity assembly, not a derivative from CD-HIT-EST or TransDecoder. Therefore, the analysis will actually not changed based on the compressed transcriptome or the genome update -- except in two ways: 1. The CD-HIT-EST clustering should be used to "compress" transcript names, so that the counts reflect (as best we can gauge) the gene level, not the transcript level. 2. The annotation of the new genome version should be used to update the annotations placed on transcriptome. ## Annotation For what it's worth, I'll update the annotation process we used last year. ### BLAST to a custom SwissProt dataset I pulled curated SwissProt entries from *Drosophila melanogaster* and any Hymenoptera. [Direct link here.](http://www.uniprot.org/uniprot/?query=organism:%22Drosophila%20melanogaster%20(Fruit%20fly)%20%5B7227%5D%22%20OR%20taxonomy:%22Hymenoptera%20%5B7399%5D%22&fil=reviewed%3Ayes&sort=score) Next, I created a modified version of the FASTA file, with sequence labels that will behave well in BLAST, with pipe delimiters between the IDs and replacing space characters with underscores in human-readable gene names. ```bash sed 's/>sp|/>/' uniprot.Dmel.Hymenoptera.fa | sed 's/>tr|/>/' | sed 's/ OS=/|/' | sed 's/ GN=.*//' | sed 's/ PE=.*//' | sed -e 's/\s/|/' | sed -e 's/\s/_/g' | sed -e 's/\,//g' > uniprot.Dmel.Hymenoptera.custom.labels.fa ``` I'll use the same BLAST database as before, which includes thre TransDecoder CDS from the unclustered Trinity assembly Now use the UniProt file for multiple local blast searches to the gut transcriptome. ```bash cd /export/groups/drangeli/beeSeq.project/rna.classic/ tblastn -query ../uniprot.Dmel.Hymenoptera.custom.labels.fa \ -db /export/groups/drangeli/beeSeq.project/Bimp_gut_CDS_BLASTdb/Bimp_gut_CDS_BLASTdb \ -evalue 1e-10 -max_target_seqs 10 -max_hsps 1 -num_threads 46 \ -outfmt "6 qseqid sseqid pident qlen length mismatch evalue bitscore" \ > sp.tblastn.Bimp.gut.tx.191214.tsv cut -f 2 sp.tblastn.Bimp.gut.tx.191214.tsv | sed 's/_i/X/' | cut -d 'X' -f 1 | sort -u | wc -l ``` This provides annotations for 3332 genes in the transcriptome. That's about 100 more than the last time I did this. ### Applying annotations to count data I had previously combined all the Kallisto expression counts using a an R script called `merging.kallisto.results.R`. This produced a single table file `Bi.3seq.filtered.vs.gut.tx.kal.abundance.csv` with the counts for each of 97317 Trinity transcripts. The task now is to collapse these count data based on two sources of information: (1) the cd-hit-est results and (2) the mapping of transcripts to the genome (`Bi.trinity.c90.gmap_BIMP_2.2.positions.tsv` -- Transcripts that map to be same genomic feature / gene should be treated as isoforms.) Then annotation informtion should be applied to the count data.