# Viral Metagenomics Pipeline
## Step 1 - Quality Control
General notes for this section:
* `-Xmx` specifies the amount of memory available to the process.
* The usejni flag should be set to t whenever possible.
* Ending a command with the append operator (>>) directs the output to a file that can be read at a later time.
* These lines were written to be run in a loop and can easily be set up that way with a consistent naming scheme and a for/while loop structure; a minimal example loop is sketched after this list.
* After merging, make sure to specify merged or unmerged [m or um] in the output file names, including the stats output files, to avoid overwriting files. The files saving the outputs do not strictly need this, but including [m/um] leads to better organization.
* Separate lanes should be treated as separate samples.
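For example, a minimal sketch of the loop structure for one of these steps (the sample/lane names are placeholders; substitute your own naming scheme):
```
for sampling in sampleA_lane1 sampleA_lane2 sampleB_lane1; do
  # predict adapters for this sampling (same bbmerge.sh call as below, parameterized)
  bbmerge.sh in1=sequencing/rawreads/${sampling}_R1.fastq.gz \
             in2=sequencing/rawreads/${sampling}_R2.fastq.gz \
             outa=sequencing/merged/${sampling}_adapters.fa \
             -Xmx115g usejni=t
done
```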
### Merge
###### Create a file of predicted adapter sequences
`bbmerge.sh in1=sequencing/rawreads/[sampling]_R1.fastq.gz in2=sequencing/rawreads/[sampling]_R2.fastq.gz outa=sequencing/merged/[sampling]_adapters.fa -Xmx115g usejni=t `
###### Merge R1 and R2 files using the predicted adapters
`bbmerge.sh in1=sequencing/rawreads/[sampling]_R1.fastq.gz in2=sequencing/rawreads/[sampling]_R2.fastq.gz out=sequencing/merged/[sampling]_merged.fastq.gz outu=sequencing/merged/[sampling]_unmerged.fastq.gz adapter=sequencing/merged/[sampling]_adapters.fa rem k=62 extend2=50 ecct vstrict usejni=t -Xmx115g >> sequencing/merged/logs/[sampling]_merge.log 2>&1`
### Trim artifacts
The input [merge.out] will be the individual merged and unmerged files from each sampling; process them individually. The sequencing artifacts file is provided by BBTools.
`bbduk.sh in=[merge.out] out=sequencing/trimart/[sampling]_[m/um]_trimart.fastq.gz ref=sequencingartifacts.fa k=31 hdist=1 ftm=5 stats=sequencing/trimart/stats/[sampling]_[m/um]_art.stats -Xmx115g >> sequencing/trimart/logs/[sampling]_[m/u]_art.log 2>&1`
### Trim Adapters
The input [art_trim.out] will be the individual files from the previous step. The adapters file is provided by BBTools.
`bbduk.sh in=[art_trim.out] out=sequencing/trimadapt/[sampling]_[m/um]_adapt.fastq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=20 tpe tbo stats=sequencing/trimadapt/stats/[sampling]_[m/um]_adapt.stats -Xmx115g >> sequencing/trimadapt/logs/[sampling]_[m/u]_adapt.log 2>&1`
### Filter Contamination
The [adapt_trim.out] file is the output from the previous step. The bbmaskindex was downloaded from a SeqAnswers thread created by Brian Bushnell and is used to filter out common laboratory contaminants.
`bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/Users/viralecologylab/Documents/Software/bbmaskindex/ qtrim=rl trimq=10 untrim -Xmx115g in=[adapt_trim.out] outu=sequencing/mask/[sampling]_[m/u]_mask.fastq.gz >> sequencing/mask/logs/[sampling]_[m/u]_mask.log 2>&1`
### Mask Low Entropy
The mask.out file is the output from the previous step.
`bbmask.sh -Xmx115g in=[mask.out] out=sequencing/entropy/[sampling]_[m/u]_ent.fastq.gz entropy=0.50 >> sequencing/entropy/logs/[sampling]_[m/u]_ent.log 2>&1`
## Step 2 - Assembly
For individual assemblies, only include lanes from the same sample. For co-assemblies, include all lanes from all samples. Setting the minimum contig length to 1,000 bp will help speed up downstream analysis; to do this, add `--min-contig-len 1000`.
`megahit --read [merged_fastqs] --12 [unmerged_fastqs] -o [outdir]`
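For example, a co-assembly of two samples might look like the sketch below (file names, thread count, and output directory are placeholders; MEGAHIT accepts comma-separated file lists for `--read` and `--12`):
```
megahit --read sampleA_m_ent.fastq.gz,sampleB_m_ent.fastq.gz \
        --12 sampleA_um_ent.fastq.gz,sampleB_um_ent.fastq.gz \
        --min-contig-len 1000 -t 15 -o coassembly_megahit
```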
If you intend to use Anvi'o, I would note that now is a good time to run the Anvi'o reformat tool (`anvi-script-reformat-fasta`) on your contigs file to ensure that the names will be compatible with Anvi'o.
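A minimal sketch of that reformat step, assuming MEGAHIT's default contig file name and placeholder output names (keep the report file; it maps old names to new ones and is referenced again in Step 5):
```
anvi-script-reformat-fasta [outdir]/final.contigs.fa \
    -o contigs_anvi.fa \
    --simplify-names \
    --report-file contigs_rename_report.txt
```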
## Step 3 - Identify Viral Sequences
You can use as many viral identification tools as you want. This walkthrough provides the means to use three (VirSorter2, VIBRANT, and CheckV). If you choose to use another, keep in mind that you will want your minimum contig length to be 1,000 bp, since this is the default for VIBRANT. You will need the nucleotide output from your program of choice; simply add those sequences at the "Combine Viral Predictions" step and perform the deduplication.
### Virsorter SOP
#### Virsorter - 1st Pass
VirSorter2 is usually set up in conda, so check the lab computer environments; on the HPRC (Grace), load Anaconda3/2022.05 and activate /sw/hprc/sw/Anaconda3/2022.05/envs/virsorter-2.2.3. If you did not set a minimum contig length at assembly, add `--min-length 1000` for consistency with VIBRANT.
`virsorter run -w [sample_id]_vs2-pass1 -i [assembly_contigs] --include-groups "dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae" -j [threads] all`
#### Virsorter - CheckV
The checkv_db on the TAMU HPRC (grace) is located at: /scratch/data/bio/checkv/checkv-db-v1.2/. For other lab computers check the Documents folders before running the setup.
`checkv end_to_end [sample_id]_vs2-pass1/final-viral-combined.fa [out_dir] -d [/path/to/checkv_db] --remove_tmp -t [threads]`
##### Combine the output nucleotide files from CheckV
`cat [checkv_out]/proviruses.fna [checkv_out]/viruses.fna > [checkv_out]/combined.fna`
#### Virsorter - 2nd Pass
See the first-pass notes on how to run VirSorter2. If you are performing your own annotations, remove `--prep-for-dramv`.
`virsorter run --seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv -i [checkv_out]/combined.fna -w [sample_id]_vs2-pass2 --include-groups "dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae" --min-length 1000 --min-score 0.5 -j [threads] all`
### Vibrant
`VIBRANT_run.py -i [assembly_contigs] -virome -t [threads]`
### CheckV
See the CheckV section above for where to find the database files. CheckV doesn't have a minimum contig length option, so contigs are length-filtered in the next step instead.
`checkv end_to_end [assembly_contigs] [out_dir] -d [database_folder] --remove_tmp -t [threads]`
`cat viruses.fna proviruses.fna > checkv_combo.fna`
#### Filter out contigs with no viral genes identified
The first condition in the awk command finds the names of non-proviral sequences with at least one viral gene and a length of at least 1,000 bp (the length check can be removed if min-length was defined at assembly). The second condition finds the names of proviral sequences with at least one viral gene and a proviral length of at least 1,000 bp (keep this length check even if min-length was set earlier, because CheckV trims the contig to just the proviral portion).
`awk '{if($3=="No" && $2>=1000 && $6>=1) print $1; else if($3=="Yes" && $4>=1000 && $6>=1) print $1;}' quality_summary.tsv > checkv_good.txt`
**Geneious option for parsing:** Open Geneious and import the checkv_combo.fna file. Click Workflows, select Filter, set the options to > 999 bp, and run the workflow. Delete the original file in Geneious. Select the filtered version of the file and click Workflows. Select Run Workflow..., and from the drop-down menu select Extract Sequences By Name. Open checkv_good.txt, find and replace '\n' with ',', select all, then copy and paste into the Geneious window requesting a list of names. Run the workflow.
**Bash option for parsing:** Run this loop; it will take a bit of time.
```for c in `cat checkv_good.txt`; do grep -A1 -e "${c}$" -e "${c}_" checkv_combo.fna >> checkv_combo_filter.fna; done```
### Combine Viral Predictions
Upload the output nucleotide files into Geneious. Highlight the sequence lists and, from the Sequence drop-down menu, choose Group Sequences into List. Remove duplicates by opening the Sequence drop-down menu and choosing Remove Duplicate Reads... (Geneious uses Dedupe from the BBTools package); include the ac=f flag. Export the file as a fasta sequence file.
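If you would rather skip Geneious for this step, the same deduplication can be run directly with Dedupe from BBTools; a sketch with placeholder file names:
```
dedupe.sh in=all_viral_predictions.fasta out=viral_predictions_dedup.fasta ac=f -Xmx115g
```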
Dedupe does not remove sequences that were cut too small for the package to identify them as the same contig. To fix this issue I have written an R script that finds contigs with the same names and retains the largest contig. Ask your lab mates for a copy of "remove_duplicated_seqs.R". The Dedupe step in Geneious mainly lowers the amount of time the R script will run, but in theory it will work with just the R script too.
**STOP HERE AND DECIDE IF YOU WANT TO BIN OR NOT. IT MAY BE HELPFUL DEPENDING ON YOUR QUESTION OR IT MAY NOT. IF IT IS, CONTINUE ON; IF NOT, SKIP TO STEP 6. IF YOU CHOSE TO DO THESE ALL INDIVIDUALLY AND NOT USE A CO-ASSEMBLY AND PLAN TO COMBINE THE CONTIGS, I WOULD BEGIN THE RENAMING PROCESS HERE SO THAT THE CONTIG NAMES REFLECT THE SAMPLE THEY WERE DERIVED FROM.**
## Step 4 - Map Viral Contigs to Raw Reads
For this step you will need to have all of your quality-controlled reads concatenated into the samples they came from. To do this run `cat [sample]_lane[x]_[merged/unmerged].fastq.gz > [sample].fastq.gz`. Include as many files as you have lanes and change [x] as needed.
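For example, a sketch that loops over samples and pulls in every lane's merged and unmerged QC'ed reads (the sample names and file-name pattern are assumptions based on the naming used in Step 1):
```
for s in sampleA sampleB; do
  cat sequencing/entropy/${s}_lane*_m_ent.fastq.gz \
      sequencing/entropy/${s}_lane*_um_ent.fastq.gz > ${s}.fastq.gz
done
```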
### Create the Contig Index
`bowtie2-build [viral_predictions].fasta [viral_index]`
### Map Raw Reads to the Contig Index
`bowtie2 --threads 15 -x [viral_index] -U [sample].fastq.gz --no-unal -S [sample].sam`
### Sort and index BAM files
`samtools view --threads [#] -F 4 -bS [sample].sam > [sample]_raw.bam`
`samtools sort --threads [#] [sample]_raw.bam -o [sample].bam`
`samtools index [sample].bam`
At this point I would highly encourage you to delete the .sam files, _raw.bam files, and any _sort.bam files. All of these files are very large (100s of GB each) and will take up a lot of disk space.
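If disk space is a concern, one option is to pipe bowtie2 straight into samtools so the large .sam and unsorted .bam files are never written to disk; a sketch using the same flags as above:
```
bowtie2 --threads [#] -x [viral_index] -U [sample].fastq.gz --no-unal | \
  samtools view --threads [#] -u -F 4 - | \
  samtools sort --threads [#] -o [sample].bam -
samtools index [sample].bam
```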
## Step 5 - Bin Viral Sequences
`vRhyme -i [viral_predictions].fasta -b bam_folder/*.bam`
Now you should extract the low-contamination viral bins to a separate folder for downstream analysis. To do so, open the summary output file from vRhyme in Excel. Filter the redundancy column to only values of 0 or 1. Copy and paste the bin numbers into a .txt file. Write a while loop to cp the bins to a new folder, something like: `while read b; do cp vRhyme_bin-${b}.fasta low_contam_bins/; done < low_contam_bins.txt` This can be done in any other fashion that produces the same result: a folder with the low-contamination bins (<= 1 redundant protein).
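If you would rather not use Excel, a sketch of the same filter in awk (it assumes the vRhyme summary is tab-separated, has a header column literally named "redundancy", and has the bin identifier in the first column; check your file's header before relying on it):
```
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == "redundancy") r = i}
            NR>1 && $r <= 1 {print $1}' [vRhyme_summary].tsv > low_contam_bins.txt
```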
To filter the bin results based on sequence length you can run the `sequence_lengths_of_bins.py` helper script in vRhyme. Then open the .csv file and follow the same instructions as above, except with the length cutoff you are looking for. This will be your "good_bin_list.txt" file.
*Side note: I messed up the anvi-reformat before binning. I ended up writing a script to rename the low-contamination bin contigs to the anvi-reformatted names using the report file from anvi-script-reformat-fasta. That script is there, and if you come back to use the lower-quality bins you will need to reformat those as well.*
## Step 6 - Cluster Viral Contigs with Bins
**The initial section will be presuming that you binned your sequences.**
The first step is to remove the sequences that have been binned from the fasta file of viral-identified contigs, using the vRhyme helper script `extract_unbinned_sequences.py`. You can generate the membership file it needs by running a loop like this:
`while read b; do awk -v pat=$b '$2 == pat' vRhyme_results_all_vir_id_dedup_anvi/vRhyme_best_bins.7.membership.tsv >> good_bin_membership.tsv; done < good_bin_list.txt`. Then run the `extract_unbinned_sequences.py` script. Finally, if you want to control the size of the remaining contigs you should do that now. You can use `seqkit seq -m [min_length] unbinned_seqs.fasta > unbinned_trimmed_seqs.fasta` if you have SeqKit installed.
Next you need to link each bin's fasta file so that the bin appears as one sequence. This can be done with the vRhyme helper script `link_bin_sequences.py`. It adds a long run of Ns between each sequence, so the result does not represent a real genome, but it does allow programs that require one sequence per genome, such as CD-HIT and CheckV, to use the information within each bin.
Next we need to merge the unbinned contigs and the linked binned contigs.
`cat [unbinned_trimmed].fasta [linked_dir]/*.linked.fasta > contigs_4_clustering.fasta`
**If you did not bin, start here with the fasta file from the remove_duplicated_seqs.R script.**
Now we can cluster the bins and other sequences:
`cd-hit-est -i contigs_4_clustering.fasta -o votus.fasta -c 0.95 -aS 0.85 -M 0`
If CD-HIT fails because your sequences are too long or too numerous, you can follow the instructions here [https://github.com/snayfach/MGV/blob/master/ani_cluster/README.md](https://github.com/snayfach/MGV/blob/master/ani_cluster/README.md). Download their code (upload it to the server if needed), make sure you are on a system with Python 2 or create a conda environment with Python 2, and load or install Biopython and BLAST+; the rest should work as intended. Then run this familiar-looking loop to make your representative sequence file: ```for c in `awk '{print $1}' clusters.tsv`; do grep -A1 -e "${c}$" -e "${c}_" [output_from_remove_duplicated_seqs.R].fna >> rep_seqs_for_clusters.fna; done```.
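One way to set up the Python 2 environment mentioned above, assuming conda is available and the bioconda and conda-forge channels can be used:
```
conda create -n mgv_ani -c bioconda -c conda-forge python=2.7 biopython blast
conda activate mgv_ani
```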
If you binned, we now need to prepare the fasta files for downstream analysis. Otherwise, we are done with clustering; however, in order to complete the abundance information we will need to go back to Step 4 and do the mapping.
The first step is to get a list of the bins that are in the clustering fasta output (these are the bins that ended up as representative sequences).
`grep ">vR" votus.fasta | sed 's/>//' > bins_in_clus.txt`
Then we copy the unlinked bin files that were representative sequences to a new folder. If your number of low-contamination bins and clustered bins is the same, you can skip this and use the low_contam_bins directory in place of the bins_in_votus directory in subsequent steps.
`mkdir -p bins_in_votus; while read b; do cp low_contam_bins/${b}.fasta bins_in_votus/; done < bins_in_clus.txt`
Next we create a new file containing all of the representative sequences with the unlinked bin sequences.
`cat bins_in_votus/*.fasta > votus_unlink.fasta`
`grep ">c" votus.fasta -A1 | grep -v "^--$" >> votus_unlink.fasta`
We need to do the same thing with the total sequences placed into the clustering step.
`cat low_contam_bins/*.fasta > contig4clus_unlink.fasta`
`grep ">c" contigs_4_clustering.fasta -A1 | grep -v "^--$" >> contig4clus_unlink.fasta`
I wrote an R script named "reformat_cdhit_out.R" to turn the .clstr file and its information into a more readable format; it will also filter your sequences/bins to those above 5 kb.
## Step 7 - Estimate Abundance of Viral Sequences
The first thing you will need to do is prepare your files for CheckM. This means we need a folder with all of the bins and all of the individual sequences that were clustered. Run this:
`grep ">c" -A1 all_votu_5k.fasta | grep -v "^--$" | awk '{
if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fasta")}
print $0 >> filename
close(filename)
}'`
This seems to add a weird ending to the file names; use this to fix the endings: `for f in *.fasta-1; do e=${f%%-*}; mv ${f} ${e}; done`
You will need to move these files into the low_contam_bins/ directory, or copy that directory under a new name and then add these files to it.
We then need to fix the contig names in the bins since vRhyme adds some extra information and CheckM won't recognize our contigs:
`sed -i 's+vRhyme.*__++g' low_contam_bins/vRhyme_bin_*.fasta`
Then run:
`checkm coverage -r -m 0 -e 1 -t [threads] -x [file_ending] [bin_folder]/ [out_file].tsv [bam_folder]/*.bam`.
I chose the -r and -m 0 options because these cover reads CheckM would otherwise throw out for not being properly paired (-r) or for having too low a Phred score (-m). I also included -e for the edit distance so that CheckM will include all of the reads in the mapping files. These issues were already dealt with during QC, and you should have created your .bam files using QC'ed reads.
Finally, run: `checkm profile [checkm_cov_out].tsv --tab_table -q > [out_file].tsv`
For more information on the calculations made in each column see CheckM's documentation.