Metagenomic assembly pipeline

# Metagenomic assembly pipeline First note in HackMD. Let see if this space is useful or just yet another forgetable thing that will accumulate digital dust. This note is designed to set up a Google environement to perform metagenomic co Assembly on mouse stool sample. Each individual mouse was sample multiple point in time allowing to boos the microbial signal. # Google VM environment preparation ```bash sudo apt-get update sudo apt-get install bzip2 libxml2-dev sudo apt-get install git wget tar unzip sudo ln -s /usr/bin/python3 /usr/bin/python sudo apt-get -y install default-jdk ``` ## Link google Bucket with data ```bash #Check it works gsutil rsync -n gs://YOURSPACE-bucket1/data/ data/ #Sync! gsutil rsync gs://YOURSPACE-bucket1/data/ data/ ``` ## Install anaconda We use mamba now ```bash curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh" bash Mambaforge-$(uname)-$(uname -m).sh ``` ## Installing tools ```bash mkdir script ``` ### Spades It's a genome assembler design to reconstruct small genomes. Good for curated target assembly. [Paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102) [Github](https://github.com/ablab/spades) ```bash wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz tar -xzf SPAdes-3.15.5-Linux.tar.gz rm SPAdes-3.15.5-Linux.tar.gz # check if works #cd SPAdes-3.15.5-Linux/bin/ mv SPAdes-3.15.5-Linux/ script/spades export PATH=~/script/spades/bin/:$PATH ``` ### Megahit Another assembler, less strict and faster than spades. Good for initial rough assembly [Paper](https://academic.oup.com/bioinformatics/article/31/10/1674/177884) [Github](https://github.com/voutcn/megahit) ```bash conda install -c bioconda megahit ``` ### BBmap BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. I layman terms it is a fast mapper, we have to download manually for sourceforge. Available [here](https://sourceforge.net/projects/bbmap/). [Home page](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) ```bash export PATH=~/script/bbmap/:$PATH ``` ### Quast Quast evaluates genome/metagenome assemblies by computing various metrics. [Github](https://github.com/ablab/quast) ```bash cd script git clone https://github.com/ablab/quast.git cd .. export PATH=~/script/quast/:$PATH ``` ### Binny Binny is a binning tool that produces high-quality metagenome-assembled genomes (MAG) from both contiguous and highly fragmented genomes [Paper](https://academic.oup.com/bib/article/23/6/bbac431/6760137) [Github page](https://github.com/a-h-b/binny) ```bash cd script git clone https://github.com/a-h-b/binny.git cd binny ./binny -i config/config.init.yaml ./binny -l -n "TESTRUN" -r config/config.test.yaml cd ../../ ``` Binny needs a yaml file to run its snakenake based pipeline. So I have edited one adjusted to the IND01 samples in a file called `config_ind01.yaml`. This file will then be the base for creating a yaml file for all subsequend individuals. ### Filtering host genome Download mouse Genome to remove unnecessary read from the library. ```bash mkdir reference cd reference wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz ``` # Raw files preparation I follow the following folder system for my pipeline: ``` IND01/ ├── Binny/ │ ├── bin │ ├── related │ └── files ├── CoAssembly/ │ ├── assembly │ ├── related │ └── files ├── quast_results/ │ ├── Assembly QC │ ├── related │ └── files ├── Raw/ │ ├── All the │ ├── raw read │ └── files ├── Filtered/ │ ├── All the host │ ├── filtered read │ └── files ├── CoAssembly.mapping.tsv ├── *_covhist.txt └── *_covstats.txt ``` Preparation of the folder setup ```bash for i in {01..19}; do mkdir -p IND$i/Raw; mkdir -p IND$i/Filtered; mv ./data/B1_[ABCD]$i\_sequence.fastq.gz IND$i/Raw; done ``` # QC step ## Mapping to mouse In this step we remove the host reads, which help the assembler assembling microbial genomes. Here we ask bbmap to map our R1 and R2 raw read file to the mouse reference genome and output a single interleaved filtered read `fastq.gz` file. ```bash for i in {01..19}; do cd IND$i; for j in {A..D}; do bbmap.sh in=Raw/B1_$j$i\_1_sequence.fastq.gz \ in2=Raw/B1_$j$i\_2_sequence.fastq.gz \ minid=0.95 \ outu=Filtered/B1_$j\_hr.fastq.gz \ ref=reference/GCA_000001635.9_GRCm39_genomic.fna \ covstats=./$j\_covstats.txt \ covhist=./$j\_covhist.txt done ``` # Assembly pipeline The one loop that: 1. Perform a co-assembly with libraries belonging to the same individual 2. QC the co-assembly results 3. Map individual library to the co-assembly result a. Output a table with all the information 4. Perform automated binning of on the co-assembly results ```bash for i in {01..19}; do cd IND$i; echo "CoAssembly Step"; #Co-assembly megahit --12 ./B1_A$i\_hr.fastq.gz,./B1_B$i\_hr.fastq.gz,./B1_C$i\_hr.fastq.gz,./B1_D$i\_hr.fastq.gz -o CoAssembly #Overall stat of the coassembly quast.py CoAssembly/final.contigs.fa -o quast_results/reassembly #Remove descritive stat from Megahit contig names (Next tool doesn't like spaces) cut -f1 -d" " ./CoAssembly/final.contigs.fa > ./CoAssembly/final.contigs.renamed.fa for j in {A..D}; #individual mapping do echo "Mapping Step"; ls B1_$j*; a=$(ls B1_$j*); bbmap.sh ref=CoAssembly/final.contigs.renamed.fa in=$a minid=0.95 covstats=$j\_covstats.txt covhist=$i\_covhist.txt -Xmx40g; done; #Offload to gcloud tar -cjf CoAssembly.IND$i.tar.bz2 CoAssembly/ gcloud storage cp -r [ABCD]_*txt gs://YOURSPACE-bucket1/assembly/IND$i gcloud storage cp -r CoAssembly.IND$i.tar.bz2 gs://YOURSPACE-bucket1/assembly/IND$i cd .. cd CoAssembly rm -rf ./intermediate_contigs/ cd ../ #Merging mapping stats, no header, first column with contig name, then pasting column with individual library values (bbmap keep the same order to reference) cat IND$i/A_covstats.txt | cut -f1,2 >tmpA cat IND$i/B_covstats.txt | tail -n +2 | cut -f2 >tmpB cat IND$i/C_covstats.txt | tail -n +2 | cut -f2 >tmpC cat IND$i/D_covstats.txt | tail -n +2 | cut -f2 >tmpD paste tmpA tmpB tmpC tmpD > IND$i/CoAssembly.mapping.tsv rm tmp[ABCD] sed -i "s/ [^\t]*//g" IND$i/CoAssembly.mapping.tsv gcloud storage cp -r IND$i/CoAssembly_$i.mapping.tsv gs://YOURSPACE-bucket1/assembly/IND$i/CoAssembly_$i.mapping.tsv #Edit Yaml file for binny sed "s/IND01/IND$i/g" binny.config.yaml > config_ind$i.yaml ../script/binny/binny -l -n "IND$i" -t 8 -r config_ind$i.yaml cd binny_output rm -rf intermediary/ cd ../ gcloud storage cp -r ./IND$i/binny_output/ gs://YOURSPACE-bucket1/assembly/IND$i/binny done ``` ###### tags: `tutorials` `Metagenomic` `Assembly` `Mapping` `Mini` `pipeline`