# Metagenomic assembly pipeline First note in HackMD. Let see if this space is useful or just yet another forgetable thing that will accumulate digital dust. This note is designed to set up a Google environement to perform metagenomic co Assembly on mouse stool sample. Each individual mouse was sample multiple point in time allowing to boos the microbial signal. # Google VM environment preparation ```{bash, eval=FALSE} sudo apt-get update sudo apt-get install bzip2 libxml2-dev sudo apt-get install git wget tar unzip sudo ln -s /usr/bin/python3 /usr/bin/python sudo apt-get -y install default-jdk ``` ## Link google Bucket with data The `data` folder need to exist first ```{bash, eval=FALSE} #Check it works gsutil rsync -n gs://YOURSPACE-bucket1/data/ data/ #Sync! gsutil rsync gs://YOURSPACE-bucket1/data/ data/ ``` ## Install anaconda Should move to mamba ```{bash, eval=FALSE} wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh bash Anaconda3-2022.10-Linux-x86_64.sh rm Anaconda3-2022.10-Linux-x86_64.sh #Initialized it ``` ## Installing tools ```{bash, eval=FALSE} mkdir script ``` ### Spades It's a genome assembler design to reconstruct small genomes. Good for curated target assembly. [Paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102) [Github](https://github.com/ablab/spades) ```{bash, eval=FALSE} wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz tar -xzf SPAdes-3.15.5-Linux.tar.gz rm SPAdes-3.15.5-Linux.tar.gz # check if works #cd SPAdes-3.15.5-Linux/bin/ mv SPAdes-3.15.5-Linux/ script/spades export PATH=~/script/spades/bin/:$PATH ``` ### Megahit Another assembler, less strict and faster than spades. Good for initial rough assembly [Paper](https://academic.oup.com/bioinformatics/article/31/10/1674/177884) [Github](https://github.com/voutcn/megahit) ```{bash, eval=FALSE} conda install -c bioconda megahit ``` ### BBmap BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. I layman terms it is a fast mapper, we have to download manually for sourceforge. Available [here](https://sourceforge.net/projects/bbmap/). [Home page](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) ```{bash, eval=FALSE} export PATH=~/script/bbmap/:$PATH ``` ### Quast Quast evaluates genome/metagenome assemblies by computing various metrics. [Github](https://github.com/ablab/quast) ```{bash, eval=FALSE} cd script git clone https://github.com/ablab/quast.git cd .. export PATH=~/script/quast/:$PATH ``` ### Binny Binny is a binning tool that produces high-quality metagenome-assembled genomes (MAG) from both contiguous and highly fragmented genomes [Paper](https://academic.oup.com/bib/article/23/6/bbac431/6760137) [Github page](https://github.com/a-h-b/binny) ```{bash, eval=FALSE} cd script git clone https://github.com/a-h-b/binny.git cd binny ./binny -i config/config.init.yaml ./binny -l -n "TESTRUN" -r config/config.test.yaml cd ../../ ``` Binny needs a yaml file to run its snakenake based pipeline. So I have edited one adjusted to the IND01 samples in a file called `config_ind01.yaml`. This file will then be the base for creating a yaml file for all subsequend individuals. ### Filtering host genome Download mouse Genome to remove unnecessary read from the library. ```{bash, eval=FALSE} mkdir reference cd reference wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz ``` # Raw files preparation I follow the following folder system for my pipeline: ``` IND01/ ├── Binny/ │ ├── bin │ ├── related │ └── files ├── CoAssembly/ │ ├── assembly │ ├── related │ └── files ├── quast_results/ │ ├── Assembly QC │ ├── related │ └── files ├── Raw/ │ ├── All the │ ├── raw read │ └── files ├── Filtered/ │ ├── All the host │ ├── filtered read │ └── files ├── CoAssembly.mapping.tsv ├── *_covhist.txt └── *_covstats.txt ``` Preparation of the folder setup ```{bash, eval=FALSE} for i in {01..19}; do mkdir -p IND$i/Raw; mkdir -p IND$i/Filtered; mv ./data/B1_[ABCD]$i\_sequence.fastq.gz IND$i/Raw; done ``` # QC step ## Mapping to mouse In this step we remove the host reads, which help the assembler assembling microbial genomes. Here we ask bbmap to map our R1 and R2 raw read file to the mouse reference genome and output a single interleaved filtered read `fastq.gz` file. ```{bash, eval=FALSE} for i in {01..19}; do cd IND$i; for j in {A..D}; do bbmap.sh in=Raw/B1_$j$i\_1_sequence.fastq.gz \ in2=Raw/B1_$j$i\_2_sequence.fastq.gz \ minid=0.95 \ outu=Filtered/B1_$j\_hr.fastq.gz \ ref=reference/GCA_000001635.9_GRCm39_genomic.fna \ covstats=./$j\_covstats.txt \ covhist=./$j\_covhist.txt done ``` # Assembly pipeline The one loop that: 1. Perform a co-assembly with libraries belonging to the same individual 2. QC the co-assembly results 3. Map individual library to the co-assembly result a. Output a table with all the information 4. Perform automated binning of on the co-assembly results ```{bash, eval=FALSE} for i in {02..19}; do cd IND$i; #Co-assembly megahit --12 Filtered/B1_A$i\_hr.fastq.gz,Filtered/B1_B$i\_hr.fastq.gz,Filtered/B1_C$i\_hr.fastq.gz,Filtered/B1_D$i\_hr.fastq.gz -o CoAssembly #Overall stat of the coassembly quast.py CoAssembly/final.contigs.fa -o quast_results/reassembly for j in {A..D}; #individual mapping do echo "starting mapping"; ls B1_$j*; a=$(ls B1_$j*); bbmap.sh ref=CoAssembly/final.contigs.fa in=$a minid=0.95 covstats=$j\_covstats.txt covhist=$i\_covhist.txt -Xmx40g; done; #Offload to gcloud gcloud storage cp -r [ABCD]_*txt gs://YOURSPACE-bucket1/assembly/IND$i cd .. #Merging mapping stats, no header, first column with contig name, then pasting column with individual library values (bbmap keep the same order to reference) cat IND$i/A_covstats.txt | cut -f1,2 >tmpA cat IND$i/B_covstats.txt | tail -n +2 | cut -f2 >tmpB cat IND$i/C_covstats.txt | tail -n +2 | cut -f2 >tmpC cat IND$i/D_covstats.txt | tail -n +2 | cut -f2 >tmpD paste tmpA tmpB tmpC tmpD > IND$i/CoAssembly.mapping.tsv rm tmp[ABCD] sed -i "s/ [^\t]*//g" IND$i/CoAssembly.mapping.tsv #Remove descritive stat from Megahit contig names (Next tool doesn't like spaces) cut -f1 -d" " IND$i/CoAssembly/final.contigs.fa > IND$i/CoAssembly/final.contigs.renamed.fa #Edit Yaml file for binny sed "s/IND01/IND$i/g" config_ind01.yaml > config_ind$i.yaml ../script/binny/binny -l -n "IND$i" -t 16 -r config_ind$i.yaml gcloud storage cp -r ./IND$i/binny_output/ gs://YOURSPACE-bucket1/assembly/IND$i/binny done ``` ###### tags: `tutorials` `Metagenomic` `Assembly` `Mapping` `Mini` `pipeline`