# Metagenomic assembly pipeline
First note in HackMD. Let see if this space is useful or just yet another forgetable thing that will accumulate digital dust.
This note is designed to set up a Google environement to perform metagenomic co Assembly on mouse stool sample. Each individual mouse was sample multiple point in time allowing to boos the microbial signal.
# Google VM environment preparation
```{bash, eval=FALSE}
sudo apt-get update
sudo apt-get install bzip2 libxml2-dev
sudo apt-get install git wget tar unzip
sudo ln -s /usr/bin/python3 /usr/bin/python
sudo apt-get -y install default-jdk
```
## Link google Bucket with data
The `data` folder need to exist first
```{bash, eval=FALSE}
#Check it works
gsutil rsync -n gs://YOURSPACE-bucket1/data/ data/
#Sync!
gsutil rsync gs://YOURSPACE-bucket1/data/ data/
```
## Install anaconda
Should move to mamba
```{bash, eval=FALSE}
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
rm Anaconda3-2022.10-Linux-x86_64.sh
#Initialized it
```
## Installing tools
```{bash, eval=FALSE}
mkdir script
```
### Spades
It's a genome assembler design to reconstruct small genomes. Good for curated target assembly.
[Paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102)
[Github](https://github.com/ablab/spades)
```{bash, eval=FALSE}
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
rm SPAdes-3.15.5-Linux.tar.gz
# check if works
#cd SPAdes-3.15.5-Linux/bin/
mv SPAdes-3.15.5-Linux/ script/spades
export PATH=~/script/spades/bin/:$PATH
```
### Megahit
Another assembler, less strict and faster than spades. Good for initial rough assembly
[Paper](https://academic.oup.com/bioinformatics/article/31/10/1674/177884)
[Github](https://github.com/voutcn/megahit)
```{bash, eval=FALSE}
conda install -c bioconda megahit
```
### BBmap
BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. I layman terms it is a fast mapper, we have to download manually for sourceforge. Available [here](https://sourceforge.net/projects/bbmap/).
[Home page](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/)
```{bash, eval=FALSE}
export PATH=~/script/bbmap/:$PATH
```
### Quast
Quast evaluates genome/metagenome assemblies by computing various metrics.
[Github](https://github.com/ablab/quast)
```{bash, eval=FALSE}
cd script
git clone https://github.com/ablab/quast.git
cd ..
export PATH=~/script/quast/:$PATH
```
### Binny
Binny is a binning tool that produces high-quality metagenome-assembled genomes (MAG) from both contiguous and highly fragmented genomes
[Paper](https://academic.oup.com/bib/article/23/6/bbac431/6760137)
[Github page](https://github.com/a-h-b/binny)
```{bash, eval=FALSE}
cd script
git clone https://github.com/a-h-b/binny.git
cd binny
./binny -i config/config.init.yaml
./binny -l -n "TESTRUN" -r config/config.test.yaml
cd ../../
```
Binny needs a yaml file to run its snakenake based pipeline. So I have edited one adjusted to the IND01 samples in a file called `config_ind01.yaml`. This file will then be the base for creating a yaml file for all subsequend individuals.
### Filtering host genome
Download mouse Genome to remove unnecessary read from the library.
```{bash, eval=FALSE}
mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz
```
# Raw files preparation
I follow the following folder system for my pipeline:
```
IND01/
├── Binny/
│ ├── bin
│ ├── related
│ └── files
├── CoAssembly/
│ ├── assembly
│ ├── related
│ └── files
├── quast_results/
│ ├── Assembly QC
│ ├── related
│ └── files
├── Raw/
│ ├── All the
│ ├── raw read
│ └── files
├── Filtered/
│ ├── All the host
│ ├── filtered read
│ └── files
├── CoAssembly.mapping.tsv
├── *_covhist.txt
└── *_covstats.txt
```
Preparation of the folder setup
```{bash, eval=FALSE}
for i in {01..19};
do mkdir -p IND$i/Raw;
mkdir -p IND$i/Filtered;
mv ./data/B1_[ABCD]$i\_sequence.fastq.gz IND$i/Raw;
done
```
# QC step
## Mapping to mouse
In this step we remove the host reads, which help the assembler assembling microbial genomes. Here we ask bbmap to map our R1 and R2 raw read file to the mouse reference genome and output a single interleaved filtered read `fastq.gz` file.
```{bash, eval=FALSE}
for i in {01..19};
do cd IND$i;
for j in {A..D};
do bbmap.sh in=Raw/B1_$j$i\_1_sequence.fastq.gz \
in2=Raw/B1_$j$i\_2_sequence.fastq.gz \
minid=0.95 \
outu=Filtered/B1_$j\_hr.fastq.gz \
ref=reference/GCA_000001635.9_GRCm39_genomic.fna \
covstats=./$j\_covstats.txt \
covhist=./$j\_covhist.txt
done
```
# Assembly pipeline
The one loop that:
1. Perform a co-assembly with libraries belonging to the same individual
2. QC the co-assembly results
3. Map individual library to the co-assembly result
a. Output a table with all the information
4. Perform automated binning of on the co-assembly results
```{bash, eval=FALSE}
for i in {02..19};
do cd IND$i;
#Co-assembly
megahit --12 Filtered/B1_A$i\_hr.fastq.gz,Filtered/B1_B$i\_hr.fastq.gz,Filtered/B1_C$i\_hr.fastq.gz,Filtered/B1_D$i\_hr.fastq.gz -o CoAssembly
#Overall stat of the coassembly
quast.py CoAssembly/final.contigs.fa -o quast_results/reassembly
for j in {A..D};
#individual mapping
do echo "starting mapping";
ls B1_$j*;
a=$(ls B1_$j*);
bbmap.sh ref=CoAssembly/final.contigs.fa in=$a minid=0.95 covstats=$j\_covstats.txt covhist=$i\_covhist.txt -Xmx40g;
done;
#Offload to gcloud
gcloud storage cp -r [ABCD]_*txt gs://YOURSPACE-bucket1/assembly/IND$i
cd ..
#Merging mapping stats, no header, first column with contig name, then pasting column with individual library values (bbmap keep the same order to reference)
cat IND$i/A_covstats.txt | cut -f1,2 >tmpA
cat IND$i/B_covstats.txt | tail -n +2 | cut -f2 >tmpB
cat IND$i/C_covstats.txt | tail -n +2 | cut -f2 >tmpC
cat IND$i/D_covstats.txt | tail -n +2 | cut -f2 >tmpD
paste tmpA tmpB tmpC tmpD > IND$i/CoAssembly.mapping.tsv
rm tmp[ABCD]
sed -i "s/ [^\t]*//g" IND$i/CoAssembly.mapping.tsv
#Remove descritive stat from Megahit contig names (Next tool doesn't like spaces)
cut -f1 -d" " IND$i/CoAssembly/final.contigs.fa > IND$i/CoAssembly/final.contigs.renamed.fa
#Edit Yaml file for binny
sed "s/IND01/IND$i/g" config_ind01.yaml > config_ind$i.yaml
../script/binny/binny -l -n "IND$i" -t 16 -r config_ind$i.yaml
gcloud storage cp -r ./IND$i/binny_output/ gs://YOURSPACE-bucket1/assembly/IND$i/binny
done
```
###### tags: `tutorials` `Metagenomic` `Assembly` `Mapping` `Mini` `pipeline`