# Metagenomic assembly pipeline
First note in HackMD. Let see if this space is useful or just yet another forgetable thing that will accumulate digital dust.
This note is designed to set up a Google environement to perform metagenomic co Assembly on mouse stool sample. Each individual mouse was sample multiple point in time allowing to boos the microbial signal.
# Google VM environment preparation
```bash
sudo apt-get update
sudo apt-get install bzip2 libxml2-dev
sudo apt-get install git wget tar unzip
sudo ln -s /usr/bin/python3 /usr/bin/python
sudo apt-get -y install default-jdk
```
## Link google Bucket with data
```bash
#Check it works
gsutil rsync -n gs://YOURSPACE-bucket1/data/ data/
#Sync!
gsutil rsync gs://YOURSPACE-bucket1/data/ data/
```
## Install anaconda
We use mamba now
```bash
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```
## Installing tools
```bash
mkdir script
```
### Spades
It's a genome assembler design to reconstruct small genomes. Good for curated target assembly.
[Paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102)
[Github](https://github.com/ablab/spades)
```bash
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
rm SPAdes-3.15.5-Linux.tar.gz
# check if works
#cd SPAdes-3.15.5-Linux/bin/
mv SPAdes-3.15.5-Linux/ script/spades
export PATH=~/script/spades/bin/:$PATH
```
### Megahit
Another assembler, less strict and faster than spades. Good for initial rough assembly
[Paper](https://academic.oup.com/bioinformatics/article/31/10/1674/177884)
[Github](https://github.com/voutcn/megahit)
```bash
conda install -c bioconda megahit
```
### BBmap
BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. I layman terms it is a fast mapper, we have to download manually for sourceforge. Available [here](https://sourceforge.net/projects/bbmap/).
[Home page](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/)
```bash
export PATH=~/script/bbmap/:$PATH
```
### Quast
Quast evaluates genome/metagenome assemblies by computing various metrics.
[Github](https://github.com/ablab/quast)
```bash
cd script
git clone https://github.com/ablab/quast.git
cd ..
export PATH=~/script/quast/:$PATH
```
### Binny
Binny is a binning tool that produces high-quality metagenome-assembled genomes (MAG) from both contiguous and highly fragmented genomes
[Paper](https://academic.oup.com/bib/article/23/6/bbac431/6760137)
[Github page](https://github.com/a-h-b/binny)
```bash
cd script
git clone https://github.com/a-h-b/binny.git
cd binny
./binny -i config/config.init.yaml
./binny -l -n "TESTRUN" -r config/config.test.yaml
cd ../../
```
Binny needs a yaml file to run its snakenake based pipeline. So I have edited one adjusted to the IND01 samples in a file called `config_ind01.yaml`. This file will then be the base for creating a yaml file for all subsequend individuals.
### Filtering host genome
Download mouse Genome to remove unnecessary read from the library.
```bash
mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz
```
# Raw files preparation
I follow the following folder system for my pipeline:
```
IND01/
├── Binny/
│ ├── bin
│ ├── related
│ └── files
├── CoAssembly/
│ ├── assembly
│ ├── related
│ └── files
├── quast_results/
│ ├── Assembly QC
│ ├── related
│ └── files
├── Raw/
│ ├── All the
│ ├── raw read
│ └── files
├── Filtered/
│ ├── All the host
│ ├── filtered read
│ └── files
├── CoAssembly.mapping.tsv
├── *_covhist.txt
└── *_covstats.txt
```
Preparation of the folder setup
```bash
for i in {01..19};
do mkdir -p IND$i/Raw;
mkdir -p IND$i/Filtered;
mv ./data/B1_[ABCD]$i\_sequence.fastq.gz IND$i/Raw;
done
```
# QC step
## Mapping to mouse
In this step we remove the host reads, which help the assembler assembling microbial genomes. Here we ask bbmap to map our R1 and R2 raw read file to the mouse reference genome and output a single interleaved filtered read `fastq.gz` file.
```bash
for i in {01..19};
do cd IND$i;
for j in {A..D};
do bbmap.sh in=Raw/B1_$j$i\_1_sequence.fastq.gz \
in2=Raw/B1_$j$i\_2_sequence.fastq.gz \
minid=0.95 \
outu=Filtered/B1_$j\_hr.fastq.gz \
ref=reference/GCA_000001635.9_GRCm39_genomic.fna \
covstats=./$j\_covstats.txt \
covhist=./$j\_covhist.txt
done
```
# Assembly pipeline
The one loop that:
1. Perform a co-assembly with libraries belonging to the same individual
2. QC the co-assembly results
3. Map individual library to the co-assembly result
a. Output a table with all the information
4. Perform automated binning of on the co-assembly results
```bash
for i in {01..19};
do cd IND$i;
echo "CoAssembly Step";
#Co-assembly
megahit --12 ./B1_A$i\_hr.fastq.gz,./B1_B$i\_hr.fastq.gz,./B1_C$i\_hr.fastq.gz,./B1_D$i\_hr.fastq.gz -o CoAssembly
#Overall stat of the coassembly
quast.py CoAssembly/final.contigs.fa -o quast_results/reassembly
#Remove descritive stat from Megahit contig names (Next tool doesn't like spaces)
cut -f1 -d" " ./CoAssembly/final.contigs.fa > ./CoAssembly/final.contigs.renamed.fa
for j in {A..D};
#individual mapping
do echo "Mapping Step";
ls B1_$j*;
a=$(ls B1_$j*);
bbmap.sh ref=CoAssembly/final.contigs.renamed.fa in=$a minid=0.95 covstats=$j\_covstats.txt covhist=$i\_covhist.txt -Xmx40g;
done;
#Offload to gcloud
tar -cjf CoAssembly.IND$i.tar.bz2 CoAssembly/
gcloud storage cp -r [ABCD]_*txt gs://YOURSPACE-bucket1/assembly/IND$i
gcloud storage cp -r CoAssembly.IND$i.tar.bz2 gs://YOURSPACE-bucket1/assembly/IND$i
cd ..
cd CoAssembly
rm -rf ./intermediate_contigs/
cd ../
#Merging mapping stats, no header, first column with contig name, then pasting column with individual library values (bbmap keep the same order to reference)
cat IND$i/A_covstats.txt | cut -f1,2 >tmpA
cat IND$i/B_covstats.txt | tail -n +2 | cut -f2 >tmpB
cat IND$i/C_covstats.txt | tail -n +2 | cut -f2 >tmpC
cat IND$i/D_covstats.txt | tail -n +2 | cut -f2 >tmpD
paste tmpA tmpB tmpC tmpD > IND$i/CoAssembly.mapping.tsv
rm tmp[ABCD]
sed -i "s/ [^\t]*//g" IND$i/CoAssembly.mapping.tsv
gcloud storage cp -r IND$i/CoAssembly_$i.mapping.tsv gs://YOURSPACE-bucket1/assembly/IND$i/CoAssembly_$i.mapping.tsv
#Edit Yaml file for binny
sed "s/IND01/IND$i/g" binny.config.yaml > config_ind$i.yaml
../script/binny/binny -l -n "IND$i" -t 8 -r config_ind$i.yaml
cd binny_output
rm -rf intermediary/
cd ../
gcloud storage cp -r ./IND$i/binny_output/ gs://YOURSPACE-bucket1/assembly/IND$i/binny
done
```
###### tags: `tutorials` `Metagenomic` `Assembly` `Mapping` `Mini` `pipeline`