
Metagenomic assembly pipeline

First note in HackMD. Let's see if this space is useful or just yet another forgettable thing that will accumulate digital dust.

This note sets up a Google Cloud environment to perform metagenomic co-assembly on mouse stool samples. Each individual mouse was sampled at multiple time points, which allows boosting of the microbial signal.

Google VM environment preparation

sudo apt-get update
sudo apt-get install bzip2 libxml2-dev
sudo apt-get install git wget tar unzip

sudo ln -s /usr/bin/python3 /usr/bin/python

sudo apt-get -y install default-jdk

The data folder needs to exist first (mkdir -p data).

#Check it works (dry run)
gsutil rsync -n gs://YOURSPACE-bucket1/data/ data/
#Sync!
gsutil rsync gs://YOURSPACE-bucket1/data/ data/
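To double-check the transfer went through, a quick comparison of the bucket and the local folder does the job (same placeholder bucket name as above):

gsutil ls -l gs://YOURSPACE-bucket1/data/ | head
ls -lh data/ | head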

Install anaconda

Should move to mamba

wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
rm Anaconda3-2022.10-Linux-x86_64.sh
#Initialize it (say yes to conda init at the end of the installer)
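As a lighter alternative (sketch only, assuming the Miniforge installer URL is still current), Miniforge ships mamba alongside conda and could replace the Anaconda install above:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
rm Miniforge3-Linux-x86_64.sh
# then mamba works as a drop-in for conda, e.g.
# mamba install -c bioconda megahit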

Installing tools

mkdir script

SPAdes

It's a genome assembler designed to reconstruct small genomes. Good for curated, targeted assemblies.
Paper
Github

wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
rm SPAdes-3.15.5-Linux.tar.gz
# check it works
#cd SPAdes-3.15.5-Linux/bin/
mv SPAdes-3.15.5-Linux/ script/spades

export PATH=~/script/spades/bin/:$PATH
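A minimal usage sketch (file names are placeholders, not from this dataset); the --meta flag switches SPAdes to its metagenome mode:

# assemble one paired-end library (placeholder file names)
spades.py --meta \
          -1 sample_R1.fastq.gz \
          -2 sample_R2.fastq.gz \
          -o spades_out \
          -t 16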

Megahit

Another assembler, less strict and faster than SPAdes. Good for an initial rough assembly.
Paper
Github

conda install -c bioconda megahit
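A minimal usage sketch (placeholder file name); --12 takes interleaved paired-end reads, which is the input form used in the co-assembly loop further down:

megahit --12 sample_interleaved.fastq.gz -o megahit_out -t 16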

BBMap

BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. In layman's terms, it is a fast mapper. It has to be downloaded manually from SourceForge, available here.
Home page
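A possible way to grab it non-interactively (sketch only, assuming the SourceForge "latest" link still points at the current release and that the tarball unpacks into a bbmap/ directory):

wget -O BBMap.tar.gz "https://sourceforge.net/projects/bbmap/files/latest/download"
tar -xzf BBMap.tar.gz
rm BBMap.tar.gz
mv bbmap/ script/bbmap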

export PATH=~/script/bbmap/:$PATH

Quast

Quast evaluates genome/metagenome assemblies by computing various metrics.

Github

cd script
git clone https://github.com/ablab/quast.git
cd ..
export PATH=~/script/quast/:$PATH
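To confirm the clone actually runs, QUAST ships a small test dataset (assuming the repository still provides the --test option):

quast.py --test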

Binny

Binny is a binning tool that produces high-quality metagenome-assembled genomes (MAGs) from both contiguous and highly fragmented genomes.
Paper
Github page

cd script
git clone https://github.com/a-h-b/binny.git
cd binny
./binny -i config/config.init.yaml 
./binny -l -n "TESTRUN" -r config/config.test.yaml

cd ../../

Binny needs a YAML file to run its Snakemake-based pipeline, so I edited one adjusted to the IND01 samples in a file called config_ind01.yaml. This file will then be the base for creating a YAML file for each subsequent individual.

Filtering host genome

Download the mouse genome to remove host reads from the libraries.

mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz
# uncompress so the .fna path used by bbmap below exists
gunzip GCA_000001635.9_GRCm39_genomic.fna.gz
cd ..

Raw files preparation

I use the following folder structure for my pipeline:

IND01/
├── Binny/                   # bin-related files
├── CoAssembly/              # assembly-related files
├── quast_results/           # assembly QC files
├── Raw/                     # all the raw read files
├── Filtered/                # all the host-filtered read files
├── CoAssembly.mapping.tsv
├── *_covhist.txt
└── *_covstats.txt

Preparation of the folder setup

for i in {01..19};
  do mkdir -p IND$i/Raw;
  mkdir -p IND$i/Filtered;
  mv ./data/B1_[ABCD]$i\_*_sequence.fastq.gz IND$i/Raw;
done

QC step

Mapping to mouse

In this step we remove the host reads, which helps the assembler assemble microbial genomes. Here we ask BBMap to map our R1 and R2 raw read files to the mouse reference genome and to output a single interleaved, host-filtered fastq.gz file.

for i in {01..19};
  do cd IND$i;
  for j in {A..D};
    do bbmap.sh in=Raw/B1_$j$i\_1_sequence.fastq.gz \
                in2=Raw/B1_$j$i\_2_sequence.fastq.gz \
                minid=0.95 \
                outu=Filtered/B1_$j$i\_hr.fastq.gz \
                ref=../reference/GCA_000001635.9_GRCm39_genomic.fna \
                covstats=./$j\_covstats.txt \
                covhist=./$j\_covhist.txt;
  done;
  cd ..;
done
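As a rough sanity check of the host filtering (illustrative only, shown for IND01), the number of read records left in each interleaved file can be counted:

for f in IND01/Filtered/*_hr.fastq.gz; do
  n=$(zcat "$f" | wc -l)
  echo "$f : $((n / 4)) reads"
done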

Assembly pipeline

The one loop that:

  1. Performs a co-assembly with the libraries belonging to the same individual
  2. QCs the co-assembly results
  3. Maps each individual library to the co-assembly result
    a. Outputs a table with all the information
  4. Performs automated binning on the co-assembly results

for i in {02..19};
  do cd IND$i;
  #Co-assembly
  megahit --12 Filtered/B1_A$i\_hr.fastq.gz,Filtered/B1_B$i\_hr.fastq.gz,Filtered/B1_C$i\_hr.fastq.gz,Filtered/B1_D$i\_hr.fastq.gz -o CoAssembly
  #Overall stat of the coassembly
  quast.py CoAssembly/final.contigs.fa -o quast_results/reassembly
  for j in {A..D};
    #individual mapping
    do echo "starting mapping";
    ls Filtered/B1_$j$i*;
    a=$(ls Filtered/B1_$j$i*);
    bbmap.sh ref=CoAssembly/final.contigs.fa in=$a minid=0.95 covstats=$j\_covstats.txt covhist=$j\_covhist.txt -Xmx40g;
  done;
  #Offload to gcloud
  gcloud storage cp -r [ABCD]_*txt gs://YOURSPACE-bucket1/assembly/IND$i
  cd ..
  #Merging mapping stats: contig-name column first, then pasting the coverage column from each library (bbmap keeps the same contig order as the reference)
  cat IND$i/A_covstats.txt | cut -f1,2 >tmpA
  cat IND$i/B_covstats.txt | tail -n +2 | cut -f2 >tmpB
  cat IND$i/C_covstats.txt | tail -n +2 | cut -f2 >tmpC
  cat IND$i/D_covstats.txt | tail -n +2 | cut -f2 >tmpD
  paste tmpA tmpB tmpC tmpD > IND$i/CoAssembly.mapping.tsv
  rm tmp[ABCD]

  sed -i "s/ [^\t]*//g" IND$i/CoAssembly.mapping.tsv
  #Remove descriptive stats from the MEGAHIT contig names (the next tool doesn't like spaces)
  cut -f1 -d" " IND$i/CoAssembly/final.contigs.fa  > IND$i/CoAssembly/final.contigs.renamed.fa

  #Edit Yaml file for binny
  sed "s/IND01/IND$i/g" config_ind01.yaml > config_ind$i.yaml

  ~/script/binny/binny -l -n "IND$i" -t 16 -r config_ind$i.yaml

  gcloud storage cp -r ./IND$i/binny_output/ gs://YOURSPACE-bucket1/assembly/IND$i/binny
done
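Once a loop iteration has finished, the merged coverage table can be eyeballed quickly (illustrative, shown for IND02): contig name first, then one average-coverage column per library A to D.

head -n 3 IND02/CoAssembly.mapping.tsv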
tags: tutorials Metagenomic Assembly Mapping Mini pipeline