## Binning using sourmash
[toc]
### Why: Can we skip read mapping by using mgmanysearch?
- Instead of mapping reads, can we supply the binners with output from mgmanysearch (using average_abund and std_abund), i.e. the average k-mer abundance and its standard deviation per contig in each read set, rather than sequencing depth measured through read mapping?
- Read mapping is computationally intensive and takes time. A sourmash-based approach could cut down on computation cost and time, and would let us bin larger datasets instead of doing single-sample binning.
- This will only work if the sourmash-based results give approximately the same number of bins as a read-mapping based approach would.
**Compare sourmash-based 'read depth' bins with actual read-depth bins**
- Use VAMB and METABAT2.
- For the sake of getting it working, stick with METABAT2 (outperforms VAMB)
- Both VAMB and METABAT2 (can) use the output file from jgi_summarize_bam_contig_depths, which is included in METABAT2
Map reads to contigs using bowtie2, to compare against the sourmash results. Convert the sam files from mapping to bam files, then use [jgi_summarize_bam_contig_depths](https://bitbucket.org/berkeleylab/metabat/src/master/src/jgi_summarize_bam_contig_depths.cpp) to calculate per-contig read depth across all samples and output a tsv. Both VAMB and METABAT2 take this tsv for bin creation.
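A minimal sketch of that read-mapping route (sample names, paths, and thread counts are placeholders; assumes bowtie2, samtools, and MetaBAT2's jgi_summarize_bam_contig_depths are installed):
```
# build a bowtie2 index from the concatenated contigs
bowtie2-build contigs.fasta contigs_idx
# map each sample and sort straight to bam
for r1 in *_R1.fastq.gz
do
    sample=${r1%_R1.fastq.gz}
    bowtie2 -x contigs_idx -1 $r1 -2 ${sample}_R2.fastq.gz -p 20 | \
        samtools sort -@ 4 -o ${sample}.sorted.bam
done
# summarize per-contig depth across all bams into one tsv
jgi_summarize_bam_contig_depths --outputDepth cov_depth.txt *.sorted.bam
```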
Using sourmash: sketch the reads, sketch the contigs, and run fastmultigather for all reads x all contigs. Then run mgmanysearch with the fastmultigather results as a picklist, and use python to recreate the jgi_summarize_bam_contig_depths output format.
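For reference, the jgi_summarize_bam_contig_depths table that both binners read has one row per contig and a header like the one below (tab-separated in the real file), with one depth column and one variance column per bam; the sample names are placeholders. This is the shape the python conversion has to reproduce, with the k-mer abundance values standing in for the per-sample depths:
```
contigName  contigLen  totalAvgDepth  sample1.sorted.bam  sample1.sorted.bam-var  sample2.sorted.bam  sample2.sorted.bam-var
```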
### Problems with using kmers compared to traditional (read-mapping based) binning:
So far the sourmash kmer results don't work as well as read mapping. Maybe a different scaled value? A different k-size? Read the Fairy paper and get some ideas.
- Metabat2 outperforms VAMB in all instances.
- Higher k-size (51) performs best
- minimum intersect hashes of 3, 5, and 7 perform equally well; 10 performs worse.
- scaled is 100
**Compared to Fairy**
Let's see what they did:
- scaled = 50
- min intersect hashes = 8
- containment ANI of 95% (that would be match_containment_ani)
- what k-size???
### Running sourmash with the purpose of making a binning file:
- Sketch everything first
- Output file with contig lengths ([script](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/scripts/get-contig-lengths.py))
```
# Everything is in: /group/ctbrowngrp2/scratch/annie/2024-binning-kmer
# Manysketch reads
# make a read file
echo name,read1,read2 > manysketch.csv
for i in *_R1.fastq.gz
do
echo ${i%_QC*},$i,${i%_R1*}_R2.fastq.gz
done >> manysketch.csv
#sketch from readfile
mamba activate branchwater
sourmash scripts manysketch manysketch.csv \
-p k=21,k=31,k=51,scaled=50,abund -c 40 \
-o ../all_reads_s50.zip
# cat all contigs to a results file (n=349,706)
cat atlas_*/*/*_contigs.fasta > ../../results/sourmash_sketches/ERR11351.fasta
mamba activate branchwater
sourmash sketch dna -p k=21,k=31,k=51,scaled=50 \
--singleton ERR11351.fasta -o ERR11351_s50.zip
# For the contig file: Write csv with contig lengths
python ../workflow/scripts/get-contig-lengths.py ../resources/ERR11351.fasta depth_files/contig_lengths.csv
```
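As an alternative to the linked python script, the contig lengths can also be pulled from a fasta index (a sketch; note the downstream conversion script may expect a header row and specific column names, so check that first):
```
# samtools faidx writes contig name and length as the first two columns of the .fai
samtools faidx ../resources/ERR11351.fasta
cut -f1,2 ../resources/ERR11351.fasta.fai | tr '\t' ',' > depth_files/contig_lengths.csv
```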
Now run manysearch:
Use the branchwater version that also outputs abundances:
This one: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/302
```
# Sorry for the crappy code
mamba activate branchwater-abund
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 31 --scaled 50 -o ERR11351.k31_s50.csv -c 100 -t 0 &> manysearch.k31_s50.log && \
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 21 --scaled 50 -o ERR11351.k21_s50.csv -c 100 -t 0 &> manysearch.k21_s50.log && \
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 51 --scaled 50 -o ERR11351.k51_s50.csv -c 100 -t 0 &> manysearch.k51_s50.log
```
Use the [snakemake](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/Snakefile) workflow to go from manysearch outputs to the contig-depth tsvs:
- Take the manysearch output file
- Set a threshold for the minimum number of intersecting hashes and set the minimum containment ANI at 95% (csvtk)
- Use a python script to go from manysearch output to depth in kmers ([script](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/scripts/manysearch-to-mb-depth.py))
- Create bins (metabat) using this depth file
- Use checkm2 to check bin quality
```
import os.path
# DBCONTIG = os.path.abspath('mgmanysearch/ERR11351.zip'),
MANYSEARCH_OUT, = glob_wildcards('../results/manysearch/scaled_50/{manysearch}.csv')
INTERSECT_HASHES = [3,5,7,8,10]
# include snakefiles
# include: "rules/vamb.smk"
# include: "rules/metabat2.smk"
# final out file
rule all:
input:
expand('../results/checkm/check/checkm_{manysearch}.h{intersect_hash}.mb2.95.done.txt', manysearch=MANYSEARCH_OUT, intersect_hash=INTERSECT_HASHES),
# expand('../results/checkm/checkm_{manysearch}.h{intersect_hash}.vamb.done.txt', manysearch=MANYSEARCH_OUT, intersect_hash=INTERSECT_HASHES),
# rule filter on hashes
rule filter_hash:
input:
manysearch='../results/manysearch/scaled_50/{manysearch}.csv',
output:
hash_sep='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.csv',
ani_tresh='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.95.csv',
conda:
"csvtk"
threads: 1
shell:"""
csvtk filter -f "intersect_hashes>{wildcards.intersect_hash}" \
{input.manysearch} > {output.hash_sep} && \
csvtk filter -f "match_containment_ani>0.95" \
{output.hash_sep} > {output.ani_tresh}
"""
# manysearch-to-mb-depth
rule manysearch_to_depth:
input:
manysearch='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.95.csv',
output:
depthf='../results/depth_files/{manysearch}.h{intersect_hash}.95.depth.csv',
mnslist = '../results/manysearch/all_reads_allcontigs/{manysearch}.h{intersect_hash}.filelist.txt'
conda:
"branchwater-abund"
threads: 1
shell:"""
readlink -f {input.manysearch} > {output.mnslist}
python scripts/manysearch-to-mb-depth.py \
{output.mnslist} --lengths ../results/depth_files/contig_lengths.csv \
-o {output.depthf}
"""
# rule metabat
rule metabat:
input:
depthf='../results/depth_files/{manysearch}.h{intersect_hash}.95.depth.csv',
fasta = '../resources/ERR11351.fasta'
output:
check='../results/metabat2/check/{manysearch}.h{intersect_hash}.95.done.txt',
conda:
"metabat2"
threads: 12
shell:"""
metabat2 -m 1500 \
-i {input.fasta} -a {input.depthf} -t {threads} \
-o ../results/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/{wildcards.manysearch}.h{wildcards.intersect_hash}.95 && \
touch {output.check}
"""
# rule checkm
rule checkm2:
input:
metabat ='../results/metabat2/check/{manysearch}.h{intersect_hash}.95.done.txt'
output:
check='../results/checkm/check/checkm_{manysearch}.h{intersect_hash}.mb2.95.done.txt',
conda:
"checkm2"
threads: 12
shell:"""
checkm2 predict --threads {threads} --force \
--input ../results/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/ \
-x .fa --output-directory ../results/checkm/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/ && \
touch {output.check}
"""
```
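To drive this workflow, something along these lines should work, assuming the named conda environments in the `conda:` directives already exist and the Snakemake version in use accepts pre-existing named environments (the core count is arbitrary):
```
# dry run first to check the DAG
snakemake -n
# then run for real, using the existing conda environments
snakemake --use-conda -c 24
```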
### Creating bins using metabat/VAMB:
- VAMB uses tetranucleotide frequencies (TNF), which it calculates itself from the input fastas. That step is pretty fast, so we probably don't have to change it.
- I don't know what other sequence information METABAT2 uses, but it does ask for the fasta sequences.
```
# creating bins with either vamb or metabat:
# METABAT
metabat -i contigs.fa \
-a cov_depth.txt \
-o metabat_bins -t 20
# VAMB (min size of 50kb as a bin)
vamb --outdir ./vamb \
--fasta contigs.fa \
--jgi ./cov_depth.txt \
--minfasta 50000
```
## First trial using sourmash
- Using various kmer sizes and minimum numbers of intersect hashes
- k = 21, 31, 51
- minimum intersect hashes = 3, 5, 7, 10
- scaled = 100
### METABAT2
Reads tsv is using mapped reads

### VAMB

## Read mapping based results (bins)
Using read mapping only, this is approximately the number of bins we should be aiming for with the sourmash approach.
### Metabat2
- Total of 369 bins, of these 125 (33.8%) are medium quality or better
- CheckM2 results:


### VAMB
- Total of 1,394 bins, of these 118 (8.5%) are medium quality or better
- CheckM2 results:


## Installation notes and other things for command line
**Sourmash**
Tessa changed fastmultigather (fmg) so that it now outputs all columns.
Install the conda env directly from git:
https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github
Branch: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/298
use `--full-results`
**Fairy**
Recent paper about binning using kmers instead of read mapping
https://www.biorxiv.org/content/10.1101/2024.04.23.590803v1
**VAMB: from clusters to bins**
With VAMB, another script needs to be run to create bins from the clusters (see the sketch below). Use the new installation, because the script needs Python 3.9 or higher.
- create clusters using the vamb code
- https://github.com/RasmussenLab/vamb/blob/master/src/create_fasta.py
- https://github.com/RasmussenLab/vamb (see clusters.tsv)
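A rough idea of how that last step might look on the command line; the positional arguments (fasta, clusters.tsv, minimum bin size, output directory) are from memory, so check `python create_fasta.py --help` for the exact interface:
```
# write one fasta per cluster, keeping only bins of at least 50 kb
python create_fasta.py ../resources/ERR11351.fasta vamb/clusters.tsv 50000 vamb_bins/
```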