## Binning using sourmash
[toc]
### Why: Can we skip read mapping by using mgmanysearch?
- Instead of mapping reads, can we supply the binners with output from mgmanysearch (using average_abund and std_abund), i.e. the average k-mer abundance and its standard deviation per contig in each read set, rather than sequencing depth measured through read mapping?
- Read mapping is computationally intensive and takes time. A sourmash-based approach could cut down on computation cost and time, and would let us bin larger datasets instead of doing single-sample binning.
- This will only work if the sourmash-based results give approximately the same number of bins as a read-mapping based approach would.
**Compare sourmash-based 'read depth' bins with actual read-depth bins**
- Use VAMB and METABAT2.
- For the sake of getting it working, stick with METABAT2 (outperforms VAMB)
- Both VAMB and METABAT2 (can) use the output file from jgi_summarize_bam_contig_depths, which is included in METABAT2
Map reads to contigs using bowtie2, to compare against the sourmash results. Convert the sam files from mapping to bam files, then use [jgi_summarize_bam_contig_depths](https://bitbucket.org/berkeleylab/metabat/src/master/src/jgi_summarize_bam_contig_depths.cpp) to calculate per-contig read depth across all samples and output a tsv. Both VAMB and METABAT2 take this tsv for bin creation.
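A minimal sketch of that read-mapping route (sample names, paths, and thread counts are placeholders; assumes bowtie2, samtools, and MetaBAT2's jgi_summarize_bam_contig_depths are installed):
```
# build a bowtie2 index from the concatenated contigs
bowtie2-build contigs.fasta contigs_idx
# map each sample and sort straight to bam
for r1 in *_R1.fastq.gz
do
    sample=${r1%_R1.fastq.gz}
    bowtie2 -x contigs_idx -1 $r1 -2 ${sample}_R2.fastq.gz -p 20 | \
        samtools sort -@ 4 -o ${sample}.sorted.bam
done
# summarize per-contig depth across all bams into one tsv
jgi_summarize_bam_contig_depths --outputDepth cov_depth.txt *.sorted.bam
```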
Using sourmash: sketch the reads, sketch the contigs, and run fastmultigather for all reads x all contigs. Then run mgmanysearch with the fastmultigather results as a picklist, and use python to recreate the jgi_summarize_bam_contig_depths output format.
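For reference, the jgi_summarize_bam_contig_depths table that both binners read has one row per contig and a header like the one below (tab-separated in the real file), with one depth column and one variance column per bam; the sample names are placeholders. This is the shape the python conversion has to reproduce, with the k-mer abundance values standing in for the per-sample depths:
```
contigName  contigLen  totalAvgDepth  sample1.sorted.bam  sample1.sorted.bam-var  sample2.sorted.bam  sample2.sorted.bam-var
```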
### Problems with using kmers compared to traditional (read-mapping based) binning:
So far the sourmash kmer results don't work as well as read mapping. Maybe a different scaled value? A different k-size? Read the Fairy paper and get some ideas.
- Metabat2 outperforms VAMB in all instances.
- Higher k-size (51) performs best
- minimum intersect hashes of 3, 5, and 7 perform equally well; 10 performs worse.
- scaled is 100
**Compared to Fairy**
Let's see what they did:
- scaled = 50
- min intersect hashes = 8
- containment ANI of 95% (that would be match_containment_ani)
- what k-size???
### Running sourmash with the purpose of making a binning file:
- Sketch everything first
- Output file with contig lengths ([script](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/scripts/get-contig-lengths.py))
```
# Everything is in: /group/ctbrowngrp2/scratch/annie/2024-binning-kmer
# Manysketch reads
# make a read file
echo name,read1,read2 > manysketch.csv
for i in *_R1.fastq.gz
do
echo ${i%_QC*},$i,${i%_R1*}_R2.fastq.gz
done >> manysketch.csv
#sketch from readfile
mamba activate branchwater
sourmash scripts manysketch manysketch.csv \
-p k=21,k=31,k=51,scaled=50,abund -c 40 \
-o ../all_reads_s50.zip
# cat all contigs to a results file (n=349,706)
cat atlas_*/*/*_contigs.fasta > ../../results/sourmash_sketches/ERR11351.fasta
mamba activate branchwater
sourmash sketch dna -p k=21,k=31,k=51,scaled=50 \
--singleton ERR11351.fasta -o ERR11351_s50.zip
# For the contig file: Write csv with contig lengths
python ../workflow/scripts/get-contig-lengths.py ../resources/ERR11351.fasta depth_files/contig_lengths.csv
```
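As an alternative to the linked python script, the contig lengths can also be pulled from a fasta index (a sketch; note the downstream conversion script may expect a header row and specific column names, so check that first):
```
# samtools faidx writes contig name and length as the first two columns of the .fai
samtools faidx ../resources/ERR11351.fasta
cut -f1,2 ../resources/ERR11351.fasta.fai | tr '\t' ',' > depth_files/contig_lengths.csv
```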
Now run manysearch:
Use the branchwater version that also outputs abundances:
This one: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/302
```
# Sorry for the crappy code
mamba activate branchwater-abund
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 31 --scaled 50 -o ERR11351.k31_s50.csv -c 100 -t 0 &> manysearch.k31_s50.log && \
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 21 --scaled 50 -o ERR11351.k21_s50.csv -c 100 -t 0 &> manysearch.k21_s50.log && \
/usr/bin/time -v sourmash scripts manysearch \
../sourmash_sketches/all_reads_s50.zip \
../sourmash_sketches/ERR11351_s50.zip \
-k 51 --scaled 50 -o ERR11351.k51_s50.csv -c 100 -t 0 &> manysearch.k51_s50.log
```
Use the [snakemake](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/Snakefile) workflow to go from manysearch outputs to the contig-depth tsvs:
- Take the manysearch output file
- Set a threshold for the minimum number of intersecting hashes and set the minimum containment ANI at 95% (csvtk)
- Use a python script to go from manysearch output to depth in kmers ([script](https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/scripts/manysearch-to-mb-depth.py))
- Create bins (metabat) using this depth file
- Use checkm2 to check bin quality
```
import os.path
# DBCONTIG = os.path.abspath('mgmanysearch/ERR11351.zip'),
MANYSEARCH_OUT, = glob_wildcards('../results/manysearch/scaled_50/{manysearch}.csv')
INTERSECT_HASHES = [3,5,7,8,10]
# include snakefiles
# include: "rules/vamb.smk"
# include: "rules/metabat2.smk"
# final out file
rule all:
input:
expand('../results/checkm/check/checkm_{manysearch}.h{intersect_hash}.mb2.95.done.txt', manysearch=MANYSEARCH_OUT, intersect_hash=INTERSECT_HASHES),
# expand('../results/checkm/checkm_{manysearch}.h{intersect_hash}.vamb.done.txt', manysearch=MANYSEARCH_OUT, intersect_hash=INTERSECT_HASHES),
# rule filter on hashes
rule filter_hash:
input:
manysearch='../results/manysearch/scaled_50/{manysearch}.csv',
output:
hash_sep='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.csv',
ani_tresh='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.95.csv',
conda:
"csvtk"
threads: 1
shell:"""
csvtk filter -f "intersect_hashes>{wildcards.intersect_hash}" \
{input.manysearch} > {output.hash_sep} && \
csvtk filter -f "match_containment_ani>0.95" \
{output.hash_sep} > {output.ani_tresh}
"""
# manysearch-to-mb-depth
rule manysearch_to_depth:
input:
manysearch='../results/manysearch/hash_tresh/{manysearch}.h{intersect_hash}.95.csv',
output:
depthf='../results/depth_files/{manysearch}.h{intersect_hash}.95.depth.csv',
mnslist = '../results/manysearch/all_reads_allcontigs/{manysearch}.h{intersect_hash}.filelist.txt'
conda:
"branchwater-abund"
threads: 1
shell:"""
readlink -f {input.manysearch} > {output.mnslist}
python scripts/manysearch-to-mb-depth.py \
{output.mnslist} --lengths ../results/depth_files/contig_lengths.csv \
-o {output.depthf}
"""
# rule metabat
rule metabat:
input:
depthf='../results/depth_files/{manysearch}.h{intersect_hash}.95.depth.csv',
fasta = '../resources/ERR11351.fasta'
output:
check='../results/metabat2/check/{manysearch}.h{intersect_hash}.95.done.txt',
conda:
"metabat2"
threads: 12
shell:"""
metabat2 -m 1500 \
-i {input.fasta} -a {input.depthf} -t {threads} \
-o ../results/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/{wildcards.manysearch}.h{wildcards.intersect_hash}.95 && \
touch {output.check}
"""
# rule checkm
rule checkm2:
input:
metabat ='../results/metabat2/check/{manysearch}.h{intersect_hash}.95.done.txt'
output:
check='../results/checkm/check/checkm_{manysearch}.h{intersect_hash}.mb2.95.done.txt',
conda:
"checkm2"
threads: 12
shell:"""
checkm2 predict --threads {threads} --force \
--input ../results/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/ \
-x .fa --output-directory ../results/checkm/metabat2/{wildcards.manysearch}.h{wildcards.intersect_hash}.95/ && \
touch {output.check}
"""
```
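To drive this workflow, something along these lines should work, assuming the named conda environments in the `conda:` directives already exist and the Snakemake version in use accepts pre-existing named environments (the core count is arbitrary):
```
# dry run first to check the DAG
snakemake -n
# then run for real, using the existing conda environments
snakemake --use-conda -c 24
```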
### Creating bins using metabat/VAMB:
- VAMB uses tetranucleotide frequencies (TNF), which it calculates itself from the input fastas. That step is pretty fast, so we probably don't have to change it.
- I don't know what other sequence information METABAT2 uses, but it does ask for the fasta sequences.
```
# creating bins with either vamb or metabat:
# METABAT
metabat -i contigs.fa \
-a cov_depth.txt \
-o metabat_bins -t 20
# VAMB (min size of 50kb as a bin)
vamb --outdir ./vamb \
--fasta contigs.fa \
--jgi ./cov_depth.txt \
--minfasta 50000
```
## First trial using sourmash
- Using various kmer sizes and minimum numbers of intersect hashes
- k = 21, 31, 51
- minimum intersect hashes = 3, 5, 7, 10
- scaled = 100
### METABAT2
Reads tsv is using mapped reads

### VAMB

## Read mapping based results (bins)
Using read mapping only, this is approximately the number of bins we should be aiming for with the sourmash approach.
### Metabat2
- Total of 369 bins, of these 125 (33.8%) are medium quality or better
- CheckM2 results:


### VAMB
- Total of 1,394 bins, of these 118 (8.5%) are medium quality or better
- CheckM2 results:


## Installation notes and other things for command line
**Sourmash**
Tessa changed fastmultigather (fmg) so that it now outputs all columns.
Install the conda env directly from git:
https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github
Branch: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/298
use `--full-results`
**Fairy**
Recent paper about binning using kmers instead of read mapping
https://www.biorxiv.org/content/10.1101/2024.04.23.590803v1
**VAMB: from clusters to bins**
With VAMB, another script needs to be run to create bins from the clusters (see the sketch below). Use the new installation, because the script needs Python 3.9 or higher.
- create clusters using the vamb code
- https://github.com/RasmussenLab/vamb/blob/master/src/create_fasta.py
- https://github.com/RasmussenLab/vamb (see clusters.tsv)
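A rough idea of how that last step might look on the command line; the positional arguments (fasta, clusters.tsv, minimum bin size, output directory) are from memory, so check `python create_fasta.py --help` for the exact interface:
```
# write one fasta per cluster, keeping only bins of at least 50 kb
python create_fasta.py ../resources/ERR11351.fasta vamb/clusters.tsv 50000 vamb_bins/
```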