# Problem:
- Ran sourmash gather on 462 SRA datasets
- Used 2 different databases for this:
  - gtdb-reps only (hereafter "gtdb")
  - gtdb-reps + MAGs (hereafter "gtdb+MAGs")
For 6 SRA datasets the % assignability is lower when using gtdb+MAGs than when using gtdb alone. This is a problem, because % assignability shouldn't decrease when more sequences are added to the database. At worst, it should stay the same as when running with gtdb alone.
To be clear: these comparisons are between my own gather results against gtdb and against gtdb+MAGs, because your gather results against gtdb and mine differ as well: every % assigned value was different between our runs. I don't know whether that difference is a problem in itself; one possible explanation is that we used different pig metagenome signatures, so the signatures I used are listed below.
## Where are the files:
```
# gather results against gtdb only:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/only_gtdbk-reps-sub
# gather results against gtdb + MAGs:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/MAGs_gtdbk-reps-sub
# Signatures used:
/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz
```
## Commands annie ran and their outputs
I first reran fastgather + gather against gtdb only, using the snakefile you originally wrote for that.
All gather output files and logs are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash
```
# Sourmash for gtdb reps only
rule fastgather_reps:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz"
    output:
        csv = "sourmash/fastgather-reps-sub/{sample}.csv"
    resources:
        # limit to one fastgather with --resources rayon_exclude=1
        rayon_exclude=1,
        mem_mb=7000
    log:
        "logs/sourmash/{sample}_fg-r.log"
    conda:
        "branchwater"
    shell:
        """
        sourmash scripts fastgather {input.sig} \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
            --scaled 10000 -o {output.csv} -c 32
        """
rule sourmash_gather_reps:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz",
        picklist = "sourmash/fastgather-reps-sub/{sample}.csv"
    output:
        csv = "sourmash/only_gtdbk-reps-sub/{sample}.csv"
    log:
        "logs/sourmash/{sample}_gtdbk-r.log"
    resources:
        mem_mb=20000
    conda:
        "sourmash"
    shell:
        """
        sourmash gather {input.sig} \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
            --scaled 10000 -o {output.csv} --picklist {input.picklist}:match_name:ident
        """
```
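For reference, a hypothetical invocation of this snakefile (the exact command line isn't recorded here); `--resources rayon_exclude=1` serializes the fastgather jobs as noted in the rule comments:
```
# hypothetical invocation; adjust -j for available cores
snakemake -s Snakefile -j 32 --use-conda --resources rayon_exclude=1
```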
I then computed signatures for all of the MAGs generated (I did not deduplicate MAGs across samples).
The output signatures are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs
The db cover and the zip with all signatures are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/db-cover
```
# sketch individual signatures per MAG, otherwise 1 sig per contig not per MAG
# Give sketches a name, otherwise gather results have no names
module load parallel
mamba activate branchwater
for f in *.fasta
do
    # emit one sketch command per MAG; parallel executes each emitted line
    echo sourmash sketch dna -p k=21,scaled=1000 $f --name ${f%.fasta} -o ${f%.fasta}.sig
done | parallel -j 32
# Concatenate all signatures into one zip, and build a standalone manifest
sourmash sig cat *.sig -o ../../all-MAGs_21.zip
sourmash sig collect *.sig -o ../../all-MAGs_21.sqlmf
# do magic to make db cover
cd sourmash/db-cover
python /home/ctbrown/scratch/2022-database-covers/make-db-cover.py \
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs/all-MAGs_21.zip \
-o mags.k21.cover.zip --scaled=10000 -k 21
mkdir -p mags.k21.cover.d
cd mags.k21.cover.d
sourmash sig split ../mags.k21.cover.zip -E .sig.gz
cd ..
find $(pwd)/mags.k21.cover.d -type f > list.mags.k21.cover.txt
cat list.mags.k21.cover.txt /home/ctbrown/scratch/2022-database-covers/list.gtdb-rs214.k21.cover.txt > list.mags-gtdb-r214.k21.cover.txt
```
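A couple of sanity checks that could be run at this point (a sketch; paths assumed from above) to confirm that the MAG sketches got names and that the combined cover list contains both sets of entries:
```
# confirm the MAG sketches carry names (these feed gather's name column)
sourmash sig describe /group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs/all-MAGs_21.zip | head -20
# entry counts: the combined list should be the MAG + gtdb lines together
wc -l list.mags.k21.cover.txt list.mags-gtdb-r214.k21.cover.txt
```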
**When creating these results I did not do the fastgather step for gtdb+MAGs.**
It took me a while to figure out how to run fastgather for gtdb+MAGs, because I needed a manifest.
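For the record, one way to build such a manifest (a sketch; the exact command I used may have differed):
```
# build a standalone manifest over the combined cover pathlist
sourmash sig manifest sourmash/db-cover/list.mags-gtdb-r214.k21.cover.txt \
    -o sourmash/db-cover/mags-gtdb-r214.manifest.csv
```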
```
# fastgather MAGS : MAKING A PICKLIST WORKS BUT YOU NEED A MANIFEST FOR THE INPUT DB
rule fastgather_MAG:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz"
    output:
        csv = "sourmash/fastgather_MAG/{sample}.csv"
    resources:
        # limit to one fastgather with --resources rayon_exclude=1
        rayon_exclude=1,
        mem_mb=7000
    log:
        "logs/sourmash/{sample}_fg_mag.log"
    conda:
        "branchwater"
    shell:
        """
        sourmash scripts fastgather {input.sig} \
            sourmash/db-cover/list.mags-gtdb-r214.k21.cover.txt -k 31 \
            --scaled 10000 -o {output.csv} -c 32
        """
rule sourmash_gather_MAG:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz",
        picklist = "sourmash/fastgather_MAG/{sample}.csv"
    output:
        csv = "sourmash/MAGs_gtdbk/{sample}.csv"
    log:
        "logs/sourmash/{sample}_mag.log"
    resources:
        mem_mb=25000
    conda:
        "sourmash"
    shell:
        """
        sourmash gather {input.sig} atlas/MAGs/all-MAGs_21.zip \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip -k 21 \
            --scaled 10000 -o {output.csv}
        """
```
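Note that `rule sourmash_gather_MAG` declares a `picklist` input but the shell command never passes it. If the intent was to mirror the reps pipeline (and to use the reps database, per thought 4 below), the shell stanza would presumably look like this sketch, not what was actually run:
```
sourmash gather {input.sig} atlas/MAGs/all-MAGs_21.zip \
    /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
    --scaled 10000 -o {output.csv} --picklist {input.picklist}:match_name:ident
```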
## Comparison of % assignability for ones that went down
% assigned for the 6 SRAs that had lower assignability when using gtdb+MAGs:

| acc | p_weighted_found_gtdbMAG | p_weighted_found_gtdbreps | impr_perc_p_weighted |
| -------- | -------- | -------- | -------- |
| SRR11945343 | 0.318 | 6.105 | -5.787 |
| SRR21266884 | 7.597 | 9.721 | -2.124 |
| SRR21278121 | 4.460 | 5.860 | -1.400 |
| SRR21276809 | 5.668 | 6.788 | -1.119 |
| SRR21276820 | 14.256 | 15.052 | -0.795 |
| SRR21266760 | 8.233 | 8.704 | -0.471 |
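For reference, these percentages can be recomputed from the gather CSVs by summing the per-match `f_unique_weighted` column and multiplying by 100 (a sketch using `csvtk`, which is also used below; directory names taken from the rule outputs above):
```
acc=SRR11945343
for d in only_gtdbk-reps-sub MAGs_gtdbk; do
    echo "== $d"
    # sum(f_unique_weighted) * 100 = % of the abundance-weighted query assigned
    csvtk summary -f f_unique_weighted:sum sourmash/$d/$acc.csv
done
```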
## Titus thoughts
1. `p_weighted_found` might actually decrease, because gather's greedy selection works on unweighted k-mer overlaps; adding references can change which matches get picked and lower the weighted total (see the check after this list).
2. database covers are weird.
3. in your second (bottom) fastgather/gather command above (`rule fastgather_MAG`) you specify k=31 for a k=21 database?? and that works?!
4. second (bottom) `rule sourmash_gather_MAG` uses all gtdb, not just gtdb reps?
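On point 1, the per-match unweighted vs abundance-weighted fractions can be compared directly in any gather CSV (a sketch; `f_unique_to_query` and `f_unique_weighted` are standard gather output columns):
```
csvtk cut -f name,f_unique_to_query,f_unique_weighted \
    SRR11945343.gather.gtdb-reps.csv | csvtk pretty
```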
### commands
data files in `~ctbrown/2023-annie-debug`
starting from:
```
% ls
SRR11945343.gather.gtdb-reps+MAGs.csv all-MAGs_21.zip
SRR11945343.gather.gtdb-reps.csv concat.csv
SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip
```
make combined gather CSV:
```
csvtk concat SRR11945343.gather.*.csv > concat.csv
```
run against just GTDB:
```
% sourmash gather --picklist concat.csv:md5:md5 SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip --threshold=0
...
the recovered matches hit 5.3% of the abundance-weighted query.
the recovered matches hit 1.6% of the query k-mers (unweighted).
```
run against GTDB+MAGs:
```
% sourmash gather --picklist concat.csv:md5:md5 SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip all-MAGs_21.zip --threshold=0
...
found 7 matches total;
the recovered matches hit 7.0% of the abundance-weighted query.
the recovered matches hit 1.8% of the query k-mers (unweighted).
```
so at least here we do see assignability go up!
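It would be worth repeating this spot-check on the other five accessions; a sketch, assuming their `.subtract.sig.gz` files are copied in alongside:
```
for acc in SRR21266884 SRR21278121 SRR21276809 SRR21276820 SRR21266760; do
    sourmash gather ${acc}.subtract.sig.gz gtdb-rs214-reps.k21.zip \
        --threshold-bp=0 -o ${acc}.gtdb-reps.check.csv
    sourmash gather ${acc}.subtract.sig.gz gtdb-rs214-reps.k21.zip all-MAGs_21.zip \
        --threshold-bp=0 -o ${acc}.gtdb-reps+MAGs.check.csv
done
```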