# Problem:
- Ran sourmash gather on 462 SRA datasets
- Used 2 different databases for this:
  - gtdb-reps only (hereafter "gtdb")
  - gtdb-reps + MAGs (hereafter "gtdb+MAGs")
For 6 SRA datasets the % assignability is lower when using gtdb+MAGs than when using gtdb alone. This is a problem, because % assignability shouldn't decrease when more sequences are added to the database. At worst, it should stay the same as when running with gtdb alone.
To be clear: these comparisons are between my own gather results against gtdb and against gtdb+MAGs, because your gather results against gtdb and mine differ as well: every % assigned value was different between our runs. I don't know whether that difference is a problem in itself; one possible explanation is that we used different pig metagenome signatures, so the signatures I used are listed below.
## Where are the files:
```
# gather results against gtdb only:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/only_gtdbk-reps-sub
# gather results against gtdb + MAGs:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/MAGs_gtdbk-reps-sub
# Signatures used:
/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz
```
## Commands annie ran and their outputs
I first reran fastgather + gather against gtdb only, using the snakefile you originally wrote for that.
All gather output files and logs are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash
```
# Sourmash for gtdb reps only
rule fastgather_reps:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz"
    output:
        csv = "sourmash/fastgather-reps-sub/{sample}.csv"
    resources:
        # limit to one fastgather with --resources rayon_exclude=1
        rayon_exclude=1,
        mem_mb=7000
    log:
        "logs/sourmash/{sample}_fg-r.log"
    conda:
        "branchwater"
    shell:
        """
        sourmash scripts fastgather {input.sig} \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
            --scaled 10000 -o {output.csv} -c 32
        """
rule sourmash_gather_reps:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz",
        picklist = "sourmash/fastgather-reps-sub/{sample}.csv"
    output:
        csv = "sourmash/only_gtdbk-reps-sub/{sample}.csv"
    log:
        "logs/sourmash/{sample}_gtdbk-r.log"
    resources:
        mem_mb=20000
    conda:
        "sourmash"
    shell:
        """
        sourmash gather {input.sig} \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
            --scaled 10000 -o {output.csv} --picklist {input.picklist}:match_name:ident
        """
```
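For reference, a hypothetical invocation of this snakefile (the exact command line isn't recorded here); `--resources rayon_exclude=1` serializes the fastgather jobs as noted in the rule comments:
```
# hypothetical invocation; adjust -j for available cores
snakemake -s Snakefile -j 32 --use-conda --resources rayon_exclude=1
```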
I then computed signatures for all of the MAGs generated (I did not deduplicate MAGs across samples).
The output signatures are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs
The db cover and the zip with all signatures are in:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/db-cover
```
# sketch individual signatures per MAG, otherwise 1 sig per contig not per MAG
# Give sketches a name, otherwise gather results have no names
module load parallel
mamba activate branchwater
for f in *.fasta
do
    # emit one sketch command per MAG; parallel executes each emitted line
    echo sourmash sketch dna -p k=21,scaled=1000 $f --name ${f%.fasta} -o ${f%.fasta}.sig
done | parallel -j 32
# Concatenate all signatures into one zip, and build a standalone manifest
sourmash sig cat *.sig -o ../../all-MAGs_21.zip
sourmash sig collect *.sig -o ../../all-MAGs_21.sqlmf
# do magic to make db cover
cd sourmash/db-cover
python /home/ctbrown/scratch/2022-database-covers/make-db-cover.py \
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs/all-MAGs_21.zip \
-o mags.k21.cover.zip --scaled=10000 -k 21
mkdir -p mags.k21.cover.d
cd mags.k21.cover.d
sourmash sig split ../mags.k21.cover.zip -E .sig.gz
cd ..
find $(pwd)/mags.k21.cover.d -type f > list.mags.k21.cover.txt
cat list.mags.k21.cover.txt /home/ctbrown/scratch/2022-database-covers/list.gtdb-rs214.k21.cover.txt > list.mags-gtdb-r214.k21.cover.txt
```
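A couple of sanity checks that could be run at this point (a sketch; paths assumed from above) to confirm that the MAG sketches got names and that the combined cover list contains both sets of entries:
```
# confirm the MAG sketches carry names (these feed gather's name column)
sourmash sig describe /group/ctbrowngrp2/scratch/annie/2023-swine-sra/atlas/MAGs/all-MAGs_21.zip | head -20
# entry counts: the combined list should be the MAG + gtdb lines together
wc -l list.mags.k21.cover.txt list.mags-gtdb-r214.k21.cover.txt
```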
**When creating these results I did not do the fastgather step for gtdb+MAGs.**
It took me a while to figure out how to run fastgather for gtdb+MAGs, because I needed a manifest.
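For the record, one way to build such a manifest (a sketch; the exact command I used may have differed):
```
# build a standalone manifest over the combined cover pathlist
sourmash sig manifest sourmash/db-cover/list.mags-gtdb-r214.k21.cover.txt \
    -o sourmash/db-cover/mags-gtdb-r214.manifest.csv
```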
```
# fastgather MAGS : MAKING A PICKLIST WORKS BUT YOU NEED A MANIFEST FOR THE INPUT DB
rule fastgather_MAG:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz"
    output:
        csv = "sourmash/fastgather_MAG/{sample}.csv"
    resources:
        # limit to one fastgather with --resources rayon_exclude=1
        rayon_exclude=1,
        mem_mb=7000
    log:
        "logs/sourmash/{sample}_fg_mag.log"
    conda:
        "branchwater"
    shell:
        """
        sourmash scripts fastgather {input.sig} \
            sourmash/db-cover/list.mags-gtdb-r214.k21.cover.txt -k 31 \
            --scaled 10000 -o {output.csv} -c 32
        """
rule sourmash_gather_MAG:
    input:
        sig = "/home/ctbrown/scratch2/2023-swine-usda/outputs.swine-x-reps/{sample}.subtract.sig.gz",
        picklist = "sourmash/fastgather_MAG/{sample}.csv"
    output:
        csv = "sourmash/MAGs_gtdbk/{sample}.csv"
    log:
        "logs/sourmash/{sample}_mag.log"
    resources:
        mem_mb=25000
    conda:
        "sourmash"
    shell:
        """
        sourmash gather {input.sig} atlas/MAGs/all-MAGs_21.zip \
            /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip -k 21 \
            --scaled 10000 -o {output.csv}
        """
```
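Note that `rule sourmash_gather_MAG` declares a `picklist` input but the shell command never passes it. If the intent was to mirror the reps pipeline (and to use the reps database, per thought 4 below), the shell stanza would presumably look like this sketch, not what was actually run:
```
sourmash gather {input.sig} atlas/MAGs/all-MAGs_21.zip \
    /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip -k 21 \
    --scaled 10000 -o {output.csv} --picklist {input.picklist}:match_name:ident
```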
## Comparison of % assignability for ones that went down
% assigned for the 6 SRAs that had lower assignability when using gtdb+MAGs:

| acc | p_weighted_found_gtdbMAG | p_weighted_found_gtdbreps | impr_perc_p_weighted |
| -------- | -------- | -------- | -------- |
| SRR11945343 | 0.318 | 6.105 | -5.787 |
| SRR21266884 | 7.597 | 9.721 | -2.124 |
| SRR21278121 | 4.460 | 5.860 | -1.400 |
| SRR21276809 | 5.668 | 6.788 | -1.119 |
| SRR21276820 | 14.256 | 15.052 | -0.795 |
| SRR21266760 | 8.233 | 8.704 | -0.471 |
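For reference, these percentages can be recomputed from the gather CSVs by summing the per-match `f_unique_weighted` column and multiplying by 100 (a sketch using `csvtk`, which is also used below; directory names taken from the rule outputs above):
```
acc=SRR11945343
for d in only_gtdbk-reps-sub MAGs_gtdbk; do
    echo "== $d"
    # sum(f_unique_weighted) * 100 = % of the abundance-weighted query assigned
    csvtk summary -f f_unique_weighted:sum sourmash/$d/$acc.csv
done
```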
## Titus thoughts
1. `p_weighted_found` might actually decrease, because gather's greedy selection works on unweighted k-mer overlaps; adding references can change which matches get picked and lower the weighted total (see the check after this list).
2. database covers are weird.
3. in your second (bottom) fastgather/gather command above (`rule fastgather_MAG`) you specify k=31 for a k=21 database?? and that works?!
4. second (bottom) `rule sourmash_gather_MAG` uses all gtdb, not just gtdb reps?
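On point 1, the per-match unweighted vs abundance-weighted fractions can be compared directly in any gather CSV (a sketch; `f_unique_to_query` and `f_unique_weighted` are standard gather output columns):
```
csvtk cut -f name,f_unique_to_query,f_unique_weighted \
    SRR11945343.gather.gtdb-reps.csv | csvtk pretty
```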
### commands
data files in `~ctbrown/2023-annie-debug`
starting from:
```
% ls
SRR11945343.gather.gtdb-reps+MAGs.csv all-MAGs_21.zip
SRR11945343.gather.gtdb-reps.csv concat.csv
SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip
```
make combined gather CSV:
```
csvtk concat SRR11945343.gather.*.csv > concat.csv
```
run against just GTDB:
```
% sourmash gather --picklist concat.csv:md5:md5 SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip --threshold=0
...
the recovered matches hit 5.3% of the abundance-weighted query.
the recovered matches hit 1.6% of the query k-mers (unweighted).
```
run against GTDB+MAGs:
```
% sourmash gather --picklist concat.csv:md5:md5 SRR11945343.subtract.sig.gz gtdb-rs214-reps.k21.zip all-MAGs_21.zip --threshold=0
...
found 7 matches total;
the recovered matches hit 7.0% of the abundance-weighted query.
the recovered matches hit 1.8% of the query k-mers (unweighted).
```
so at least here we do see assignability go up!
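It would be worth repeating this spot-check on the other five accessions; a sketch, assuming their `.subtract.sig.gz` files are copied in alongside:
```
for acc in SRR21266884 SRR21278121 SRR21276809 SRR21276820 SRR21266760; do
    sourmash gather ${acc}.subtract.sig.gz gtdb-rs214-reps.k21.zip \
        --threshold-bp=0 -o ${acc}.gtdb-reps.check.csv
    sourmash gather ${acc}.subtract.sig.gz gtdb-rs214-reps.k21.zip all-MAGs_21.zip \
        --threshold-bp=0 -o ${acc}.gtdb-reps+MAGs.check.csv
done
```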