Ralstonia pangenome
===
## Setup
```
mamba create -n smash-pg sourmash=4.8.10
mamba activate smash-pg
pip install sourmash_utils sourmash_plugin_pangenomics sourmash_plugin_betterplot
```
```
git clone git@github.com:ctb/2024-pangenome-hash-corr.git
cd 2024-pangenome-hash-corr
mkdir rs-pangenome
cd rs-pangenome
```
## Download sketches and taxonomic information
Download a database
containing signatures of 32 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information.
```
# database
curl -JLO https://osf.io/download/wxtk3/
# taxonomy csv
curl -JLO https://osf.io/download/sj2z7/
# lingroup csv
curl -JLO https://osf.io/download/nqms2/
```
> scaled 100 db instead: https://osf.io/download/wx9kv/
## Build merged signature
```
sourmash scripts pangenome_merge -k 31\
ralstonia.sc1000.zip \
-o ralstonia32.k31-sc1000.merged.zip
```
## Build ranktable, no lineages
```
sourmash scripts pangenome_ranktable \
ralstonia32.k31-sc1000.merged.zip \
-o ralstonia32.k31-sc1000.ranktable.csv \
-k 31
```
## Calculate hash presence
```
../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \
ralstonia.sc1000.zip -o ralstonia32.k31-sc1000.dump \
--scaled=1000
```
## Rectangular matrix showing hash x genome correlations:
```
../hash-by-sample.py ralstonia32.k31-sc1000.dump \
-o ralstonia32.k31-sc1000.presence.csv -m 15
```
output:
```
loaded 42616 hash to sample entries.
wrote 42616 entries to 'ralstonia32.k31.presence.csv'
use 'sourmash scripts clustermap1' from betterplot to plot!
e.g. 'sourmash scripts clustermap1 ralstonia32.k31.presence.csv -o fig.png
```
## Try to plot
```
sourmash scripts clustermap1 ralstonia32.k31-sc1000.presence.csv -u presence \
-o ralstonia32.k31-sc1000.presence.png --boolean -R ralstonia32.phylotypes.csv --figsize-x 13 --figsize-y 10
```
#--no-labels
# Now, add reads in to the plot
## cat the sigs to a zip
```
sourmash sig cat ralstonia.sc1000.zip \
bc16.first100.singleton.zip \
-o ralstonia+bc16first100.zip
```
## Calculate hash presence but with reads too
```
../calc-hash-presence.py ralstonia32.k31.ranktable.csv \
ralstonia+bc16first100.zip \
-o ralstonia32+bc16.k31.dump \
--scaled=1000
```
## Rectangular matrix showing hash x genome + read correlations:
```
../hash-by-sample.py ralstonia32+bc16.k31.dump \
-o ralstonia32+bc16first100.k31.presence.csv \
-m 15
```
## Try to plot
```
sourmash scripts clustermap1 ralstonia32+bc16first100.k31.presence.csv \
-u presence \
-o ralstonia32+bc16first100.k31.presence.png \
--boolean
```
## first 500 reads instead
```
sourmash sig cat ralstonia.sc1000.zip \
bc16.first500.singleton.zip \
-o ralstonia+bc16first500.zip
../calc-hash-presence.py ralstonia32.k31.ranktable.csv \
ralstonia+bc16first500.zip \
-o ralstonia32+bc16first500.k31.dump \
--scaled=1000
../hash-by-sample.py ralstonia32+bc16first500.k31.dump \
-o ralstonia32+bc16first500.k31.presence.csv \
-m 15
sourmash scripts clustermap1 ralstonia32+bc16first500.k31.presence.csv \
-u presence \
-o ralstonia32+bc16first500.k31.presence.png \
--boolean
```
## first 1000 reads instead
```
sourmash sig cat ralstonia.sc1000.zip \
bc16.first1000.sc1000.singleton.zip \
-o ralstonia+bc16first1000.zip
../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \
ralstonia+bc16first1000.zip \
-o ralstonia32+bc16first1000.k31-sc1000.dump \
--scaled=1000
../hash-by-sample.py ralstonia32+bc16first1000.k31-sc1000.dump \
-o ralstonia32+bc16first1000.k31-sc1000.presence.csv \
-m 15
sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc1000.presence.csv \
-u presence \
-o ralstonia32+bc16first1000.k31-sc1000.presence.png \
--boolean --figsize-x 15 --figsize-y 12 \
-R ralstonia32+reads.phylotypes.csv
```
## Try at lower scaled
first 100 reads, scaled 100
```
sourmash scripts pangenome_merge -k 31 \
ralstonia32.sc100.zip \
-o ralstonia32.k31-sc100.merged.zip
sourmash scripts pangenome_ranktable \
ralstonia32.k31-sc100.merged.zip \
-o ralstonia32.k31-sc100.ranktable.csv \
-k 31
sourmash sig cat ralstonia32.sc100.zip \
bc16.first100.sc100.singleton.zip \
-o ralstonia+bc16first100.sc100.zip
../calc-hash-presence.py ralstonia32.k31.ranktable.csv \
ralstonia+bc16first100.sc100.zip \
-o ralstonia32+bc16.k31-sc100.dump \
--scaled=100
../hash-by-sample.py ralstonia32+bc16.k31-sc100.dump \
-o ralstonia32+bc16first100.k31-sc100.presence.csv \
-m 15
sourmash scripts clustermap1 ralstonia32+bc16first100.k31-sc100.presence.csv \
-u presence \
-o ralstonia32+bc16first100.k31-sc100.presence.png \
--boolean --figsize-x 15 --figsize-y 12 \
-R ralstonia32+reads.phylotypes.csv
```
first 1000 reads, scaled 100
```
sourmash scripts pangenome_merge -k 31 \
ralstonia32.sc100.zip \
-o ralstonia32.k31-sc100.merged.zip
sourmash scripts pangenome_ranktable \
ralstonia32.k31-sc100.merged.zip \
-o ralstonia32.k31-sc100.ranktable.csv \
-k 31
sourmash sig cat ralstonia32.sc100.zip \
bc16.first1000.sc100.singleton.zip \
-o ralstonia+bc16first1000.sc100.zip
../calc-hash-presence.py ralstonia32.k31.ranktable.csv \
ralstonia+bc16first1000.sc100.zip \
-o ralstonia32+bc16first1000.k31-sc100.dump \
--scaled=100
../hash-by-sample.py ralstonia32+bc16first1000.k31-sc100.dump \
-o ralstonia32+bc16first1000.k31-sc100.presence.csv \
-m 15
sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc100.presence.csv \
-u presence \
-o ralstonia32+bc16first1000.k31-sc100.presence.png \
--boolean --figsize-x 15 --figsize-y 12 \
-R ralstonia32+reads.phylotypes.csv
```
first 1000 reads, scaled 10
```
sourmash scripts pangenome_merge -k 31 \
ralstonia32.sc10.zip \
-o ralstonia32.k31-sc10.merged.zip
sourmash scripts pangenome_ranktable \
ralstonia32.k31-sc10.merged.zip \
-o ralstonia32.k31-sc10.ranktable.csv \
-k 31
sourmash sig cat ralstonia32.sc10.zip \
bc16.first1000.sc10.singleton.zip \
-o ralstonia+bc16first1000.sc10.zip
../calc-hash-presence.py ralstonia32.k31.ranktable.csv \
ralstonia+bc16first1000.sc10.zip \
-o ralstonia32+bc16first1000.k31-sc10.dump \
--scaled=10
../hash-by-sample.py ralstonia32+bc16first1000.k31-sc10.dump \
-o ralstonia32+bc16first1000.k31-sc10.presence.csv \
-m 15
sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc10.presence.csv \
-u presence \
-o ralstonia32+bc16first1000.k31-sc10.presence.png \
--boolean --figsize-x 15 --figsize-y 12 \
-R ralstonia32+reads.phylotypes.csv
```
---
# Try cluster --> categories
all x all comparison
```
sourmash scripts pairwise ralstonia.sc1000.zip -k 31 --scaled 1000 -o ralstonia31.k31-sc1000.pairwise.csv --write-all --ani
```
cluster
```
sourmash scripts cluster ralstonia31.k31-sc1000.pairwise.csv \
-o ralstonia31.k31-sc1000.ani-cluster.csv \
--similarity average_containment_ani -t 0.97
```
---
:::warning
## Building pangenome with lineages did not work:
```
download gtdb ralstonia genomes and tax:
curl -JLO https://osf.io/download/4wbdn/
curl -JLO https://osf.io/download/y4p9k/
sourmash scripts pangenome_createdb ralstonia.sc1000.zip \
-t gtdb-ralstonia.lineages.csv \
-o ralstonia32.sig.zip --abund -k 31
```
output:
```
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
loading taxonomies from ['gtdb-ralstonia.lineages.csv']
found 530 identifiers in taxdb.
selecting sketches: k=31 scaled=1000 moltype=DNA
loading sketches from file ralstonia.sc1000.zip
Traceback (most recent call last):
File "/Users/ntward/miniforge3/envs/smash-pg/bin/sourmash", line 11, in <module>
sys.exit(main())
^^^^^^
File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
retval = mainmethod(args)
^^^^^^^^^^^^^^^^
File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash_plugin_pangenomics.py", line 91, in main
return pangenome_createdb_main(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash_plugin_pangenomics.py", line 233, in pangenome_createdb_main
lineage_tup = tax_utils.RankLineageInfo(lineage=lineage_tup)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 7, in __init__
File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 375, in __post_init__
self._init_from_lineage_tuples()
File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 199, in _init_from_lineage_tuples
for lin_tup in self.lineage:
TypeError: 'RankLineageInfo' object is not iterable
```
:::