Example pangenome hash correlation: Ralstonia solanacearum
===
## Setup
```
mamba create -n smash-pg sourmash=4.8.10
mamba activate smash-pg
pip install sourmash_utils sourmash_plugin_pangenomics sourmash_plugin_betterplot
```
```
git clone git@github.com:ctb/2024-pangenome-hash-corr.git
cd 2024-pangenome-hash-corr
mkdir rs-pangenome
cd rs-pangenome
```
> If the `git@` (SSH) URL doesn't work for you, clone over HTTPS instead: `https://github.com/ctb/2024-pangenome-hash-corr.git`
## Download sketches and taxonomic information
Download a database containing sketches (signatures) of 32 Ralstonia genomes (pathogenic and non-pathogenic), along with the corresponding taxonomy, lingroup, and phylotype information.
```
# database
curl -JLO https://osf.io/download/wxtk3/
# taxonomy csv
curl -JLO https://osf.io/download/sj2z7/
# lingroup csv
curl -JLO https://osf.io/download/nqms2/
# phylotype categories
curl -JLO https://osf.io/download/ptvgr/
```
## Build merged signature
Merge k-mers from all Ralstonia genomes into a single sourmash signature representing the "pangenome".
```
sourmash scripts pangenome_merge -k 31 \
ralstonia.sc1000.zip \
-o ralstonia32.k31-sc1000.merged.zip
```
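To make the idea of a merged "pangenome" sketch concrete, here is a minimal Python sketch using toy data. It assumes that merging means taking the union of hashes across genomes and recording how many genomes contain each hash; the genome names, hash values, and the `merge_sketches` helper are hypothetical and are not the plugin's actual implementation.

```python
from collections import Counter

# Toy stand-ins for per-genome sketches: each genome is a set of hash values.
# (Hypothetical data; real sketches hold 64-bit hashes of k-mers.)
genome_sketches = {
    "genomeA": {11, 22, 33, 44},
    "genomeB": {22, 33, 55},
    "genomeC": {33, 44, 55, 66},
}


def merge_sketches(sketches):
    """Union all hashes and count how many genomes contain each one."""
    counts = Counter()
    for hashes in sketches.values():
        # Each genome is a set, so it adds at most 1 to any hash's count.
        counts.update(hashes)
    return counts


merged = merge_sketches(genome_sketches)
for hashval, n_genomes in sorted(merged.items()):
    print(hashval, n_genomes)
```

In this toy example, hash 33 is shared by all three genomes, while hashes 11 and 66 are each unique to one genome.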
## Build ranktable, no lineages
Build a table of the k-mers in the Ralstonia pangenome.
```
sourmash scripts pangenome_ranktable \
ralstonia32.k31-sc1000.merged.zip \
-o ralstonia32.k31-sc1000.ranktable.csv \
-k 31
```
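As a rough illustration of what a rank table could contain, the sketch below orders toy hashes by how many genomes they occur in and writes the result as CSV. The column names (`rank`, `hashval`, `n_genomes`) and the ordering rule are assumptions made for this example; the actual columns written by `pangenome_ranktable` may differ.

```python
import csv
import io

# Toy per-hash genome counts (hypothetical values).
hash_counts = {11: 1, 22: 2, 33: 3, 44: 2, 55: 2, 66: 1}


def rank_hashes(counts):
    """Order hashes from most to least widely shared; ties break on hash value."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {"rank": rank, "hashval": hashval, "n_genomes": n}
        for rank, (hashval, n) in enumerate(ordered, start=1)
    ]


rows = rank_hashes(hash_counts)

# Write the table to an in-memory CSV so the example has no side effects.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["rank", "hashval", "n_genomes"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```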
## Calculate hash presence
Calculate the presence of each k-mer in each Ralstonia genome.
```
../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \
ralstonia.sc1000.zip -o ralstonia32.k31-sc1000.dump \
--scaled=1000
```
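Conceptually, this step maps each pangenome hash to the genomes that contain it. The following is a minimal sketch of that mapping on toy data; the `hash_presence` helper and its inputs are hypothetical and do not reflect the internals or the output format of `calc-hash-presence.py`.

```python
from collections import defaultdict

# Toy per-genome sketches (hypothetical data).
genome_sketches = {
    "genomeA": {11, 22, 33, 44},
    "genomeB": {22, 33, 55},
    "genomeC": {33, 44, 55, 66},
}

# Hashes of interest, e.g. those listed in a rank table.
pangenome_hashes = {22, 33, 44, 55}


def hash_presence(hashes_of_interest, sketches):
    """Return {hash: set of genome names containing that hash}."""
    presence = defaultdict(set)
    for name, hashes in sketches.items():
        # Only hashes in both the genome and the pangenome list are recorded.
        for hashval in hashes & hashes_of_interest:
            presence[hashval].add(name)
    return dict(presence)


presence = hash_presence(pangenome_hashes, genome_sketches)
for hashval in sorted(presence):
    print(hashval, sorted(presence[hashval]))
```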
## Rectangular matrix showing hash x genome correlations
Build a presence-absence table from the hash presence counted above. Limit the total number of k-mers so we can plot in the next step (here, we limit to k-mers found in 15 or more genomes).
```
../hash-by-sample.py ralstonia32.k31-sc1000.dump \
-o ralstonia32.k31-sc1000.presence.csv -m 15
```
Example output (the filename in the messages will match whatever you passed to `-o`):
```
loaded 42616 hash to sample entries.
wrote 42616 entries to 'ralstonia32.k31.presence.csv'
use 'sourmash scripts clustermap1' from betterplot to plot!
e.g. 'sourmash scripts clustermap1 ralstonia32.k31.presence.csv -o fig.png
```
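The minimum-genome filter can be sketched in a few lines of Python. This example assumes the threshold keeps hashes found in at least that many genomes, as described above, and it lays the matrix out with one row per hash and one column per genome; that orientation and the `presence_matrix` helper are assumptions for illustration, not a description of `hash-by-sample.py`.

```python
# Toy mapping of hash -> genomes containing it (hypothetical data).
presence = {
    11: {"genomeA"},
    22: {"genomeA", "genomeB"},
    33: {"genomeA", "genomeB", "genomeC"},
    44: {"genomeA", "genomeC"},
    55: {"genomeB", "genomeC"},
    66: {"genomeC"},
}
genomes = ["genomeA", "genomeB", "genomeC"]


def presence_matrix(presence, genomes, min_genomes):
    """Keep hashes seen in >= min_genomes genomes and build a 0/1 matrix."""
    kept = sorted(h for h, names in presence.items() if len(names) >= min_genomes)
    matrix = [[1 if g in presence[h] else 0 for g in genomes] for h in kept]
    return kept, matrix


kept_hashes, matrix = presence_matrix(presence, genomes, min_genomes=2)
for hashval, row in zip(kept_hashes, matrix):
    print(hashval, row)
```

Raising `min_genomes` shrinks the matrix toward the hashes shared by most genomes, which is what keeps the later plot to a manageable size.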
## Try to plot
Cluster and plot k-mer presence per genome, and annotate the plot with the known phylotype structure.
```
sourmash scripts clustermap1 ralstonia32.k31-sc1000.presence.csv -u presence \
-o ralstonia32.k31-sc1000.presence.png --boolean -R ralstonia32.phylotypes.csv --figsize-x 13 --figsize-y 10
```
> The labels are not that useful. Try adding `--no-labels` for a prettier plot.
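
For intuition about what clustering a presence/absence matrix involves, the sketch below computes pairwise Jaccard distances between toy genomes from the hashes they contain. The metric and linkage that `clustermap1` actually uses are not stated here; Jaccard distance is used purely as an illustration of a common choice for boolean data, and all names and values are hypothetical.

```python
from itertools import combinations

# Toy sets of hashes present in each genome (hypothetical data).
genome_hashes = {
    "genomeA": {1, 2, 3, 4},
    "genomeB": {1, 2, 3, 5},
    "genomeC": {6, 7},
}


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Distance for every unordered pair of genomes.
distances = {
    (g1, g2): jaccard_distance(genome_hashes[g1], genome_hashes[g2])
    for g1, g2 in combinations(sorted(genome_hashes), 2)
}

# A hierarchical clustering would typically join the closest pair first.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```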