Example pangenome hash correlation: Ralstonia solanacearum
===

## Setup

```
mamba create -n smash-pg sourmash=4.8.10
mamba activate smash-pg
pip install sourmash_utils sourmash_plugin_pangenomics sourmash_plugin_betterplot
```

```
git clone git@github.com:ctb/2024-pangenome-hash-corr.git
cd 2024-pangenome-hash-corr
mkdir rs-pangenome
cd rs-pangenome
```

> If the git@ link doesn't work for you, use `https://github.com/ctb/2024-pangenome-hash-corr.git`.

## Download sketches and taxonomic information

Download a database containing signatures of 32 Ralstonia genomes (pathogenic and not), along with the corresponding taxonomic and lingroup information.

```
# database
curl -JLO https://osf.io/download/wxtk3/
# taxonomy csv
curl -JLO https://osf.io/download/sj2z7/
# lingroup csv
curl -JLO https://osf.io/download/nqms2/
# phylotype categories
curl -JLO https://osf.io/download/ptvgr/
```

## Build merged signature

Merge k-mers from all Ralstonia genomes into a single sourmash signature representing the "pangenome".

```
sourmash scripts pangenome_merge -k 31 \
    ralstonia.sc1000.zip \
    -o ralstonia32.k31-sc1000.merged.zip
```

## Build ranktable, no lineages

Build a rank table of the k-mers in the Ralstonia pangenome.

```
sourmash scripts pangenome_ranktable \
    ralstonia32.k31-sc1000.merged.zip \
    -o ralstonia32.k31-sc1000.ranktable.csv \
    -k 31
```

## Calculate hash presence

Calculate the presence of each k-mer in each Ralstonia genome.

```
../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \
    ralstonia.sc1000.zip -o ralstonia32.k31-sc1000.dump \
    --scaled=1000
```

## Rectangular matrix showing hash x genome correlations

Build a presence-absence table from the hash presence calculated above. Limit the total number of k-mers so that the table can be plotted in the next step (here, we keep only k-mers found in 15 or more genomes).

```
../hash-by-sample.py ralstonia32.k31-sc1000.dump \
    -o ralstonia32.k31-sc1000.presence.csv -m 15
```

Output:

```
loaded 42616 hash to sample entries.
wrote 42616 entries to 'ralstonia32.k31.presence.csv'
use 'sourmash scripts clustermap1' from betterplot to plot!
e.g. 'sourmash scripts clustermap1 ralstonia32.k31.presence.csv -o fig.png
```

## Try to plot

Cluster and plot k-mers per genome, adding in the known phylotype structure.

```
sourmash scripts clustermap1 ralstonia32.k31-sc1000.presence.csv -u presence \
    -o ralstonia32.k31-sc1000.presence.png --boolean \
    -R ralstonia32.phylotypes.csv \
    --figsize-x 13 --figsize-y 10
```

> The labels are not that useful. Try adding `--no-labels` for a prettier plot.
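
## Sketch: what the presence-absence filter does

To make the presence-absence step above more concrete, here is a minimal Python sketch of the idea: keep only hashes found in at least a minimum number of genomes, then lay them out as a 0/1 table. This is an illustration under stated assumptions, not the code in `hash-by-sample.py`. The function name `build_presence_table`, the toy hashes and genome names, and the choice of hashes as rows are all hypothetical.

```python
def build_presence_table(hash_to_samples, min_genomes):
    """Keep hashes seen in >= min_genomes samples; return (samples, rows).

    hash_to_samples: dict mapping a hash value to the set of sample names
    that contain it (an assumed in-memory form, not the .dump file format).
    """
    # Drop hashes that occur in too few genomes (analogous to `-m 15` above).
    kept = {h: s for h, s in hash_to_samples.items() if len(s) >= min_genomes}

    # Columns: every sample that contains at least one kept hash, sorted by name.
    samples = sorted({name for members in kept.values() for name in members})

    # One row of 0/1 values per kept hash.
    rows = {
        h: [1 if name in members else 0 for name in samples]
        for h, members in kept.items()
    }
    return samples, rows


# Toy input: three made-up hashes across three made-up genomes.
toy = {
    101: {"genomeA", "genomeB", "genomeC"},
    202: {"genomeA"},
    303: {"genomeB", "genomeC"},
}

samples, rows = build_presence_table(toy, min_genomes=2)

print("hash", *samples, sep=",")
for hashval, row in sorted(rows.items()):
    print(hashval, *row, sep=",")
```

With `min_genomes=2`, hash `202` is dropped because it occurs in only one genome. Raising the threshold shrinks the table, which is why the walkthrough filters before plotting.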