Ralstonia pangenome === ## Setup ``` mamba create -n smash-pg sourmash=4.8.10 mamba activate smash-pg pip install sourmash_utils sourmash_plugin_pangenomics sourmash_plugin_betterplot ``` ``` git clone git@github.com:ctb/2024-pangenome-hash-corr.git cd 2024-pangenome-hash-corr mkdir rs-pangenome cd rs-pangenome ``` ## Download sketches and taxonomic information Download a database containing signatures of 32 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information. ``` # database curl -JLO https://osf.io/download/wxtk3/ # taxonomy csv curl -JLO https://osf.io/download/sj2z7/ # lingroup csv curl -JLO https://osf.io/download/nqms2/ ``` > scaled 100 db instead: https://osf.io/download/wx9kv/ ## Build merged signature ``` sourmash scripts pangenome_merge -k 31\ ralstonia.sc1000.zip \ -o ralstonia32.k31-sc1000.merged.zip ``` ## Build ranktable, no lineages ``` sourmash scripts pangenome_ranktable \ ralstonia32.k31-sc1000.merged.zip \ -o ralstonia32.k31-sc1000.ranktable.csv \ -k 31 ``` ## Calculate hash presence ``` ../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \ ralstonia.sc1000.zip -o ralstonia32.k31-sc1000.dump \ --scaled=1000 ``` ## Rectangular matrix showing hash x genome correlations: ``` ../hash-by-sample.py ralstonia32.k31-sc1000.dump \ -o ralstonia32.k31-sc1000.presence.csv -m 15 ``` output: ``` loaded 42616 hash to sample entries. wrote 42616 entries to 'ralstonia32.k31.presence.csv' use 'sourmash scripts clustermap1' from betterplot to plot! e.g. 'sourmash scripts clustermap1 ralstonia32.k31.presence.csv -o fig.png ``` ## Try to plot ``` sourmash scripts clustermap1 ralstonia32.k31-sc1000.presence.csv -u presence \ -o ralstonia32.k31-sc1000.presence.png --boolean -R ralstonia32.phylotypes.csv --figsize-x 13 --figsize-y 10 ``` #--no-labels # Now, add reads in to the plot ## cat the sigs to a zip ``` sourmash sig cat ralstonia.sc1000.zip \ bc16.first100.singleton.zip \ -o ralstonia+bc16first100.zip ``` ## Calculate hash presence but with reads too ``` ../calc-hash-presence.py ralstonia32.k31.ranktable.csv \ ralstonia+bc16first100.zip \ -o ralstonia32+bc16.k31.dump \ --scaled=1000 ``` ## Rectangular matrix showing hash x genome + read correlations: ``` ../hash-by-sample.py ralstonia32+bc16.k31.dump \ -o ralstonia32+bc16first100.k31.presence.csv \ -m 15 ``` ## Try to plot ``` sourmash scripts clustermap1 ralstonia32+bc16first100.k31.presence.csv \ -u presence \ -o ralstonia32+bc16first100.k31.presence.png \ --boolean ``` ## first 500 reads instead ``` sourmash sig cat ralstonia.sc1000.zip \ bc16.first500.singleton.zip \ -o ralstonia+bc16first500.zip ../calc-hash-presence.py ralstonia32.k31.ranktable.csv \ ralstonia+bc16first500.zip \ -o ralstonia32+bc16first500.k31.dump \ --scaled=1000 ../hash-by-sample.py ralstonia32+bc16first500.k31.dump \ -o ralstonia32+bc16first500.k31.presence.csv \ -m 15 sourmash scripts clustermap1 ralstonia32+bc16first500.k31.presence.csv \ -u presence \ -o ralstonia32+bc16first500.k31.presence.png \ --boolean ``` ## first 1000 reads instead ``` sourmash sig cat ralstonia.sc1000.zip \ bc16.first1000.sc1000.singleton.zip \ -o ralstonia+bc16first1000.zip ../calc-hash-presence.py ralstonia32.k31-sc1000.ranktable.csv \ ralstonia+bc16first1000.zip \ -o ralstonia32+bc16first1000.k31-sc1000.dump \ --scaled=1000 ../hash-by-sample.py ralstonia32+bc16first1000.k31-sc1000.dump \ -o ralstonia32+bc16first1000.k31-sc1000.presence.csv \ -m 15 sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc1000.presence.csv \ -u presence \ -o ralstonia32+bc16first1000.k31-sc1000.presence.png \ --boolean --figsize-x 15 --figsize-y 12 \ -R ralstonia32+reads.phylotypes.csv ``` ## Try at lower scaled first 100 reads, scaled 100 ``` sourmash scripts pangenome_merge -k 31 \ ralstonia32.sc100.zip \ -o ralstonia32.k31-sc100.merged.zip sourmash scripts pangenome_ranktable \ ralstonia32.k31-sc100.merged.zip \ -o ralstonia32.k31-sc100.ranktable.csv \ -k 31 sourmash sig cat ralstonia32.sc100.zip \ bc16.first100.sc100.singleton.zip \ -o ralstonia+bc16first100.sc100.zip ../calc-hash-presence.py ralstonia32.k31.ranktable.csv \ ralstonia+bc16first100.sc100.zip \ -o ralstonia32+bc16.k31-sc100.dump \ --scaled=100 ../hash-by-sample.py ralstonia32+bc16.k31-sc100.dump \ -o ralstonia32+bc16first100.k31-sc100.presence.csv \ -m 15 sourmash scripts clustermap1 ralstonia32+bc16first100.k31-sc100.presence.csv \ -u presence \ -o ralstonia32+bc16first100.k31-sc100.presence.png \ --boolean --figsize-x 15 --figsize-y 12 \ -R ralstonia32+reads.phylotypes.csv ``` first 1000 reads, scaled 100 ``` sourmash scripts pangenome_merge -k 31 \ ralstonia32.sc100.zip \ -o ralstonia32.k31-sc100.merged.zip sourmash scripts pangenome_ranktable \ ralstonia32.k31-sc100.merged.zip \ -o ralstonia32.k31-sc100.ranktable.csv \ -k 31 sourmash sig cat ralstonia32.sc100.zip \ bc16.first1000.sc100.singleton.zip \ -o ralstonia+bc16first1000.sc100.zip ../calc-hash-presence.py ralstonia32.k31.ranktable.csv \ ralstonia+bc16first1000.sc100.zip \ -o ralstonia32+bc16first1000.k31-sc100.dump \ --scaled=100 ../hash-by-sample.py ralstonia32+bc16first1000.k31-sc100.dump \ -o ralstonia32+bc16first1000.k31-sc100.presence.csv \ -m 15 sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc100.presence.csv \ -u presence \ -o ralstonia32+bc16first1000.k31-sc100.presence.png \ --boolean --figsize-x 15 --figsize-y 12 \ -R ralstonia32+reads.phylotypes.csv ``` first 1000 reads, scaled 10 ``` sourmash scripts pangenome_merge -k 31 \ ralstonia32.sc10.zip \ -o ralstonia32.k31-sc10.merged.zip sourmash scripts pangenome_ranktable \ ralstonia32.k31-sc10.merged.zip \ -o ralstonia32.k31-sc10.ranktable.csv \ -k 31 sourmash sig cat ralstonia32.sc10.zip \ bc16.first1000.sc10.singleton.zip \ -o ralstonia+bc16first1000.sc10.zip ../calc-hash-presence.py ralstonia32.k31.ranktable.csv \ ralstonia+bc16first1000.sc10.zip \ -o ralstonia32+bc16first1000.k31-sc10.dump \ --scaled=10 ../hash-by-sample.py ralstonia32+bc16first1000.k31-sc10.dump \ -o ralstonia32+bc16first1000.k31-sc10.presence.csv \ -m 15 sourmash scripts clustermap1 ralstonia32+bc16first1000.k31-sc10.presence.csv \ -u presence \ -o ralstonia32+bc16first1000.k31-sc10.presence.png \ --boolean --figsize-x 15 --figsize-y 12 \ -R ralstonia32+reads.phylotypes.csv ``` --- # Try cluster --> categories all x all comparison ``` sourmash scripts pairwise ralstonia.sc1000.zip -k 31 --scaled 1000 -o ralstonia31.k31-sc1000.pairwise.csv --write-all --ani ``` cluster ``` sourmash scripts cluster ralstonia31.k31-sc1000.pairwise.csv \ -o ralstonia31.k31-sc1000.ani-cluster.csv \ --similarity average_containment_ani -t 0.97 ``` --- :::warning ## Building pangenome with lineages did not work: ``` download gtdb ralstonia genomes and tax: curl -JLO https://osf.io/download/4wbdn/ curl -JLO https://osf.io/download/y4p9k/ sourmash scripts pangenome_createdb ralstonia.sc1000.zip \ -t gtdb-ralstonia.lineages.csv \ -o ralstonia32.sig.zip --abund -k 31 ``` output: ``` == This is sourmash version 4.8.10. == == Please cite Irber et. al (2024), doi:10.21105/joss.06830. == loading taxonomies from ['gtdb-ralstonia.lineages.csv'] found 530 identifiers in taxdb. selecting sketches: k=31 scaled=1000 moltype=DNA loading sketches from file ralstonia.sc1000.zip Traceback (most recent call last): File "/Users/ntward/miniforge3/envs/smash-pg/bin/sourmash", line 11, in <module> sys.exit(main()) ^^^^^^ File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main retval = mainmethod(args) ^^^^^^^^^^^^^^^^ File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash_plugin_pangenomics.py", line 91, in main return pangenome_createdb_main(args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash_plugin_pangenomics.py", line 233, in pangenome_createdb_main lineage_tup = tax_utils.RankLineageInfo(lineage=lineage_tup) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "<string>", line 7, in __init__ File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 375, in __post_init__ self._init_from_lineage_tuples() File "/Users/ntward/miniforge3/envs/smash-pg/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 199, in _init_from_lineage_tuples for lin_tup in self.lineage: TypeError: 'RankLineageInfo' object is not iterable ``` :::