CompareM: compare genomes with AAI (on Mac OS X)

Titus Brown, titus@idyll.org

(Linked to CompareM/issues/16.)

Install dependencies

(You'll need homebrew installed already.)

brew install prodigal
brew install diamond
brew install coreutils

Install comparem in Python 2

Note, comparem is not (yet) Python 3 compatible.

pip install comparem

Fix gsort and sed incompatibilities on Mac OS X

Mac OS X has issues with regular sort and so I am using GNU sort installed with homebrew coreutils (above) as gsort.

Also, sed is different on Mac OS X so that has to be fixed too. Whatevs.

diff --git a/comparem/similarity_search.py b/comparem/similarity_search.py
index 6a4b579..33b3559 100644
--- a/comparem/similarity_search.py
+++ b/comparem/similarity_search.py
@@ -90,8 +90,8 @@ class SimilaritySearch(object):
         """
         
         self.logger.info('Sorting table with hits (be patient!).')
-        os.system("LC_ALL=C sed -i 's/~/\t/g' %s" % input_hit_table)
-        os.system("LC_ALL=C sort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table))
+        os.system("LC_ALL=C sed -i '' 's/~/\t/g' %s" % input_hit_table)
+        os.system("LC_ALL=C gsort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table))
         os.system('mv %s %s' % (input_hit_table, output_hit_table))
      
     def _run_self_blastp(self, query_gene_file, 

Run on test data included as part of sourmash

git clone https://github.com/dib-lab/sourmash.git
cd sourmash/data
mkdir comparem
cd comparem
cp ../GCF*.fna.gz .
gunzip *.fna.gz

Now, put all of the genomes in a list file:

ls -1 *.fna > comparem.list

and run comparem aai_wf:

comparem aai_wf comparem.list out

I see the following output:

[2018-02-19 11:52:47] INFO: CompareM v0.0.23
[2018-02-19 11:52:47] INFO: comparem aai_wf comparem.list out
[2018-02-19 11:52:47] INFO: Identifying genes within genomes:
  Finished processing 3 of 3 (100.00%) genomes.
[2018-02-19 11:54:36] INFO: Identified genes written to: out/genes
[2018-02-19 11:54:36] INFO: Appending genome identifiers to query genes.
[2018-02-19 11:54:36] INFO: Creating DIAMOND database (be patient!).
[2018-02-19 11:54:37] INFO: Performing self similarity sequence between genomes (be patient!).
[2018-02-19 11:54:55] INFO: Sorting table with hits (be patient!).
[2018-02-19 11:54:55] INFO: Sequence similarity results written to: out/similarity
[2018-02-19 11:54:55] INFO: Calculating length of genes.
[2018-02-19 11:54:55] INFO: Indexing sorted hit table.
[2018-02-19 11:54:55] INFO: Calculating AAI between all 3 pairs of genomes:
  Finished processing 3 of 3 (100.00%) pairs.
[2018-02-19 11:54:55] INFO: Summarizing AAI results.
[2018-02-19 11:54:55] INFO: AAI between genomes written to: out/aai/aai_summary.tsv

Then (assuming it's all successful) display things with tabs lined up nicely:

column -t -s $'\\t' out/aai/aai_summary.tsv

or load into Excel or R or Python or something

The output is (N * (N-1)) / 2 rows, i.e. the upper half of a distance matrix.

Convert output into a sourmash compare-style numpy matrix

#! /usr/bin/env python
from __future__ import print_function, division
import sys
import numpy
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('aai_summary_tsv')
    parser.add_argument('-o', '--output')
    args = parser.parse_args()

    assert args.output, 'please specify --output'

    indices = {}
    with open(args.aai_summary_tsv, 'rt') as fp:
        lines = fp.readlines()
        
    lines = [ x.strip().split('\t') for x in lines ]
    lines = lines[1:]

    # assign unique indices to the thing
    filenum = 0
    for row in lines:
        g1 = row[0]
        g2 = row[2]

        if g1 not in indices:
            indices[g1] = filenum
            filenum += 1

        if g2 not in indices:
            indices[g2] = filenum
            filenum += 1

    print('...got {} genomes.'.format(filenum))

    D = numpy.zeros((filenum, filenum))
    for row in lines:
        g1 = row[0]
        n1 = indices[g1]
        g2 = row[2]
        n2 = indices[g2]
        sim = row[5]

        D[n1,n2] = D[n2,n1] = float(sim) / 100.

    for i in range(filenum):
        D[i,i] = 1.0

    x = []
    for k, v in indices.items():
        x.append((v, k))
    x.sort()

    labeloutname = args.output + '.labels.txt'
    print('saving labels to: {}'.format(labeloutname))
    with open(labeloutname, 'w') as fp:
        fp.write("\n".join([ tup[1] for tup in x ]))

    print('saving distance matrix to: {}'.format(args.output))
    with open(args.output, 'wb') as fp:
        numpy.save(fp, D)


if __name__ == '__main__':
    main()
Select a repo