owned this note
owned this note
Published
Linked with GitHub
# CompareM: compare genomes with AAI (on Mac OS X)
Titus Brown, titus@idyll.org
(Linked to [CompareM/issues/16](https://github.com/dparks1134/CompareM/issues/16).)
## Install dependencies
(You'll need [homebrew](https://brew.sh/) installed already.)
```
brew install prodigal
brew install diamond
brew install coreutils
```
## Install comparem in Python 2
Note, comparem is not (yet) Python 3 compatible.
```
pip install comparem
```
## Fix gsort and sed incompatibilities on Mac OS X
Mac OS X has issues with regular `sort` and so I am using GNU sort installed with homebrew coreutils (above) as `gsort`.
Also, `sed` is [different on Mac OS X](https://ed.gs/2016/01/26/os-x-sed-invalid-command-code/) so that has to be fixed too. Whatevs.
```
diff --git a/comparem/similarity_search.py b/comparem/similarity_search.py
index 6a4b579..33b3559 100644
--- a/comparem/similarity_search.py
+++ b/comparem/similarity_search.py
@@ -90,8 +90,8 @@ class SimilaritySearch(object):
"""
self.logger.info('Sorting table with hits (be patient!).')
- os.system("LC_ALL=C sed -i 's/~/\t/g' %s" % input_hit_table)
- os.system("LC_ALL=C sort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table))
+ os.system("LC_ALL=C sed -i '' 's/~/\t/g' %s" % input_hit_table)
+ os.system("LC_ALL=C gsort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table))
os.system('mv %s %s' % (input_hit_table, output_hit_table))
def _run_self_blastp(self, query_gene_file,
```
## Run on test data included as part of [sourmash](https://sourmash.readthedocs.io/en/latest/)
```
git clone https://github.com/dib-lab/sourmash.git
cd sourmash/data
mkdir comparem
cd comparem
cp ../GCF*.fna.gz .
gunzip *.fna.gz
```
Now, put all of the genomes in a list file:
```
ls -1 *.fna > comparem.list
```
and run `comparem aai_wf`:
```
comparem aai_wf comparem.list out
```
I see the following output:
```
[2018-02-19 11:52:47] INFO: CompareM v0.0.23
[2018-02-19 11:52:47] INFO: comparem aai_wf comparem.list out
[2018-02-19 11:52:47] INFO: Identifying genes within genomes:
Finished processing 3 of 3 (100.00%) genomes.
[2018-02-19 11:54:36] INFO: Identified genes written to: out/genes
[2018-02-19 11:54:36] INFO: Appending genome identifiers to query genes.
[2018-02-19 11:54:36] INFO: Creating DIAMOND database (be patient!).
[2018-02-19 11:54:37] INFO: Performing self similarity sequence between genomes (be patient!).
[2018-02-19 11:54:55] INFO: Sorting table with hits (be patient!).
[2018-02-19 11:54:55] INFO: Sequence similarity results written to: out/similarity
[2018-02-19 11:54:55] INFO: Calculating length of genes.
[2018-02-19 11:54:55] INFO: Indexing sorted hit table.
[2018-02-19 11:54:55] INFO: Calculating AAI between all 3 pairs of genomes:
Finished processing 3 of 3 (100.00%) pairs.
[2018-02-19 11:54:55] INFO: Summarizing AAI results.
[2018-02-19 11:54:55] INFO: AAI between genomes written to: out/aai/aai_summary.tsv
```
Then (assuming it's all successful...) display things with tabs lined up nicely:
```
column -t -s $'\\t' out/aai/aai_summary.tsv
```
...or load into Excel or R or Python or something...
The output is (N * (N-1)) / 2 rows, i.e. the upper half of a distance matrix.
# Convert output into a `sourmash compare`-style numpy matrix
```
#! /usr/bin/env python
from __future__ import print_function, division
import sys
import numpy
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('aai_summary_tsv')
parser.add_argument('-o', '--output')
args = parser.parse_args()
assert args.output, 'please specify --output'
indices = {}
with open(args.aai_summary_tsv, 'rt') as fp:
lines = fp.readlines()
lines = [ x.strip().split('\t') for x in lines ]
lines = lines[1:]
# assign unique indices to the thing
filenum = 0
for row in lines:
g1 = row[0]
g2 = row[2]
if g1 not in indices:
indices[g1] = filenum
filenum += 1
if g2 not in indices:
indices[g2] = filenum
filenum += 1
print('...got {} genomes.'.format(filenum))
D = numpy.zeros((filenum, filenum))
for row in lines:
g1 = row[0]
n1 = indices[g1]
g2 = row[2]
n2 = indices[g2]
sim = row[5]
D[n1,n2] = D[n2,n1] = float(sim) / 100.
for i in range(filenum):
D[i,i] = 1.0
x = []
for k, v in indices.items():
x.append((v, k))
x.sort()
labeloutname = args.output + '.labels.txt'
print('saving labels to: {}'.format(labeloutname))
with open(labeloutname, 'w') as fp:
fp.write("\n".join([ tup[1] for tup in x ]))
print('saving distance matrix to: {}'.format(args.output))
with open(args.output, 'wb') as fp:
numpy.save(fp, D)
if __name__ == '__main__':
main()
```