---
# System prepended metadata

title: FastANI example
tags: [General]

---

---
tags: General
title: FastANI example
---

# FastANI example
[FastANI](https://github.com/ParBLiSS/FastANI#fastani) is tool for generating a metric telling us something about how similar two genomes are called average nucleotide identity (ANI). It can be run on many genomes, but each value tells us something about a pairwise comparison between two genomes. It is on a scale between 1 and 100 and it is (roughly) the percent of nucleotide bases that are identical between the two genomes 👍

[toc]

## Conda install
Using [mamba](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5) on top of conda, 'cause it's faster. Including my [bit](https://github.com/AstrobioMike/bit#readme) package to be able to download example genomes quickly.


```bash
mamba create -n fastani -c conda-forge -c bioconda -c defaults -c astrobiomike fastani bit -y

conda activate fastani
```

## Getting example genomes

The input genomes need to be fasta files, so here just making a list with a few accessions to pass to `bit-dl-ncbi-assemblies`. Using these:

| isolate ID | tax | assembly acc | isolation date | isolation location |
|--|--|--|--|--| 
| s1 | S. epidermidis | GCA_016809475.1 | 8/14/06 | Node 1 |
| s2 | S. epidermidis | GCA_016808675.1 | 8/14/06 | Node 1 |
| s3 | S. epidermidis | GCA_016809435.1 | 8/14/06 | U.S. LAB |
| s4 | S. epidermidis | GCA_016808655.1 | 7/15/09 | Node 1 |
| s49 | S. epidermidis | GCA_016808775.1 | 4/16/15 | Node 2 |
| s53 | S. epidermidis | GCA_016808695.1 | 11/16/15 | PMM |

```bash
printf "GCA_016809475.1\nGCA_016808675.1\nGCA_016809435.1\nGCA_016808655.1\nGCA_016808775.1\nGCA_016808695.1\n" > target-accs.txt

bit-dl-ncbi-assemblies -f fasta -w target-accs.txt -j 3
```

## Example running FastANI to compare all
We just need a file holding the paths to the fasta files we want to compare, which right now, this will work to create that:

```bash
ls *.gz > genome-paths.txt

head genome-paths.txt | sed 's/^/# /'
# GCA_016808655.1.fa.gz
# GCA_016808675.1.fa.gz
# GCA_016808695.1.fa.gz
# GCA_016808775.1.fa.gz
# GCA_016809435.1.fa.gz
# GCA_016809475.1.fa.gz
```

And running fastANI:

```bash
fastANI --ql genome-paths.txt --rl genome-paths.txt -t 4 -o fastANI-out-raw.tsv
```

The output looks like this, with the columns: 
1. query
2. reference
3. ANI value
4. number of fragments mapped
5. total number of query fragments

```bash
cat fastANI-out-raw.tsv | sed 's/^/# /'
# GCA_016808655.1.fa.gz	GCA_016808655.1.fa.gz	100	793	795
# GCA_016808655.1.fa.gz	GCA_016808695.1.fa.gz	99.6563	762	795
# GCA_016808655.1.fa.gz	GCA_016808775.1.fa.gz	99.388	736	795
# GCA_016808655.1.fa.gz	GCA_016809435.1.fa.gz	99.377	752	795
# GCA_016808655.1.fa.gz	GCA_016809475.1.fa.gz	96.8913	737	795
# GCA_016808655.1.fa.gz	GCA_016808675.1.fa.gz	96.756	734	795
# GCA_016808675.1.fa.gz	GCA_016808675.1.fa.gz	100	780	786
# GCA_016808675.1.fa.gz	GCA_016809475.1.fa.gz	98.5649	734	786
# GCA_016808675.1.fa.gz	GCA_016808775.1.fa.gz	96.8927	727	786
# GCA_016808675.1.fa.gz	GCA_016808655.1.fa.gz	96.7925	730	786
# GCA_016808675.1.fa.gz	GCA_016809435.1.fa.gz	96.7467	736	786
# GCA_016808675.1.fa.gz	GCA_016808695.1.fa.gz	96.7231	739	786
# GCA_016808695.1.fa.gz	GCA_016808695.1.fa.gz	100	793	795
# GCA_016808695.1.fa.gz	GCA_016808655.1.fa.gz	99.684	762	795
# GCA_016808695.1.fa.gz	GCA_016808775.1.fa.gz	99.4361	743	795
# GCA_016808695.1.fa.gz	GCA_016809435.1.fa.gz	99.372	745	795
# GCA_016808695.1.fa.gz	GCA_016809475.1.fa.gz	96.9338	742	795
# GCA_016808695.1.fa.gz	GCA_016808675.1.fa.gz	96.68	740	795
# GCA_016808775.1.fa.gz	GCA_016808775.1.fa.gz	100	776	779
# GCA_016808775.1.fa.gz	GCA_016809435.1.fa.gz	99.4809	738	779
# GCA_016808775.1.fa.gz	GCA_016808695.1.fa.gz	99.4482	743	779
# GCA_016808775.1.fa.gz	GCA_016808655.1.fa.gz	99.4255	736	779
# GCA_016808775.1.fa.gz	GCA_016809475.1.fa.gz	96.963	736	779
# GCA_016808775.1.fa.gz	GCA_016808675.1.fa.gz	96.8906	727	779
# GCA_016809435.1.fa.gz	GCA_016809435.1.fa.gz	100	808	811
# GCA_016809435.1.fa.gz	GCA_016808775.1.fa.gz	99.4869	740	811
# GCA_016809435.1.fa.gz	GCA_016808655.1.fa.gz	99.3886	757	811
# GCA_016809435.1.fa.gz	GCA_016808695.1.fa.gz	99.3089	753	811
# GCA_016809435.1.fa.gz	GCA_016809475.1.fa.gz	96.9509	727	811
# GCA_016809435.1.fa.gz	GCA_016808675.1.fa.gz	96.8241	728	811
# GCA_016809475.1.fa.gz	GCA_016809475.1.fa.gz	100	809	810
# GCA_016809475.1.fa.gz	GCA_016808675.1.fa.gz	98.6019	735	810
# GCA_016809475.1.fa.gz	GCA_016809435.1.fa.gz	97.0084	724	810
# GCA_016809475.1.fa.gz	GCA_016808695.1.fa.gz	96.9251	735	810
# GCA_016809475.1.fa.gz	GCA_016808655.1.fa.gz	96.9205	728	810
# GCA_016809475.1.fa.gz	GCA_016808775.1.fa.gz	96.9141	735	810
```

### Note on output
There are self-comparisons in the default output. And as with all comparisons like these, A vs B can be slightly different than B vs A (both are in the outputs). I have some R code to create a new table always use the longer genome as the query, proportion aligned is reported consistently as being out of the longer of the two genomes. Let me know if you'd like it!

## BONUS: Some clean-up I do
I like to remove the ones that are comparing to themselves (same query and reference genome), and I like to add a column with percent aligned, which is useful information. Can be done any number of ways, here is doing these two things with awk, then sorting by percent aligned and then percent identity:

```bash
awk '$1 != $2 ' fastANI-out-raw.tsv | awk ' { palign = $4 / $5 * 100 } { print $0"\t"palign } ' | sort -nr -k 6,6 -k 4,4 > fastANI-out.tsv
```

Now there are no rows with self-contrasts, and there is a new column at the end of percent aligned:

```bash
cat fastANI-out.tsv | sed 's/^/# /'
# GCA_016808695.1.fa.gz	GCA_016808655.1.fa.gz	99.684	762	795	95.8491
# GCA_016808655.1.fa.gz	GCA_016808695.1.fa.gz	99.6563	762	795	95.8491
# GCA_016808775.1.fa.gz	GCA_016808695.1.fa.gz	99.4482	743	779	95.3787
# GCA_016808775.1.fa.gz	GCA_016809435.1.fa.gz	99.4809	738	779	94.7368
# GCA_016808655.1.fa.gz	GCA_016809435.1.fa.gz	99.377	752	795	94.5912
# GCA_016808775.1.fa.gz	GCA_016809475.1.fa.gz	96.963	736	779	94.4801
# GCA_016808775.1.fa.gz	GCA_016808655.1.fa.gz	99.4255	736	779	94.4801
# GCA_016808675.1.fa.gz	GCA_016808695.1.fa.gz	96.7231	739	786	94.0204
# GCA_016808695.1.fa.gz	GCA_016809435.1.fa.gz	99.372	745	795	93.7107
# GCA_016808675.1.fa.gz	GCA_016809435.1.fa.gz	96.7467	736	786	93.6387
# GCA_016808695.1.fa.gz	GCA_016808775.1.fa.gz	99.4361	743	795	93.4591
# GCA_016808675.1.fa.gz	GCA_016809475.1.fa.gz	98.5649	734	786	93.3842
# GCA_016809435.1.fa.gz	GCA_016808655.1.fa.gz	99.3886	757	811	93.3416
# GCA_016808695.1.fa.gz	GCA_016809475.1.fa.gz	96.9338	742	795	93.3333
# GCA_016808775.1.fa.gz	GCA_016808675.1.fa.gz	96.8906	727	779	93.3248
# GCA_016808695.1.fa.gz	GCA_016808675.1.fa.gz	96.68	740	795	93.0818
# GCA_016808675.1.fa.gz	GCA_016808655.1.fa.gz	96.7925	730	786	92.8753
# GCA_016809435.1.fa.gz	GCA_016808695.1.fa.gz	99.3089	753	811	92.8483
# GCA_016808655.1.fa.gz	GCA_016809475.1.fa.gz	96.8913	737	795	92.7044
# GCA_016808655.1.fa.gz	GCA_016808775.1.fa.gz	99.388	736	795	92.5786
# GCA_016808675.1.fa.gz	GCA_016808775.1.fa.gz	96.8927	727	786	92.4936
# GCA_016808655.1.fa.gz	GCA_016808675.1.fa.gz	96.756	734	795	92.327
# GCA_016809435.1.fa.gz	GCA_016808775.1.fa.gz	99.4869	740	811	91.2454
# GCA_016809475.1.fa.gz	GCA_016808775.1.fa.gz	96.9141	735	810	90.7407
# GCA_016809475.1.fa.gz	GCA_016808695.1.fa.gz	96.9251	735	810	90.7407
# GCA_016809475.1.fa.gz	GCA_016808675.1.fa.gz	98.6019	735	810	90.7407
# GCA_016809475.1.fa.gz	GCA_016808655.1.fa.gz	96.9205	728	810	89.8765
# GCA_016809435.1.fa.gz	GCA_016808675.1.fa.gz	96.8241	728	811	89.7657
# GCA_016809435.1.fa.gz	GCA_016809475.1.fa.gz	96.9509	727	811	89.6424
# GCA_016809475.1.fa.gz	GCA_016809435.1.fa.gz	97.0084	724	810	89.3827
```

And if wanting a header:

```bash
cat <( printf "query\treference\tANI\tnum_frags_mapped\ttotal_query_frags\tpercent_aligned\n" ) fastANI-out.tsv > fastANI-out-with-header.tsv
```

```bash
column -ts $'\t' fastANI-out-with-header.tsv | head | sed 's/^/# /'
# query                  reference              ANI      num_frags_mapped  total_query_frags  percent_aligned
# GCA_016808695.1.fa.gz  GCA_016808655.1.fa.gz  99.684   762               795                  95.8491
# GCA_016808655.1.fa.gz  GCA_016808695.1.fa.gz  99.6563  762               795                  95.8491
# GCA_016808775.1.fa.gz  GCA_016808695.1.fa.gz  99.4482  743               779                  95.3787
# GCA_016808775.1.fa.gz  GCA_016809435.1.fa.gz  99.4809  738               779                  94.7368
# GCA_016808655.1.fa.gz  GCA_016809435.1.fa.gz  99.377   752               795                  94.5912
# GCA_016808775.1.fa.gz  GCA_016809475.1.fa.gz  96.963   736               779                  94.4801
# GCA_016808775.1.fa.gz  GCA_016808655.1.fa.gz  99.4255  736               779                  94.4801
# GCA_016808675.1.fa.gz  GCA_016808695.1.fa.gz  96.7231  739               786                  94.0204
# GCA_016808695.1.fa.gz  GCA_016809435.1.fa.gz  99.372   745               795                  93.7107
```


---

**Side notes**

> It looks like there are some stable Staph populations persisting there across time.

The highest ANI was between s53 (11/16/15 from PMM) and s4 (7/15/09 Node 1): 99.7% ANI with 95% aligning

The lowest ANI was between s3 (8/14/06 from U.S.LAB) and s1 (8/14/06 Node 1).

s1 to s2 (same date and location): 98.6% ANI with 90.1% aligning (or 98.6% ANI with 93.4% aligning)

---
