FastANI example

--- tags: Nadjet title: FastANI example --- # FastANI example [FastANI](https://github.com/ParBLiSS/FastANI#fastani) is tool for generating a metric telling us something about how similar two genomes are called average nucleotide identity (ANI). It can be run on many genomes, but each value tells us something about a pairwise comparison between two genomes. It is on a scale between 1 and 100 and it is (roughly) the percent of nucleotide bases that are identical between the two genomes 👍 [toc] # Setup We already got a Unix-like environment on your machine (woohoo!), so you'd want to open up the Windows Subsystem for Linux (WSL) command-line. --- > **NOTE** > Entering `explorer.exe .` on the WSL command line will open the explorer window at that location! --- ## Making our conda environment Conda is an awesome environment and program manager I have an intro for [here](https://astrobiomike.github.io/unix/conda-intro) if you want to become more familiar with it. We already created the environment we're going to use on your computer, but here is the line we used and explanation just to have it noted here: ```bash conda create -n ISS-bugs -c conda-forge -c bioconda -c defaults -c astrobiomike fastani bit ``` --- > **Code Breakdown** > * `conda` > - this is our primary command > * `create` > - this is a sort of sub-command we are passing to the `conda` program > * `-n ISS-bugs` > - the `-n` parameter is for specifying the name we want to give the environment, which can be anything, but I think we used "ISS-bugs" > * `-c ...` > - the `-c` parameter is for "channel", and it tells the `conda` program where to look for the programs we are trying to install (and their dependencies, which is why we might need to list more channels than programs we are listing like in this case here) > * `fastani bit` > - these are the programs we want to install which we provide as [positional arguments](https://astrobiomike.github.io/unix/getting-started#running-commands) at the end --- ## Activating our conda environment It is important to activate the environment we created. If we try to run a command from one of the programs we installed into the "ISS-bugs" environment when we are not in that environment, we will get a command not found error. Here's how we activate it: ```bash conda activate ISS-bugs ``` And our prompt at the command line should change, and in parentheses in front of it should be the name of the conda environment we just entered. We can check but looking for the help menu of one of the programs. Often, this can be done by providing the program and a `-h`, e.g.: ```bash fastANI -h ``` ***Prints out*** ``` ----------------- fastANI is a fast alignment-free implementation for computing whole-genome Average Nucleotide Identity (ANI) between genomes ----------------- Example usage: $ fastANI -q genome1.fa -r genome2.fa -o output.txt $ fastANI -q genome1.fa --rl genome_list.txt -o output.txt Available options ----------------- -h, --help Print this help page -r <value>, --ref <value> reference genome (fasta/fastq)[.gz] --refList <value>, --rl <value> a file containing list of reference genome files, one genome per line -q <value>, --query <value> query genome (fasta/fastq)[.gz] --ql <value>, --queryList <value> a file containing list of query genome files, one genome per line -k <value>, --kmer <value> kmer size <= 16 [default : 16] -t <value>, --threads <value> thread count for parallel execution [default : 1] --fragLen <value> fragment length [default : 3,000] --minFraction <value> minimum fraction of genome that must be shared for trusting ANI. If reference and query genome size differ, smaller one among the two is considered. [default : 0.2] --visualize output mappings for visualization, can be enabled for single genome to single genome comparison only [disabled by default] --matrix also output ANI values as lower triangular matrix (format inspired from phylip). If enabled, you should expect an output file with .matrix extension [disabled by default] -o <value>, --output <value> [required] output file name -v, --version Show version ``` ## Changing into a working directory At the command line, we want to "go" somewhere we want to do this work. For now, just copying and pasting this codeblock should do: ```bash cd mkdir fastani-try cd fastani-try ``` > You rememember from our [Unix intro](https://astrobiomike.github.io/unix/unix-intro) and/or our troubleshooting the other day: > * `cd` is for "change directory". When we run it without any arguments like the first one here, it brings us to our "[home](https://astrobiomike.github.io/unix/getting-started#the-unix-file-system-structure)" location. When we use it at the end, we are telling it the directory we want to change into. > * `mkdir` is the command to make a directory (folder), and we give it the name of the directory we want to make. # Getting genomes I have a program in my [bioinf tools (bit) package](https://github.com/AstrobioMike/bit#bioinformatics-tools-bit) that we installed above that let's us download genomes by providing their assembly accession numbers. These start with either "GCA_..." (NCBI's GenBank) or "GCF_..." (a more highly curated subset of GenBank, called RefSeq). ## Target genome accessions The *Staphylococcus epidermidis* in GLDS-361 has accession "GCA_017167385.1" listed on the GLDS page in the "Assays/Measurements" table. And it turns out there is a RefSeq version too, just with GC**F** in front, so we are going to use that one: **GCF_017167385.1** The other 3 you tracked down and sent me were: |Bug|Accession| |---|---------| |*Staphylococcus saprophyticus* | GCF_000010125.1 | |*Micrococcus luteus* | GCF_000023205.1 | |*Staphylococcus epidermidis* | GCF_000011925.1 | To download the genomes with the `bit` program, we need to put these wanted accessions into a text file. We can do that anyway we want, but here's one way we can at the command-line, so that we can just copy and paste this codeblock to make the file: ```bash printf "GCF_017167385.1\nGCF_000010125.1\nGCF_000023205.1\nGCF_000011925.1\n" > wanted-accessions.txt ``` --- > **Code Breakdown** > * `printf` > - this is a command that allows us to format and print out text > * `"GCF_..."` > - everything in within the quotes is one positional argument we are giving to the `printf` program, which holds each accession we want joined by `\n` > - `\n` represents a newline, it's saying basically to press return between each of the accessions > * `> wanted-accessions.txt` > - the `>` [redirector](https://astrobiomike.github.io/unix/wild-redirectors#redirectors) tells the command line to send the output to a file so we can save it rather than just printing it to the screen in front of us, and we are naming that file with the "wanted-accessions.txt" we have there --- Now if we check with our list command, `ls`, we can see there is a file here called wanted-accessions.txt. And if we run the `head` command on it, it will print out contents: ```bash ls head wanted-accessions.txt ``` ![](https://i.imgur.com/GiWqswu.png) ## Downloading them Now we're ready to pass that file to the download program: ```bash bit-dl-ncbi-assemblies -w wanted-accessions.txt -f fasta ``` --- > **Code Breakdown** > * `bit-dl-ncbi-assemblies` > - this is our main command > * `-w` > - here is where we provide the file holding our target accessions, one per line > * `-f` > - here we are specifying the format we want to download the genome in > > We could see the help menu with `bit-dl-ncbi-assemblies -h` if we wanted. At first the program needs to download 2 tables holding the NCBI genome information, as it needs those to find and build the links to download them. Then it will download our target genomes. Afterwards, we have our files here in gzip format: ```bash ls ``` ![](https://i.imgur.com/3wKnh54.png) # Running fastANI Remember we can see the help menu for this program with: ```bash fastANI -h ``` ## Comparing 1 genome to 1 genome At the top, it has this example usage, comparing one genome to one other genome: > fastANI -q genome1.fa -r genome2.fa -o output.txt And looking further into the help menu we can see `-q` is for our "query" genome, and `-r` is for our "reference" genome (for our purposes here it doesn't matter too much which we use for which – differences in the output are negligible), and it notes it can take gzipped files (gz) – so we don't need to decompress ours before running them. Here's how we'll run the command on our GLDS-361 *Staphylococcus epidermidis* (GCF_017167385.1) and the *S. epidermidis* from ATCC (GCF_000011925.1): ```bash fastANI -q GCF_017167385.1.fa.gz -r GCF_000011925.1.fa.gz -o GLDS-361-S-epi-vs-ATCC-S-epi.tsv ``` --- > **Code Breakdown** > * `fastANI` > - main command > * `-q` > - one of our genomes is provided as the "query" > * `-r` > - the other is provided as the "reference" > * `-o` > - here we specify the name we want for the output file --- That finishes lightning fast, printing out some detailed run info that I looked into once and don't remember anymore, and it created our ouput file, which looks like this: ```bash head GLDS-361-S-epi-vs-ATCC-S-epi.tsv ``` ```bash # GCF_017167385.1.fa.gz GCF_000011925.1.fa.gz 98.2025 758 811 ``` The **1st** column is the query genome, **2nd** is the reference genome. The **3rd** column is the calculated percent average nucleotide identity (here 98.2025). The **4th** column is how many fragments aligned successfully (758) of the total number of fragments in the query genome (811). (Genomes are broken into pieces and aligned to account for the fact that they may not be oriented the same from start to finish, not super important right now, but just to note why we are talking about fragments of the genomes). This output is described [here on the fastANI github](https://github.com/ParBLiSS/FastANI#an-example-run). **So I would say GCF_017167385.1 has a 98.2% ANI to GCF_000011925.1, with ~94% of the genome aligning (758/811).** ## Comparing 1 genome to multiple genomes The second example usage in the `fastANI -h` menu is this: > fastANI -q genome1.fa --rl genome_list.txt -o output.txt Where we can give it a list of reference genomes in a file, rather than just a single one. So if we want to compare the GLDS-361 *S. epidermidis* (GCF_017167385.1) to all 3 from the ATCC list you sent (*S. saprophyticus*: GCF_000010125.1 | *Micrococcus luteus*: GCF_000023205.1 | *S. epidermidis*: GCF_000011925.1), we just need to put those 3 reference genome filenames into their own file – similar to when we made a list of the accessions we wanted to download in one file above. Here's one way we can do that again, just so there is a codeblock to copy and paste: ```bash printf "GCF_000010125.1.fa.gz\nGCF_000023205.1.fa.gz\nGCF_000011925.1.fa.gz\n" > ref-genome-list.txt ``` > See above for explanation of `printf` command we used last time. Now we have the 3 filenames we want to compare the GLDS-361 *S. epidermidis* to in the "ref-genome-list.txt" file: ``` head ref-genome-list.txt ``` ![](https://i.imgur.com/XiGXCfp.png) And we can run the `fastANI` command like so: ```bash fastANI -q GCF_017167385.1.fa.gz --rl ref-genome-list.txt -o GLDS-361-S-epi-vs-3-ATCC-refs.tsv ``` --- > **Code Breakdown** > - main command > * `-q` > - one of our genomes is provided as the "query" > * `--rl` > - a plain text file holding the filenames of the genomes we want to compare our "query" genome against > * `-o` > - here we specify the name we want for the output file --- That finishes virtually instantly also, and if we look at our ouput file, there is a row for only 2 of the 3 comparisons: ``` head GLDS-361-S-epi-vs-3-ATCC-refs.tsv ``` ``` # GCF_017167385.1.fa.gz GCF_000011925.1.fa.gz 98.2025 758 811 # GCF_017167385.1.fa.gz GCF_000010125.1.fa.gz 78.4674 237 811 ``` With the missing one being *Micrococcus luteus*. This is just because average nucleotide identity is really only useful/meaningful on genomes with near 80% ID or higher, and for those that are too divergent, `fastANI` doesn't report them. Here we see the same results as before when comparing to the *S. epidermidis* (~98% ANI with ~93% of the query genome aligning), and to the *S. saprophyticus* we see ~78% ANI with only ~30% of the query genome aligning. # Bonus: calculating percent aligned with `awk` We've been calculating percent aligned manually. We could easily take this output file, open in Excel, and that's totally fine to do! If we are working on "science", always do what works and don't think twice about it. If we are in "learning/practicing" mode, or we needed to do something programatically (like if a file were too large to open on our computers or something), then I say try new things 🙂 But one way we can do it programatically is with `awk`. We briefly touched upon `awk` in our [six glorious commands page](https://astrobiomike.github.io/unix/six-glorious-commands#awk), but it's definitely not very intuitive at first. So don't worry if this looks strange. I almost always need to look at previous examples whenever I'm building a new `awk` command in order to remind me of how to use it. Here's how we can add the percent aligned info we've been doing manually: ``` awk ' { palign = $4 / $5 * 100 } { print $0"\t"palign } ' GLDS-361-S-epi-vs-3-ATCC-refs.tsv ``` --- > **Code Breakdown** > * `awk` > - the main command > * `' { palign...'` > - this whole messy thing in single quotes is being provided to `awk` as one positional argument, here's what's going on > - there are two steps being performed *for each line in the file we are giving it*, these are in each of the curly brackets `{}` > - the first is creating a variable named "palign" and setting it to the 4th column divided by the 5th column times 100 (to get our percent aligned) > - the second one is saying to print the full line (which is what `$0` means in awk, then put at tab character (signified by the `"\t"`, then the contents of our variable `palign` that we calculated in step 1 for that line) > * lastly, after the single-quote expression ends, we have the file we want to act on, which we are passing as a positional argument > > Again, don't worry that this is super-confusing! That's normal until we see it a billion times 🙂 That prints out a new column added on with our percent alignments: ![](https://i.imgur.com/CpGAxT3.png) And if we wanted to write it to a file instead of printint it out, we'd just add the `>` redirector with a filename at the end, e.g.: ```bash awk ' { palign = $4 / $5 * 100 } { print $0"\t"palign } ' GLDS-361-S-epi-vs-3-ATCC-refs.tsv > GLDS-361-S-epi-vs-3-ATCC-refs-with-percent-of-query-aligned.tsv ``` And now it's saved to a file: ```bash head GLDS-361-S-epi-vs-3-ATCC-refs-with-percent-of-query-aligned.tsv ``` ![](https://i.imgur.com/Ok21uBe.png) > Again, this bonus section was just to show it. There is likely no reason *not* to just take these tables and work with them in a speadsheet program or wherever you're comfortable. I do that all the time too 🙂