Multiple Genomes Alignment
Date: August 2020
This pipeline performs multiple genome alignments and consists of three main steps:
- Perform pairwise alignments with lastz.
- Improve alignment sensitivity and specificity with Michael Hiller's Genome Alignment Tools chain/net pipeline.
- Build the multiple genome alignment with the multiz-TBA tool.
Software required
Query files:
- Make sure your genomes are softmasked.
- Replace generic sequence headers with meaningful, species-specific ones. For example, the headers of the human genome hg38 are >chr_1, >chr_2, etc.; change them to >hsap_1, >hsap_2, etc.
- Convert your fasta file to 2bit format.
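The header renaming described above can be sketched in Python. This is a minimal illustration, not the script used by the pipeline: the function name `rename_headers` and the choice to number sequences in file order are assumptions, and the returned mapping is kept so the original names can be recovered later.

```python
def rename_headers(fasta_lines, prefix):
    """Rename FASTA headers to >{prefix}_1, >{prefix}_2, ... in file order.

    Returns the rewritten lines and a list of (old_name, new_name) pairs.
    Numbering in file order is an assumption made for this sketch.
    """
    renamed = []
    mapping = []
    count = 0
    for line in fasta_lines:
        if line.startswith(">"):
            count += 1
            fields = line[1:].split()
            old_name = fields[0] if fields else ""
            new_name = f"{prefix}_{count}"
            mapping.append((old_name, new_name))
            renamed.append(f">{new_name}\n")
        else:
            # Sequence lines are copied unchanged, so soft-masking is preserved.
            renamed.append(line)
    return renamed, mapping


if __name__ == "__main__":
    example = [">chr_1 some description\n", "ACGTacgt\n", ">chr_2\n", "ttgaCCA\n"]
    lines, names = rename_headers(example, "hsap")
    print("".join(lines), end="")
    print(names)
```

Writing the mapping to a small table next to the renamed genome makes it easier to translate coordinates back to the original sequence names.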
Reference file:
Lastz aligns a query genome against a reference genome.
- Make sure your reference genome is softmasked.
- Replace generic sequence headers with meaningful, species-specific ones.
- Convert your fasta file to 2bit format.
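A quick way to confirm that a genome is softmasked is to measure the fraction of lowercase bases. The sketch below assumes the usual convention that soft-masked repeats are written in lowercase; the function name `softmasked_fraction` is hypothetical.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of sequence characters written in lowercase."""
    masked = 0
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


if __name__ == "__main__":
    example = [">ref_1\n", "ACgt\n", "NNnn\n"]
    print(f"soft-masked fraction: {softmasked_fraction(example):.2f}")
```

A value of 0.0 suggests the genome is unmasked or hard-masked, in which case it should be masked before running lastz.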
Note: For vertebrate genomes, a lastz run takes 1 to 2 weeks. Because the wall-time limit for jobs on nocona is 48 hours, we split the reference genome into multiple chunks and run lastz on each chunk.
Split the reference genome
Depending on the size of the reference genome, choose the number of sequences per chunk. In this case I'll use 100 sequences per chunk.
Convert each chunk to 2bit format.
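The splitting step can be sketched as follows. This is an illustrative Python version, assuming chunks are defined by a fixed number of sequences; the function name `split_fasta` and the output naming in the usage note are hypothetical.

```python
def split_fasta(fasta_lines, seqs_per_chunk=100):
    """Group FASTA lines into chunks holding at most seqs_per_chunk sequences."""
    chunks = []
    current = []
    n_seqs = 0
    for line in fasta_lines:
        if line.startswith(">"):
            if n_seqs == seqs_per_chunk:
                # The current chunk is full: store it and start a new one.
                chunks.append(current)
                current = []
                n_seqs = 0
            n_seqs += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    example = []
    for i in range(1, 6):
        example += [f">ref_{i}\n", "ACGTacgt\n"]
    for index, chunk in enumerate(split_fasta(example, seqs_per_chunk=2), start=1):
        headers = [line.strip() for line in chunk if line.startswith(">")]
        print(f"chunk_{index}: {headers}")
```

Each chunk can then be written to its own file (for example chunk_1.fa, chunk_2.fa) and converted to 2bit with whichever converter you use, such as the UCSC faToTwoBit utility.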
Pairwise alignment:
Standard run
Lastz can be run with a simple command:
Note: '[multiple]' is a flag appended to the reference file name. It is required when the reference has more than one sequence; otherwise lastz will assume that the reference genome is a single huge sequence.
Important
To align placental mammals I set the following parameters [1]:
K=2400, L=3000, Y=3400 and H=2000
For closely related placental mammals I set the following parameters [2]:
K=4500 and L=4500
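The two parameter sets can be kept as presets and turned into a lastz command line. The sketch below only builds the argument list; the file names, the preset labels and the function name `lastz_command` are hypothetical, and axt output is assumed because the chaining step below starts from axt alignments.

```python
# Parameter sets quoted above; the preset labels are made up for this sketch.
PRESETS = {
    "placental": {"K": 2400, "L": 3000, "Y": 3400, "H": 2000},
    "closely_related": {"K": 4500, "L": 4500},
}


def lastz_command(target, query, output, preset="placental"):
    """Build a lastz argument list for a multi-sequence reference."""
    # [multiple] is appended to the reference file name.
    cmd = ["lastz", f"{target}[multiple]", query]
    cmd += [f"{name}={value}" for name, value in PRESETS[preset].items()]
    cmd += ["--format=axt", f"--output={output}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(lastz_command("ref_chunk_1.2bit", "query.2bit", "chunk_1.axt")))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with the square brackets in the reference file name.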
Lastz for big data
If the reference genome was split into multiple chunks, several lastz jobs can run simultaneously while using less physical memory. This helps when mapping a large set of different query genomes against the same reference sequence.
For this, we need to create a target capsule file and run each query against the capsule file:
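A minimal sketch of the two capsule steps, again as argument lists. It assumes the lastz options `--writecapsule` and `--targetcapsule` as I understand them from the lastz documentation: the capsule is built once per reference chunk, and the reference file is not repeated when aligning against it. File and function names are hypothetical; check the lastz documentation for which options may still be changed at alignment time.

```python
def write_capsule_command(target, capsule):
    """Build the command that stores a reference chunk as a target capsule."""
    return ["lastz", f"{target}[multiple]", f"--writecapsule={capsule}"]


def capsule_align_command(capsule, query, output):
    """Build the command that aligns one query genome against a capsule."""
    return [
        "lastz",
        f"--targetcapsule={capsule}",
        query,
        "--format=axt",
        f"--output={output}",
    ]


if __name__ == "__main__":
    print(" ".join(write_capsule_command("ref_chunk_1.2bit", "ref_chunk_1.capsule")))
    for species in ["query_a", "query_b"]:
        cmd = capsule_align_command(
            "ref_chunk_1.capsule", f"{species}.2bit", f"{species}.chunk_1.axt"
        )
        print(" ".join(cmd))
```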
Chaining
The following steps follow the GenomeAlignmentTools documentation.
1. Run axtChain
Chain together the axt alignments.
2. Run RepeatFiller
RepeatFiller incorporates repetitive regions into the pairwise alignment chains, which improves the annotation of conserved non-coding regions [3].
3. chainCleaner
4. Nets
5. Convert nets to maf alignment
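Steps 1, 4 and 5 can be sketched with the UCSC command-line tools axtChain, chainNet, netToAxt and axtToMaf. This is an outline under assumptions, not the exact pipeline: file names and the function name are hypothetical, `-linearGap=loose` is only one of the available gap settings, and intermediate sorting or filtering is omitted. RepeatFiller and chainCleaner (steps 2 and 3) are left out on purpose; take their options from the GenomeAlignmentTools documentation.

```python
def chain_net_commands(axt, target_2bit, query_2bit, target_sizes, query_sizes, prefix):
    """Return argument lists for chaining, netting and MAF conversion."""
    chain = f"{prefix}.chain"
    target_net = f"{prefix}.target.net"
    query_net = f"{prefix}.query.net"
    net_axt = f"{prefix}.net.axt"
    maf = f"{prefix}.maf"
    return [
        # Step 1: chain the axt alignments.
        ["axtChain", "-linearGap=loose", axt, target_2bit, query_2bit, chain],
        # Step 4: build nets from the chains (cleaned chains in the full pipeline).
        ["chainNet", chain, target_sizes, query_sizes, target_net, query_net],
        # Step 5: convert the target net back to axt, then to maf.
        ["netToAxt", target_net, chain, target_2bit, query_2bit, net_axt],
        ["axtToMaf", net_axt, target_sizes, query_sizes, maf],
    ]


if __name__ == "__main__":
    commands = chain_net_commands(
        "query.axt", "ref.2bit", "query.2bit", "ref.sizes", "query.sizes", "ref.query"
    )
    for cmd in commands:
        print(" ".join(cmd))
```

The `.sizes` files are tab-separated tables of sequence names and lengths for each genome.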
Multiz
Multiz/TBA requires a specific header format in the genome files. The FASTA header format should be as follows:
string1:string2:int1:char:int2
For more information check the TBA documentation.
The headers can be obtained with the multiz_headers.sh script
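To illustrate the header format, here is a small Python sketch that assembles one header. It is not the multiz_headers.sh script. The interpretation of the five fields (species name, sequence name, start position, strand, sequence length) is my reading of the TBA documentation and should be checked against it; the function name is hypothetical.

```python
def multiz_header(species, seq_name, start, strand, length):
    """Format a header as string1:string2:int1:char:int2."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # int() rejects non-numeric values for the two integer fields.
    return f">{species}:{seq_name}:{int(start)}:{strand}:{int(length)}"


if __name__ == "__main__":
    print(multiz_header("human", "human_1", 1, "+", 248956422))
```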
Before running MULTIZ make sure that:
- You have changed the headers as indicated above.
- Your reference genome has the same prefix in all the sequences (you set this up in the first step).
- You have the original fasta files of your genomes in the same directory.
- You have a parenthetic species tree as input for multiz, and the species names in the tree match the fasta file names and headers.
- Example: One of your species is human. If in the tree you type ((human),bonobo), your fasta genome file must be human.fasta and the headers inside that file must be >human_1
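The name check in the list above can be automated. The sketch below extracts species names from a parenthetic tree and reports those without a matching fasta file. It assumes the tree carries no branch lengths or internal node labels, and the function names are hypothetical.

```python
import re


def tree_species(tree):
    """Return the set of leaf names found in a parenthetic tree string."""
    # Names are runs of characters other than whitespace, parentheses, commas, semicolons.
    return set(re.findall(r"[^\s(),;]+", tree))


def missing_genomes(tree, fasta_names):
    """List tree species that have no fasta file with the same base name."""
    stems = {name.rsplit(".", 1)[0] for name in fasta_names}
    return sorted(tree_species(tree) - stems)


if __name__ == "__main__":
    tree = "((human chimp) bonobo)"
    files = ["human.fasta", "chimp.fasta"]
    print(missing_genomes(tree, files))
```

An empty list means every species in the tree has a fasta file with the same base name; the headers inside each file still need to be checked separately.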
References
1. Sharma V. and Hiller M. (2017). Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Research, 45(14):8369-8377.
2. Hecker N. and Hiller M. (2020). A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. GigaScience, 9(1).