This pipeline performs multiple genome alignments and consists of three main steps:
Lastz: align a query genome against a reference genome.
Make sure your reference genome is softmasked.
Edit your genome headers so that each one is meaningful rather than generic.
Convert your FASTA file to 2bit format.
Note: For vertebrate genomes, a lastz run can take 1 to 2 weeks. Because the wall-time limit for jobs on nocona is 48 hours, we can split the reference genome into multiple chunks and run lastz on each chunk.
Depending on the size of the reference genome, choose the number of sequences to put in each chunk. In this case I'll use 100 sequences per chunk.
Convert each chunk to 2bit format.
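The chunking step above could be sketched in Python as follows. This is only an illustration, not the script used in this pipeline: the function names, the `chunk_0000.fa` naming scheme, and the output directory are assumptions made for the example. It writes consecutive groups of `seqs_per_chunk` FASTA records to numbered files, which can then be converted to 2bit one by one.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines (header line first)."""
    record: List[str] = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield record
                record = []
            if line.strip():  # skip blank lines
                record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def split_fasta(path: Path, out_dir: Path, seqs_per_chunk: int = 100) -> List[Path]:
    """Write groups of `seqs_per_chunk` records to numbered chunk files."""
    if seqs_per_chunk < 1:
        raise ValueError("seqs_per_chunk must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    out = None
    for index, record in enumerate(read_fasta_records(path)):
        if index % seqs_per_chunk == 0:
            # Start a new chunk file every `seqs_per_chunk` records.
            if out is not None:
                out.close()
            # Hypothetical naming scheme: chunk_0000.fa, chunk_0001.fa, ...
            chunk_path = out_dir / f"chunk_{index // seqs_per_chunk:04d}.fa"
            chunk_paths.append(chunk_path)
            out = open(chunk_path, "w")
        out.writelines(record)
    if out is not None:
        out.close()
    return chunk_paths
```

For example, `split_fasta(Path("reference.fa"), Path("chunks"), seqs_per_chunk=100)` would return the list of chunk files it wrote (`reference.fa` and `chunks` are placeholder names).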
Lastz can be run with a simple command:
Note: '[multiple]' is a flag that must be added when the reference has more than one sequence; otherwise lastz will assume that the reference genome is a single huge sequence.
Important
To align placental mammals I set the following parameters [1]:
K=2400, L=3000, Y=3400 and H=2000
For closely related placental mammals I set the following parameters [2]:
K=4500 and L=4500
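As a rough sketch of how these parameter sets could be turned into a lastz call, the helper below assembles the argument list for one target and one query. The K, L, Y and H values are the ones quoted above. The helper name, the file names, and the choice of `--format=axt` (picked because the next step chains axt alignments) are assumptions made for illustration, not necessarily the exact command used here.

```python
from typing import Dict, List

# Parameter sets quoted from the text above.
PLACENTAL: Dict[str, int] = {"K": 2400, "L": 3000, "Y": 3400, "H": 2000}
CLOSE_PLACENTAL: Dict[str, int] = {"K": 4500, "L": 4500}


def lastz_command(target: str, query: str, params: Dict[str, int],
                  multiple: bool = True, lastz: str = "lastz") -> List[str]:
    """Build a lastz argument list (hypothetical helper, not part of lastz)."""
    # '[multiple]' tells lastz that the target file holds several sequences.
    target_spec = f"{target}[multiple]" if multiple else target
    # lastz accepts the short NAME=value form for these scoring options.
    scoring = [f"{name}={value}" for name, value in params.items()]
    # axt output is assumed here because the next step chains axt alignments.
    return [lastz, target_spec, query, *scoring, "--format=axt"]


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(" ".join(lastz_command("reference.2bit", "query.2bit", PLACENTAL)))
```

The list can be passed to `subprocess.run` with standard output redirected to a file, which avoids shell quoting problems with the square brackets in `[multiple]`.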
If the reference genome was split into multiple chunks, several lastz jobs can run simultaneously with less physical memory when mapping a large set of different query genomes against the same reference sequence.
For this, we need to create a target capsule file and then run each query against it:
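A minimal sketch of the two stages, assuming lastz's `--writecapsule` and `--targetcapsule` options: one command builds the capsule from a reference chunk, and one command per query aligns against it. The helper name, the file names, the use of `[multiple]` when writing the capsule, and the axt output format are assumptions for illustration; check the lastz documentation for which options must be given when the capsule is created.

```python
from typing import List, Tuple


def capsule_commands(target: str, capsule: str, queries: List[str],
                     multiple: bool = True) -> Tuple[List[str], List[List[str]]]:
    """Return (build_command, per_query_commands) as argument lists.

    Hypothetical helper: it only assembles commands and does not run lastz.
    """
    target_spec = f"{target}[multiple]" if multiple else target
    # Stage 1: write the capsule once for this reference chunk.
    build = ["lastz", target_spec, f"--writecapsule={capsule}"]
    # Stage 2: one alignment command per query, all sharing the same capsule.
    runs = [
        ["lastz", f"--targetcapsule={capsule}", query, "--format=axt"]
        for query in queries
    ]
    return build, runs


if __name__ == "__main__":
    # Placeholder file names; print the commands instead of running them.
    build, runs = capsule_commands(
        "chunk_0000.fa", "chunk_0000.capsule", ["speciesA.2bit", "speciesB.2bit"]
    )
    print(" ".join(build))
    for run in runs:
        print(" ".join(run))
```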
The following steps were written based on the GenomeAlignmentTools documentation.
Chain together axt alignments
Incorporate repetitive regions into the pairwise alignment chains to improve the annotation of conserved non-coding regions [3].
Multiz/TBA needs a specific header format in the genome file. The FASTA header format should be as follows:
string1:string2:int1:char:int2
For more information check the TBA documentation.
The headers can be obtained with the multiz_headers.sh script
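To check that a rewritten header matches the `string1:string2:int1:char:int2` pattern before running Multiz/TBA, a small validator like the one below may help. It is not the `multiz_headers.sh` script, just an independent sketch. Restricting `char` to `+` or `-` is an assumption based on my reading of the TBA documentation, and the example header is made up.

```python
import re
from typing import Optional, Tuple

# string1:string2:int1:char:int2, with char assumed to be '+' or '-'.
HEADER_RE = re.compile(r"^>([^:\s]+):([^:\s]+):(\d+):([+-]):(\d+)$")


def parse_tba_header(line: str) -> Optional[Tuple[str, str, int, str, int]]:
    """Return the five header fields, or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    string1, string2, int1, char, int2 = match.groups()
    return (string1, string2, int(int1), char, int(int2))


if __name__ == "__main__":
    # Made-up headers for illustration only.
    print(parse_tba_header(">speciesA:scaffold_1:1:+:52340"))
    print(parse_tba_header(">scaffold_1"))
```

A header that returns `None` does not follow the five-field layout and would need to be fixed first.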
Before running MULTIZ, make sure that:
Genome analysis
, Diana