owned this note
owned this note
Published
Linked with GitHub
# Multiple Genomes Alignment
###### Author: Diana Moreno Santillan dmorenos@ttu.edu
###### Date: August 2020
This pipeline performs multiple genome alignments and consists in three main steps:
1. Perform pairwise alignments with [lastz](http://www.bx.psu.edu/~rsharris/lastz/).
2. Alignment sensitivity and specificity improvement with Michael Hiller's [Genome Alignment Tools](https://github.com/hillerlab/GenomeAlignmentTools) chain/net pipeline.
3. Multiple genome alignment with [multiz-TBA](https://www.bx.psu.edu/miller_lab/) tool.
## Table of Contents
[TOC]
### Software required
- [seqkit](https://bioinf.shenwei.me/seqkit/)
- [faToTwoBit](http://hgdownload.soe.ucsc.edu/admin/exe/)
- [lastz](http://www.bx.psu.edu/~rsharris/lastz/)
- [GenomeAlignmentTools](https://github.com/hillerlab/GenomeAlignmentTools)
- [TBA/multiz](https://www.bx.psu.edu/miller_lab/)
### Input files
#### Query files:
1. Make sure your genomes are softmasked.
2. Edit your genomes headers for a non generic and meaningful header. *For example the headers from human genome hg38 are >chr_1, >chr_2, etc. Change the headers for >hsap_1, >hsap_2 etc.*
```
sed -i 's/chr/hsap/g' hg38.fa
```
3. Convert your fasta file to 2bit format
```
faToTwoBit query.fa query.2bit
```
#### Reference file:
Lastz align a query genome against a reference genome.
1. Make sure your reference genome is softmasked.
2. Edit your genome headers for a non generic and meaningful header.
3. Convert your fasta file to 2bit format
```
faToTwoBit reference.fa reference.2bit
```
:::warning
:pencil: **Note:** For vertebrate genomes, lastz run will take from 1 to 2 weeks. Because the time-wall for the jobs in nocona is 48 hours, we can split the reference genome in multiple chunks and run lastz in each chunk.
:::
##### Split the reference genome
Depend on the size of the reference genome, set up a number of sequences in each chunk. In this case I'll split it to 100 sequences.
```
seqkit split2 -s 100 reference.fa
cd reference.fa.split
```
Convert each chunk to 2bit format.
```
for f in *fa; do OUT=$(basename $f | sed 's/fa/2bit/g'); faToTwoBit $f $OUT; done
```
### Pairwise alignment:
##### Standard run
Lastz can be run with a simple command:
```
lastz reference.2bit'[multiple]' query.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > output_alignment.axt
```
:::warning
:pencil: **Note:** <span style="color:black"> '[multiple]' is a flag that we need to put if the reference has more than one sequence, otherwise lastz will assume that the reference genome is a single huge sequence.</span>
:::
:::info
:bulb: <span style="color:blue">**Important**</span>
To align placental mammals I set the following parameters[^1^](#References)
K=2400, L=3000, Y=3400 and H=2000
For closely related placental mammals I set the following parameters[^2^](#References)
K=4500 and L=4500
:::
##### Lastz for big data
If the reference genome was splitted in multiple chunks, lastz can be run simultaneously with less physical memory, when mapping a large set of different query genomes against the same reference sequence.
For this, we need to create a [target capsule file](http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#adv_capsule) and run each query against the capsule file:
```[bash]
#!/bin/bash
#SBATCH --job-name lastz_1
#SBATCH --output=lastz.log
#SBATCH -p extended-28core
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH -t 07-00:00:00
#Usage sbatch lastz_subjob.sh <refseq.2bit>
REFSEQ=$1
REFX=$(basename $REFSEQ | 's/\.2bit//g')
CAPSULE=${REFX}_capsule_file
export PATH="/gpfs/home/dmorenosanti/lastz-distrib/bin/:$PATH"
#Check if capsule file exists, if not, create it.
if [ ! -f ${CAPSULE} ]; then
lastz ${REFSEQ} --writecapsule=$CAPSULE
else
echo "capsule file already exists"
fi
echo "Running lastz for pairwise alignments against reference genome, ${date}"
lastz --targetcapsule=${CAPSULE} query1.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query1_vs_${REFX}_pw_alignment.axt &
lastz1=$!
lastz --targetcapsule=${CAPSULE} query2.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query2_vs_${REFX}_pw_alignment.axt &
lastz2=$!
lastz --targetcapsule=${CAPSULE} query3.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query3_vs_${REFX}_pw_alignment.axt &
lastz3=$!
lastz --targetcapsule=${CAPSULE} query4.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query4_vs_${REFX}_pw_alignment.axt &
lastz4=$!
#wait for the termination of all runs
wait $lastz1 $lastz2 $lastz3 $lastz4
echo "lastz run has finished for all the genomes"
```
### Chaining
The following pipeline was written following [GenomeAlignmentTools](https://github.com/hillerlab/GenomeAlignmentTools) documentation.
##### 1. Run axtChain
Chain together axt alignments
```bash=
AXTOUT=alignment.axt
REFERENCE=reference_genome.fa
QUERY=query_genome.fa
CHAINOUT=chain_output.chain
axtChain -linearGap=medium ${AXTOUT} -faT ${REFERENCE} -faQ ${QUERY} ${CHAINOUT}
```
##### 2. Run RepeatFiller
To incorporate repetitive regions into pairwise alignment chains, to improve the annotation of conserved non-coding regions[^3^](#References).
```bash=
CHAINOUT=chain_output.chain
REFERENCE=reference.2bit
GENOME=query_genome.2bit
RFOUT=output_refill.chain
python RepeatFiller.py -c ${CHAINOUT} -T2 ${REFERENCE} -Q2 ${GENOME} -o ${RFOUT}
```
##### 3. chainCleaner
```bash=
CHAINOUT=chain_output.chain
REFERENCE=reference.fa
GENOME=query_genome.fa
REF2=reference.2bit
GEN2=query_genome.2bit
RFOUT=output_refill.chain
REFX=$(basename ${REFSEQ} | awk -F '.' '{print $1}')
QUEX=$(basename ${GENOME} | awk -F '_' '{print $1}')
#Get genome size
faSize -detailed ${GENOME} > ${QUEX}.size
faSize -detailed ${REFERENCE} > ${REFX}.size
#Run chainCleanner
chainCleaner ${RFOUT} ${REF2} ${GEN2} ${QUEX}_clean.chain ${QUEX}_clean.bed -tSizes=${REFX}.size -qSizes=${QUEX}.size -linearGap=medium
```
##### 4. Nets
```bash=
chainNet output_clean.chain reference_genome.sizes query_genome.sizes reference.net query.net -rescore -tNibDir=reference_genome.2bit -qNibDir=query_genome.2bit -linearGap=medium
NetFilterNonNested.perl -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 ref.query.net.gz > ref.query.filtered.net
```
##### 5. Convert nets to maf alignment
```bash=
netToAxt ref.query.filtered.net output_clean.chain reference.2bit query.2bit output_alignment.axt
axtSort output_alignment.axt output_alignment_sort.axt
axtToMaf output_alignment_sort.axt reference.sizes query.sizes final_alignment.maf
```
### Multiz
Multiz/TBA needs a specific headers for the genome file. The FASTA header format should be as follows:
string1:string2:int1:char:int2
For more information check the [TBA documentation](https://www.bx.psu.edu/miller_lab/dist/tba_howto.pdf).
The headers can be obtained with the multiz_headers.sh script
```
sh multiz_headers.sh genome_file.fa
```
```
#!/bin/bash
#SBATCH -J multiz
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH -p nocona
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00
GETHEADER=/home/dmorenos/software/multiz/get_standard_headers
RENAME=/home/dmorenos/scripts/rename_headers.py
#1. Modify genome headers
GENOME=$1
PREFIX=$(basename $GENOME | awk -F'[_]' '{print $1}')
get_standard_headers $GENOME > standard_headers.tmp
sed -e 's/\(1-\).*\(:+:\)/\1\2/' standard_headers.tmp | sed ':a;N;$!ba;s/==>\n/:/g;s/ :1-/:1/g' > multiz_headers.tmp
python rename_headers.py -input $GENOME -headers multiz_headers.tmp -output $PREFIX'_multiz_headers.fa'
rm *tmp
```
Before runing MULTIZ make sure that:
- You have change the headers as indicated above.
- Your reference genome has the same prefix in all the sequences (you set up this at the [first step](https://hackmd.io/wucGwX0ZTOOFb3W1-R0mYw#Reference-file)).
- You have the original fasta files of your genomes in the same directory.
- You have a parentetic species tree as an input for multiz, make sure that the name of the species on the tree are the same as the fasta file and header.
- Example: One of your species is human. If in the tree you type ((human),bonobo)) your fasta genome file must be human.fasta and the headers inside that file must be >human_1
```
PATH=$PATH:"/home/dmorenos/software/multiz/"
TEMPDIR="temp_dir"
mkdir ${TEMPDIR}
date
echo "starting multiz"
roast + T=$TEMPDIR E=referenceprefix "(((),()" *.*.sing.maf whole_multiz_bats.maf
```
### References
1. Sharma V. and Hiller M. (2017). Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Research, 45(14):8369-8377.
2. Hecker N. and Hiller M. (2020). A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. GigaScience, 9(1).
###### tags: `Genome analysis` , `Diana`