Try โ€‚โ€‰HackMD

Multiple Genomes Alignment

Author: Diana Moreno Santillan dmorenos@ttu.edu
Date: August 2020

This pipeline performs multiple genome alignments and consists in three main steps:

  1. Perform pairwise alignments with lastz.
  2. Alignment sensitivity and specificity improvement with Michael Hiller's Genome Alignment Tools chain/net pipeline.
  3. Multiple genome alignment with multiz-TBA tool.

Table of Contents

Software required

Input files

Query files:

  1. Make sure your genomes are softmasked.
  2. Edit your genomes headers for a non generic and meaningful header. For example the headers from human genome hg38 are >chr_1, >chr_2, etc. Change the headers for >hsap_1, >hsap_2 etc.
sed -i 's/chr/hsap/g' hg38.fa 
  1. Convert your fasta file to 2bit format
faToTwoBit query.fa query.2bit

Reference file:

Lastz align a query genome against a reference genome.

  1. Make sure your reference genome is softmasked.

  2. Edit your genome headers for a non generic and meaningful header.

  3. Convert your fasta file to 2bit format

faToTwoBit reference.fa reference.2bit

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Note: For vertebrate genomes, lastz run will take from 1 to 2 weeks. Because the time-wall for the jobs in nocona is 48 hours, we can split the reference genome in multiple chunks and run lastz in each chunk.

Split the reference genome

Depend on the size of the reference genome, set up a number of sequences in each chunk. In this case I'll split it to 100 sequences.

seqkit split2 -s 100 reference.fa

cd reference.fa.split

Convert each chunk to 2bit format.

for f in *fa; do OUT=$(basename $f | sed 's/fa/2bit/g'); faToTwoBit $f $OUT; done

Pairwise alignment:

Standard run

Lastz can be run with a simple command:

lastz reference.2bit'[multiple]' query.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > output_alignment.axt

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Note: '[multiple]' is a flag that we need to put if the reference has more than one sequence, otherwise lastz will assume that the reference genome is a single huge sequence.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Important
To align placental mammals I set the following parameters1
K=2400, L=3000, Y=3400 and H=2000

For closely related placental mammals I set the following parameters2
K=4500 and L=4500

Lastz for big data

If the reference genome was splitted in multiple chunks, lastz can be run simultaneously with less physical memory, when mapping a large set of different query genomes against the same reference sequence.

For this, we need to create a target capsule file and run each query against the capsule file:

#!/bin/bash
#SBATCH --job-name lastz_1
#SBATCH --output=lastz.log
#SBATCH -p extended-28core
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH -t 07-00:00:00

#Usage sbatch lastz_subjob.sh <refseq.2bit>


REFSEQ=$1
REFX=$(basename $REFSEQ | 's/\.2bit//g') 
CAPSULE=${REFX}_capsule_file

export PATH="/gpfs/home/dmorenosanti/lastz-distrib/bin/:$PATH"

#Check if capsule file exists, if not, create it.

if [ ! -f ${CAPSULE} ]; then
    lastz ${REFSEQ} --writecapsule=$CAPSULE
else
    echo "capsule file already exists"
fi


echo "Running lastz for pairwise alignments against reference genome, ${date}"

lastz --targetcapsule=${CAPSULE} query1.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query1_vs_${REFX}_pw_alignment.axt &
lastz1=$!
lastz --targetcapsule=${CAPSULE} query2.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query2_vs_${REFX}_pw_alignment.axt &
lastz2=$!
lastz --targetcapsule=${CAPSULE} query3.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query3_vs_${REFX}_pw_alignment.axt &
lastz3=$!
lastz --targetcapsule=${CAPSULE} query4.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query4_vs_${REFX}_pw_alignment.axt &
lastz4=$!

#wait for the termination of all runs

wait $lastz1 $lastz2 $lastz3 $lastz4 

echo "lastz run has finished for all the genomes"

Chaining

The following pipeline was written following GenomeAlignmentTools documentation.

1. Run axtChain

Chain together axt alignments

AXTOUT=alignment.axt REFERENCE=reference_genome.fa QUERY=query_genome.fa CHAINOUT=chain_output.chain axtChain -linearGap=medium ${AXTOUT} -faT ${REFERENCE} -faQ ${QUERY} ${CHAINOUT}
2. Run RepeatFiller

To incorporate repetitive regions into pairwise alignment chains, to improve the annotation of conserved non-coding regions3.

CHAINOUT=chain_output.chain REFERENCE=reference.2bit GENOME=query_genome.2bit RFOUT=output_refill.chain python RepeatFiller.py -c ${CHAINOUT} -T2 ${REFERENCE} -Q2 ${GENOME} -o ${RFOUT}
3. chainCleaner
CHAINOUT=chain_output.chain REFERENCE=reference.fa GENOME=query_genome.fa REF2=reference.2bit GEN2=query_genome.2bit RFOUT=output_refill.chain REFX=$(basename ${REFSEQ} | awk -F '.' '{print $1}') QUEX=$(basename ${GENOME} | awk -F '_' '{print $1}') #Get genome size faSize -detailed ${GENOME} > ${QUEX}.size faSize -detailed ${REFERENCE} > ${REFX}.size #Run chainCleanner chainCleaner ${RFOUT} ${REF2} ${GEN2} ${QUEX}_clean.chain ${QUEX}_clean.bed -tSizes=${REFX}.size -qSizes=${QUEX}.size -linearGap=medium
4. Nets
chainNet output_clean.chain reference_genome.sizes query_genome.sizes reference.net query.net -rescore -tNibDir=reference_genome.2bit -qNibDir=query_genome.2bit -linearGap=medium NetFilterNonNested.perl -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 ref.query.net.gz > ref.query.filtered.net
5. Convert nets to maf alignment
netToAxt ref.query.filtered.net output_clean.chain reference.2bit query.2bit output_alignment.axt axtSort output_alignment.axt output_alignment_sort.axt axtToMaf output_alignment_sort.axt reference.sizes query.sizes final_alignment.maf

Multiz

Multiz/TBA needs a specific headers for the genome file. The FASTA header format should be as follows:

string1:string2:int1:char:int2

For more information check the TBA documentation.

The headers can be obtained with the multiz_headers.sh script

sh multiz_headers.sh genome_file.fa
#!/bin/bash
#SBATCH -J multiz
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH -p nocona
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00

GETHEADER=/home/dmorenos/software/multiz/get_standard_headers
RENAME=/home/dmorenos/scripts/rename_headers.py

#1. Modify genome headers
GENOME=$1
PREFIX=$(basename $GENOME | awk -F'[_]' '{print $1}')

get_standard_headers $GENOME > standard_headers.tmp

sed -e 's/\(1-\).*\(:+:\)/\1\2/' standard_headers.tmp | sed ':a;N;$!ba;s/==>\n/:/g;s/ :1-/:1/g' > multiz_headers.tmp

python rename_headers.py -input $GENOME -headers multiz_headers.tmp -output $PREFIX'_multiz_headers.fa'

rm *tmp

Before runing MULTIZ make sure that:

  • You have change the headers as indicated above.
  • Your reference genome has the same prefix in all the sequences (you set up this at the first step).
  • You have the original fasta files of your genomes in the same directory.
  • You have a parentetic species tree as an input for multiz, make sure that the name of the species on the tree are the same as the fasta file and header.
    • Example: One of your species is human. If in the tree you type ((human),bonobo)) your fasta genome file must be human.fasta and the headers inside that file must be >human_1
PATH=$PATH:"/home/dmorenos/software/multiz/"
TEMPDIR="temp_dir"
mkdir ${TEMPDIR}
date
echo "starting multiz"

roast + T=$TEMPDIR E=referenceprefix "(((),()"  *.*.sing.maf whole_multiz_bats.maf

References

  1. Sharma V. and Hiller M. (2017). Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Research, 45(14):8369-8377.
  2. Hecker N. and Hiller M. (2020). A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. GigaScience, 9(1).
tags: Genome analysis , Diana