Multiple Genomes Alignment

Author: Diana Moreno Santillan dmorenos@ttu.edu

Date: August 2020

This pipeline performs multiple genome alignments and consists in three main steps:

Perform pairwise alignments with lastz.
Alignment sensitivity and specificity improvement with Michael Hiller's Genome Alignment Tools chain/net pipeline.
Multiple genome alignment with multiz-TBA tool.

Multiple Genomes Alignment
- Table of Contents

Software required

Input files

Query files:

Make sure your genomes are softmasked.
Edit your genomes headers for a non generic and meaningful header. For example the headers from human genome hg38 are >chr_1, >chr_2, etc. Change the headers for >hsap_1, >hsap_2 etc.

sed -i 's/chr/hsap/g' hg38.fa

Convert your fasta file to 2bit format

faToTwoBit query.fa query.2bit

Reference file:

Lastz align a query genome against a reference genome.

Make sure your reference genome is softmasked.
Edit your genome headers for a non generic and meaningful header.
Convert your fasta file to 2bit format

faToTwoBit reference.fa reference.2bit

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Note: For vertebrate genomes, lastz run will take from 1 to 2 weeks. Because the time-wall for the jobs in nocona is 48 hours, we can split the reference genome in multiple chunks and run lastz in each chunk.

Split the reference genome

Depend on the size of the reference genome, set up a number of sequences in each chunk. In this case I'll split it to 100 sequences.

seqkit split2 -s 100 reference.fa

cd reference.fa.split

Convert each chunk to 2bit format.

for f in *fa; do OUT=$(basename $f | sed 's/fa/2bit/g'); faToTwoBit $f $OUT; done

Pairwise alignment:

Standard run

Lastz can be run with a simple command:

lastz reference.2bit'[multiple]' query.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > output_alignment.axt

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Note: '[multiple]' is a flag that we need to put if the reference has more than one sequence, otherwise lastz will assume that the reference genome is a single huge sequence.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Important
To align placental mammals I set the following parameters¹
K=2400, L=3000, Y=3400 and H=2000

For closely related placental mammals I set the following parameters²
K=4500 and L=4500

Lastz for big data

If the reference genome was splitted in multiple chunks, lastz can be run simultaneously with less physical memory, when mapping a large set of different query genomes against the same reference sequence.

For this, we need to create a target capsule file and run each query against the capsule file:

#!/bin/bash
#SBATCH --job-name lastz_1
#SBATCH --output=lastz.log
#SBATCH -p extended-28core
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH -t 07-00:00:00

#Usage sbatch lastz_subjob.sh <refseq.2bit>


REFSEQ=$1
REFX=$(basename $REFSEQ | 's/\.2bit//g') 
CAPSULE=${REFX}_capsule_file

export PATH="/gpfs/home/dmorenosanti/lastz-distrib/bin/:$PATH"

#Check if capsule file exists, if not, create it.

if [ ! -f ${CAPSULE} ]; then
    lastz ${REFSEQ} --writecapsule=$CAPSULE
else
    echo "capsule file already exists"
fi


echo "Running lastz for pairwise alignments against reference genome, ${date}"

lastz --targetcapsule=${CAPSULE} query1.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query1_vs_${REFX}_pw_alignment.axt &
lastz1=$!
lastz --targetcapsule=${CAPSULE} query2.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query2_vs_${REFX}_pw_alignment.axt &
lastz2=$!
lastz --targetcapsule=${CAPSULE} query3.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query3_vs_${REFX}_pw_alignment.axt &
lastz3=$!
lastz --targetcapsule=${CAPSULE} query4.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query4_vs_${REFX}_pw_alignment.axt &
lastz4=$!

#wait for the termination of all runs

wait $lastz1 $lastz2 $lastz3 $lastz4 

echo "lastz run has finished for all the genomes"

Chaining

The following pipeline was written following GenomeAlignmentTools documentation.

1. Run axtChain

Chain together axt alignments






AXTOUT=alignment.axt
REFERENCE=reference_genome.fa
QUERY=query_genome.fa
CHAINOUT=chain_output.chain

axtChain -linearGap=medium ${AXTOUT} -faT ${REFERENCE} -faQ ${QUERY} ${CHAINOUT}

2. Run RepeatFiller

To incorporate repetitive regions into pairwise alignment chains, to improve the annotation of conserved non-coding regions³.







CHAINOUT=chain_output.chain
REFERENCE=reference.2bit
GENOME=query_genome.2bit
RFOUT=output_refill.chain

python RepeatFiller.py -c ${CHAINOUT} -T2 ${REFERENCE} -Q2 ${GENOME} -o ${RFOUT}

3. chainCleaner

















CHAINOUT=chain_output.chain
REFERENCE=reference.fa
GENOME=query_genome.fa
REF2=reference.2bit
GEN2=query_genome.2bit
RFOUT=output_refill.chain
REFX=$(basename ${REFSEQ} | awk -F '.' '{print $1}')
QUEX=$(basename ${GENOME} | awk -F '_' '{print $1}')

#Get genome size

faSize -detailed ${GENOME} > ${QUEX}.size
faSize -detailed ${REFERENCE} > ${REFX}.size

#Run chainCleanner
chainCleaner ${RFOUT} ${REF2} ${GEN2} ${QUEX}_clean.chain ${QUEX}_clean.bed -tSizes=${REFX}.size -qSizes=${QUEX}.size -linearGap=medium

4. Nets



chainNet output_clean.chain reference_genome.sizes query_genome.sizes reference.net query.net -rescore -tNibDir=reference_genome.2bit -qNibDir=query_genome.2bit -linearGap=medium

NetFilterNonNested.perl -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 ref.query.net.gz > ref.query.filtered.net

5. Convert nets to maf alignment






netToAxt ref.query.filtered.net output_clean.chain reference.2bit query.2bit output_alignment.axt 

axtSort output_alignment.axt output_alignment_sort.axt 


axtToMaf output_alignment_sort.axt reference.sizes query.sizes final_alignment.maf

Multiz

Multiz/TBA needs a specific headers for the genome file. The FASTA header format should be as follows:

string1:string2:int1:char:int2

For more information check the TBA documentation.

The headers can be obtained with the multiz_headers.sh script

sh multiz_headers.sh genome_file.fa

#!/bin/bash
#SBATCH -J multiz
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH -p nocona
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00

GETHEADER=/home/dmorenos/software/multiz/get_standard_headers
RENAME=/home/dmorenos/scripts/rename_headers.py

#1. Modify genome headers
GENOME=$1
PREFIX=$(basename $GENOME | awk -F'[_]' '{print $1}')

get_standard_headers $GENOME > standard_headers.tmp

sed -e 's/\(1-\).*\(:+:\)/\1\2/' standard_headers.tmp | sed ':a;N;$!ba;s/==>\n/:/g;s/ :1-/:1/g' > multiz_headers.tmp

python rename_headers.py -input $GENOME -headers multiz_headers.tmp -output $PREFIX'_multiz_headers.fa'

rm *tmp

Before runing MULTIZ make sure that:

You have change the headers as indicated above.
Your reference genome has the same prefix in all the sequences (you set up this at the first step).
You have the original fasta files of your genomes in the same directory.
You have a parentetic species tree as an input for multiz, make sure that the name of the species on the tree are the same as the fasta file and header.
- Example: One of your species is human. If in the tree you type ((human),bonobo)) your fasta genome file must be human.fasta and the headers inside that file must be >human_1

PATH=$PATH:"/home/dmorenos/software/multiz/"
TEMPDIR="temp_dir"
mkdir ${TEMPDIR}
date
echo "starting multiz"

roast + T=$TEMPDIR E=referenceprefix "(((),()"  *.*.sing.maf whole_multiz_bats.maf

References

Sharma V. and Hiller M. (2017). Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Research, 45(14):8369-8377.
Hecker N. and Hiller M. (2020). A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. GigaScience, 9(1).

Multiple Genomes Alignment

Author: Diana Moreno Santillan dmorenos@ttu.edu

Date: August 2020

Table of Contents

Software required

Input files

Query files:

Reference file:

Split the reference genome

Pairwise alignment:

Standard run

Lastz for big data

Chaining

1. Run axtChain

2. Run RepeatFiller

3. chainCleaner

4. Nets

5. Convert nets to maf alignment

Multiz

References

tags: Genome analysis , Diana

Read more

TE curation analysis, post RAM

**Genome assembly: Flye software**

Bat1k de novo TE curation

README

tags: `Genome analysis` , `Diana`

Genome assembly: Flye software