# Multiple Genomes Alignment ###### Author: Diana Moreno Santillan dmorenos@ttu.edu ###### Date: August 2020 This pipeline performs multiple genome alignments and consists in three main steps: 1. Perform pairwise alignments with [lastz](http://www.bx.psu.edu/~rsharris/lastz/). 2. Alignment sensitivity and specificity improvement with Michael Hiller's [Genome Alignment Tools](https://github.com/hillerlab/GenomeAlignmentTools) chain/net pipeline. 3. Multiple genome alignment with [multiz-TBA](https://www.bx.psu.edu/miller_lab/) tool. ## Table of Contents [TOC] ### Software required - [seqkit](https://bioinf.shenwei.me/seqkit/) - [faToTwoBit](http://hgdownload.soe.ucsc.edu/admin/exe/) - [lastz](http://www.bx.psu.edu/~rsharris/lastz/) - [GenomeAlignmentTools](https://github.com/hillerlab/GenomeAlignmentTools) - [TBA/multiz](https://www.bx.psu.edu/miller_lab/) ### Input files #### Query files: 1. Make sure your genomes are softmasked. 2. Edit your genomes headers for a non generic and meaningful header. *For example the headers from human genome hg38 are >chr_1, >chr_2, etc. Change the headers for >hsap_1, >hsap_2 etc.* ``` sed -i 's/chr/hsap/g' hg38.fa ``` 3. Convert your fasta file to 2bit format ``` faToTwoBit query.fa query.2bit ``` #### Reference file: Lastz align a query genome against a reference genome. 1. Make sure your reference genome is softmasked. 2. Edit your genome headers for a non generic and meaningful header. 3. Convert your fasta file to 2bit format ``` faToTwoBit reference.fa reference.2bit ``` :::warning :pencil: **Note:** For vertebrate genomes, lastz run will take from 1 to 2 weeks. Because the time-wall for the jobs in nocona is 48 hours, we can split the reference genome in multiple chunks and run lastz in each chunk. ::: ##### Split the reference genome Depend on the size of the reference genome, set up a number of sequences in each chunk. In this case I'll split it to 100 sequences. ``` seqkit split2 -s 100 reference.fa cd reference.fa.split ``` Convert each chunk to 2bit format. ``` for f in *fa; do OUT=$(basename $f | sed 's/fa/2bit/g'); faToTwoBit $f $OUT; done ``` ### Pairwise alignment: ##### Standard run Lastz can be run with a simple command: ``` lastz reference.2bit'[multiple]' query.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > output_alignment.axt ``` :::warning :pencil: **Note:** <span style="color:black"> '[multiple]' is a flag that we need to put if the reference has more than one sequence, otherwise lastz will assume that the reference genome is a single huge sequence.</span> ::: :::info :bulb: <span style="color:blue">**Important**</span> To align placental mammals I set the following parameters[^1^](#References) K=2400, L=3000, Y=3400 and H=2000 For closely related placental mammals I set the following parameters[^2^](#References) K=4500 and L=4500 ::: ##### Lastz for big data If the reference genome was splitted in multiple chunks, lastz can be run simultaneously with less physical memory, when mapping a large set of different query genomes against the same reference sequence. For this, we need to create a [target capsule file](http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#adv_capsule) and run each query against the capsule file: ```[bash] #!/bin/bash #SBATCH --job-name lastz_1 #SBATCH --output=lastz.log #SBATCH -p extended-28core #SBATCH --nodes=2 #SBATCH --ntasks-per-node=28 #SBATCH -t 07-00:00:00 #Usage sbatch lastz_subjob.sh <refseq.2bit> REFSEQ=$1 REFX=$(basename $REFSEQ | 's/\.2bit//g') CAPSULE=${REFX}_capsule_file export PATH="/gpfs/home/dmorenosanti/lastz-distrib/bin/:$PATH" #Check if capsule file exists, if not, create it. if [ ! -f ${CAPSULE} ]; then lastz ${REFSEQ} --writecapsule=$CAPSULE else echo "capsule file already exists" fi echo "Running lastz for pairwise alignments against reference genome, ${date}" lastz --targetcapsule=${CAPSULE} query1.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query1_vs_${REFX}_pw_alignment.axt & lastz1=$! lastz --targetcapsule=${CAPSULE} query2.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query2_vs_${REFX}_pw_alignment.axt & lastz2=$! lastz --targetcapsule=${CAPSULE} query3.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query3_vs_${REFX}_pw_alignment.axt & lastz3=$! lastz --targetcapsule=${CAPSULE} query4.2bit --format=axt K=2400 L=3000 Y=3400 H=2000 > query4_vs_${REFX}_pw_alignment.axt & lastz4=$! #wait for the termination of all runs wait $lastz1 $lastz2 $lastz3 $lastz4 echo "lastz run has finished for all the genomes" ``` ### Chaining The following pipeline was written following [GenomeAlignmentTools](https://github.com/hillerlab/GenomeAlignmentTools) documentation. ##### 1. Run axtChain Chain together axt alignments ```bash= AXTOUT=alignment.axt REFERENCE=reference_genome.fa QUERY=query_genome.fa CHAINOUT=chain_output.chain axtChain -linearGap=medium ${AXTOUT} -faT ${REFERENCE} -faQ ${QUERY} ${CHAINOUT} ``` ##### 2. Run RepeatFiller To incorporate repetitive regions into pairwise alignment chains, to improve the annotation of conserved non-coding regions[^3^](#References). ```bash= CHAINOUT=chain_output.chain REFERENCE=reference.2bit GENOME=query_genome.2bit RFOUT=output_refill.chain python RepeatFiller.py -c ${CHAINOUT} -T2 ${REFERENCE} -Q2 ${GENOME} -o ${RFOUT} ``` ##### 3. chainCleaner ```bash= CHAINOUT=chain_output.chain REFERENCE=reference.fa GENOME=query_genome.fa REF2=reference.2bit GEN2=query_genome.2bit RFOUT=output_refill.chain REFX=$(basename ${REFSEQ} | awk -F '.' '{print $1}') QUEX=$(basename ${GENOME} | awk -F '_' '{print $1}') #Get genome size faSize -detailed ${GENOME} > ${QUEX}.size faSize -detailed ${REFERENCE} > ${REFX}.size #Run chainCleanner chainCleaner ${RFOUT} ${REF2} ${GEN2} ${QUEX}_clean.chain ${QUEX}_clean.bed -tSizes=${REFX}.size -qSizes=${QUEX}.size -linearGap=medium ``` ##### 4. Nets ```bash= chainNet output_clean.chain reference_genome.sizes query_genome.sizes reference.net query.net -rescore -tNibDir=reference_genome.2bit -qNibDir=query_genome.2bit -linearGap=medium NetFilterNonNested.perl -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 ref.query.net.gz > ref.query.filtered.net ``` ##### 5. Convert nets to maf alignment ```bash= netToAxt ref.query.filtered.net output_clean.chain reference.2bit query.2bit output_alignment.axt axtSort output_alignment.axt output_alignment_sort.axt axtToMaf output_alignment_sort.axt reference.sizes query.sizes final_alignment.maf ``` ### Multiz Multiz/TBA needs a specific headers for the genome file. The FASTA header format should be as follows: string1:string2:int1:char:int2 For more information check the [TBA documentation](https://www.bx.psu.edu/miller_lab/dist/tba_howto.pdf). The headers can be obtained with the multiz_headers.sh script ``` sh multiz_headers.sh genome_file.fa ``` ``` #!/bin/bash #SBATCH -J multiz #SBATCH -o %x.o%j #SBATCH -e %x.e%j #SBATCH -p nocona #SBATCH -N 1 #SBATCH -n 1 #SBATCH -t 48:00:00 GETHEADER=/home/dmorenos/software/multiz/get_standard_headers RENAME=/home/dmorenos/scripts/rename_headers.py #1. Modify genome headers GENOME=$1 PREFIX=$(basename $GENOME | awk -F'[_]' '{print $1}') get_standard_headers $GENOME > standard_headers.tmp sed -e 's/\(1-\).*\(:+:\)/\1\2/' standard_headers.tmp | sed ':a;N;$!ba;s/==>\n/:/g;s/ :1-/:1/g' > multiz_headers.tmp python rename_headers.py -input $GENOME -headers multiz_headers.tmp -output $PREFIX'_multiz_headers.fa' rm *tmp ``` Before runing MULTIZ make sure that: - You have change the headers as indicated above. - Your reference genome has the same prefix in all the sequences (you set up this at the [first step](https://hackmd.io/wucGwX0ZTOOFb3W1-R0mYw#Reference-file)). - You have the original fasta files of your genomes in the same directory. - You have a parentetic species tree as an input for multiz, make sure that the name of the species on the tree are the same as the fasta file and header. - Example: One of your species is human. If in the tree you type ((human),bonobo)) your fasta genome file must be human.fasta and the headers inside that file must be >human_1 ``` PATH=$PATH:"/home/dmorenos/software/multiz/" TEMPDIR="temp_dir" mkdir ${TEMPDIR} date echo "starting multiz" roast + T=$TEMPDIR E=referenceprefix "(((),()" *.*.sing.maf whole_multiz_bats.maf ``` ### References 1. Sharma V. and Hiller M. (2017). Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Research, 45(14):8369-8377. 2. Hecker N. and Hiller M. (2020). A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. GigaScience, 9(1). ###### tags: `Genome analysis` , `Diana`