# Alignment exercise using a transformed HAL file for DESCHRAMBLER.
###### Author: Manuel Hoyos mhoyosro@ttu.edu ######

- [ ] Now, just to keep things in perspective, I will outline our immediate plan/objectives:

1. We will transform a HAL file into chains to feed into DESCHRAMBLER, because the process may be faster this way.
2. You will kindly help me align pKuh.softmasked.fa and rFer.softmasked.fa using your traditional parallelized process, to ensure that everything matches in the end.

**The relevant files to achieve our objectives are available at this link:**
https://drive.google.com/drive/folders/1GGLArs1LDVYmtKVSq_RlI9DbL5LKPxLj?usp=sharing

Next, I'm going to bore you with the procedure I have followed to transform a HAL file into chains (I know it's long and maybe unnecessary, but I think it completes the message).

### 1. The first step is to describe the species and the tree I am using.

As I am interested in phyllostomid bats, I performed an alignment with just five species:
rFer.softmasked.fa, rMic.softmasked.fa, aJam.softmasked.fa, tBra.softmasked.fa, pKuh.softmasked.fa

This is the tree:

![](https://hackmd.io/_uploads/H1ngZ6oS3.jpg)

Each node has a label that starts with "manAnc", which stands for "manuel Ancestor". I wrote that to distinguish it from another tree I was working on at the time, where I didn't include any ancestors. As you can see, there are 19 ancestors.
I'm telling you all this to make it very clear how the Cactus alignment program presents the output of the entire alignment process, which looks like this:

```
2683837899 Dec 28 20:31 manAnc16.cigar
3881846925 Dec 28 20:31 manAnc16.cigar.secondary
2060541479 Dec 28 20:31 manAnc16.cigar.og_fragment_0
982793920 Dec 28 20:31 manAnc16.cigar.og_fragment_1
334937139 Dec 28 20:31 manAnc16.cigar.og_fragment_2
661554506 Dec 28 20:31 manAnc16.cigar.ig_coverage_0
897735910 Dec 28 20:31 manAnc16.cigar.ig_coverage_1
3923833430 Dec 29 05:40 manAnc16.hal
1788551383 Dec 29 05:40 manAnc16.fa
2627281668 Dec 29 20:42 manAnc7.cigar
4121715727 Dec 29 20:43 manAnc7.cigar.secondary
2066631165 Dec 29 20:43 manAnc7.cigar.og_fragment_0
233070744 Dec 29 20:43 manAnc7.cigar.og_fragment_2
982218008 Dec 29 20:43 manAnc7.cigar.og_fragment_1
983275386 Dec 29 20:43 manAnc7.cigar.ig_coverage_0
1021480342 Dec 29 20:43 manAnc7.cigar.ig_coverage_1
2075106341 Jan 3 17:05 rFer.softmasked.fa
2564589532 Jan 3 17:06 rMic.softmasked.fa
2224126925 Jan 3 17:06 aJam.softmasked.fa
1014 Jan 3 17:06 fivegenomes.txt
2283176224 Jan 3 20:31 tBra.softmasked.fa
1775724452 Jan 3 20:31 pKuh.softmasked.fa
4372705968 Jan 4 19:31 manAnc7.hal
1931873984 Jan 4 19:32 manAnc7.fa
1782200946 Jan 5 02:12 manAnc17.cigar
2952881705 Jan 5 02:12 manAnc17.cigar.secondary
1871910340 Jan 5 02:12 manAnc17.cigar.og_fragment_0
572994967 Jan 5 02:12 manAnc17.cigar.ig_coverage_0
455145950 Jan 5 02:12 manAnc17.cigar.ig_coverage_1
3570997337 Jan 5 08:45 manAnc17.hal
1816394037 Jan 5 08:46 manAnc17.fa
630186533 Jan 5 11:33 manAnc19.cigar
742204635 Jan 5 11:33 manAnc19.cigar.secondary
1688338849 Jan 5 13:47 manAnc19.fa
3418283662 Jan 5 14:31 fivegenomes.hal
```

Of all this, the most important thing, of course, is the .hal file, which is called **fivegenomes.hal** in this case.

### 2. Now we need the .hal file to be usable as an input file for DESCHRAMBLER, which involves producing UCSC browser-style pairwise chains from the fivegenomes.hal file.
For that, we need HalTools and DoBlastzChainNet. Installing both can be a bit cumbersome, so my installation recipes for both programs are below (of course, all the paths apply to my HPCC because I am lazy today).

* My recipe for HalTools installation:

```
# Downloading HAL
cd /lustre/work/mhoyosro/software
git clone https://github.com/ComparativeGenomicsToolkit/hal.git

# Installing dependencies for HAL
cd hal
mkdir -p DIR/hdf5
cd /lustre/work/mhoyosro/software/hal/DIR/hdf5
wget http://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.1/src/hdf5-1.10.1.tar.gz
tar xzf hdf5-1.10.1.tar.gz
cd hdf5-1.10.1
./configure --enable-cxx --prefix /lustre/work/mhoyosro/software/hal/DIR/hdf5
make && make install

# Before building HAL, update the following environment variables
export PATH=/lustre/work/mhoyosro/software/hal/DIR/hdf5/bin:${PATH}
export h5prefix=-prefix=/lustre/work/mhoyosro/software/hal/DIR/hdf5

# sonLib (a compact C/Python library for sequence analysis in bioinformatics, from Benedict Paten)
# HAL and sonLib must be sibling directories
cd /lustre/work/mhoyosro/software
git clone https://github.com/ComparativeGenomicsToolkit/sonLib.git
pushd sonLib && make && popd

# Finish installation (HPCC solved this part on April 25th)
cd /lustre/work/mhoyosro/software/hal
module load gcc/10.1.0 hdf5/1.10.6
make

# Set the paths every time and everywhere the program is used (otherwise nothing will work!!)
export PATH=/lustre/work/mhoyosro/software/hal/DIR/hdf5/bin:${PATH}
export h5prefix=-prefix=/lustre/work/mhoyosro/software/hal/DIR/hdf5
export PATH=/lustre/work/mhoyosro/software/hal/bin:${PATH}
```

* My recipe for DoBlastzChainNet installation:

```
# Go to the operational directory
cd /lustre/work/mhoyosro/software/

# Prepare the subfolders
mkdir DoBlastzChainNet && cd DoBlastzChainNet
mkdir data && cd data
mkdir scripts bin && cd bin

# Download the UCSC binaries
rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ .

# I use these commands instead of the canonical ones to get the scripts.
# Clone from the DoBlastzChainNet directory so that the kent path used by cp below exists.
cd /lustre/work/mhoyosro/software/DoBlastzChainNet
git clone git://genome-source.soe.ucsc.edu/kent.git
cp -r /lustre/work/mhoyosro/software/DoBlastzChainNet/kent/src/hg/utils/automation/* /lustre/work/mhoyosro/software/DoBlastzChainNet/data/scripts/

# PATH setup
# The two directories data/bin and data/scripts are added to the shell PATH environment.
# Originally the instructions from https://genomewiki.ucsc.edu/index.php?title=DoBlastzChainNet.pl said this:
#   echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc
# That would append the export to ~/.bashrc so it runs at every login; I prefer to set the paths explicitly:
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/bin:${PATH}
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/scripts:${PATH}
```

The important thing about all this is that, to convert HAL to chains, I need to set up these paths:

```
# HalTools path
export PATH=/lustre/work/mhoyosro/software/hal/DIR/hdf5/bin:${PATH}
export h5prefix=-prefix=/lustre/work/mhoyosro/software/hal/DIR/hdf5
export PATH=/lustre/work/mhoyosro/software/hal/bin:${PATH}

# DoBlastzChainNet path
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/bin:${PATH}
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/scripts:${PATH}
```

### 3. In this step, I will show you what I have done to transform that HAL into chains.
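
Since nothing works unless the paths from step 2 are set, a quick check that the tools are actually reachable can save a failed run. This is only a minimal sketch: `require_tools` is a hypothetical helper name, not part of HAL or the kent utilities, and the tool list in the usage comment is simply the set of commands used in this step.

```shell
# Hypothetical helper (not part of HAL or the kent tools):
# report every given command that cannot be found on the current PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Return 0 only if every tool was found
    return "$missing"
}

# Usage, after running the export lines from step 2:
#   require_tools halStats hal2fasta halLiftover faToTwoBit pslPosTarget axtChain
```

If the function prints any `missing:` line, the corresponding export was probably skipped in the current shell.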
The conversion is based on this page: https://github.com/ComparativeGenomicsToolkit/hal/blob/chaining-doc/doc/chaining-mapping.md

```
# Go to the location of the file fivegenomes.hal
cd /lustre/work/mhoyosro/software/cactus/steps-output

# Set the paths for HalTools and DoBlastzChainNet
export PATH=/lustre/work/mhoyosro/software/hal/DIR/hdf5/bin:${PATH}
export h5prefix=-prefix=/lustre/work/mhoyosro/software/hal/DIR/hdf5
export PATH=/lustre/work/mhoyosro/software/hal/bin:${PATH}
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/bin:${PATH}
export PATH=/lustre/work/mhoyosro/software/DoBlastzChainNet/data/scripts:${PATH}

# Print the list of genomes in the alignment
halStats --genomes fivegenomes.hal

# Convert the genomes of interest to FASTA and then to 2bit
hal2fasta fivegenomes.hal pKuh.softmasked.fa | faToTwoBit stdin pKuh.softmasked.2bit
hal2fasta fivegenomes.hal rFer.softmasked.fa | faToTwoBit stdin rFer.softmasked.2bit

# So far, this is similar to the regular DoBlastzChainNet procedure.
# Next, the source genome sequences are obtained in BED format.
# Importantly, here the target is the source genome!!
# So for this exercise the TARGET/SOURCE is pKuh.softmasked.fa
# --bedSequences prints the sequences of the given genome in BED format
halStats --bedSequences pKuh.softmasked.fa fivegenomes.hal > pKuh.bed

# When we inspect the resulting file (pKuh.bed), we see that the only thing
# we have done so far is define the scaffolds.
cat pKuh.bed

# Next, following the instructions on the page mentioned above,
# the .bed file is used by halLiftover to create pairwise alignments.
halLiftover --outPSL fivegenomes.hal pKuh.softmasked.fa \
    pKuh.bed rFer.softmasked.fa /dev/stdout | \
    pslPosTarget stdin pKuh_to_rFer.psl

# The resulting PSL files are then concatenated together.
# These alignments are then chained using the UCSC Browser axtChain program:
axtChain -psl -linearGap=loose pKuh_to_rFer.psl rFer.softmasked.2bit pKuh.softmasked.2bit pKuh_to_rFer.chain
```

Well, I think that's it. Today I'm going to focus on trying DESCHRAMBLER with the file I obtained, because I haven't tried it yet. So if I run into any issues, I'll definitely bother you. I'm embarrassed to send such a long letter, but it gives me peace of mind to present a complete idea. Once again, thank you very much for your help.
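
P.S. One possible sanity check before handing pKuh_to_rFer.chain to DESCHRAMBLER, as a minimal sketch only. It assumes the standard UCSC chain layout, where each chain starts with a header line beginning with `chain` and is followed by block lines whose first column is an ungapped block size. `chain_summary` is a hypothetical helper name, not a kent or HAL tool.

```shell
# Hypothetical helper: count the chains in a UCSC chain file and
# sum the ungapped aligned bases (first column of every block line).
chain_summary() {
    awk '
        $1 == "chain" { n++; next }      # header line: one per chain
        NF > 0        { bases += $1 }    # block line: size [dt dq]
        END { printf "chains=%d aligned_bases=%.0f\n", n, bases }
    ' "$1"
}

# Usage:
#   chain_summary pKuh_to_rFer.chain
```

An empty file or a suspiciously small base count would suggest that something went wrong upstream, for example in the halLiftover step.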