# Create a new version of the HLA reference for Hisat2
###### tags: `c4lab`

Hisat2

Kim, D., Paggi, J.M., Park, C. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019).

https://daehwankimlab.github.io/hisat2/
https://daehwankimlab.github.io/hisat-genotype/
https://github.com/DaehwanKimLab/hisat2
https://github.com/DaehwanKimLab/hisat-genotype

## Step0
### Setup hisat-genotype
Download the latest version of hisat-genotype:
``` bash
git clone https://github.com/DaehwanKimLab/hisat-genotype.git ./hisatgenotype
echo '{"sanity_check": false}' > hisatgenotype/devel/settings.json
export PATH=$PWD/hisatgenotype:$PATH
export PYTHONPATH=$PWD/hisatgenotype/hisatgenotype_modules:$PYTHONPATH
```

### Run HLA typing with default index
Make sure `hisat2` version `2.2.1` is installed; the command below should then work:
``` bash
hisatgenotype --base hla \
              --threads 30 \
              --keep-alignment -v --keep-extract \
              -z ${hisat_index} \
              -1 ${sample_name}.R1.fq.gz \
              -2 ${sample_name}.R2.fq.gz \
              --out-dir ./tmp_hisat
```
Here `hisat_index` is `hisatgenotype/indicies/` by default.

## Step1: Prepare hisat2 base index
You can either copy the base index from `hisatgenotype/indicies/` or download it yourself.

* Copy
``` bash
cp -r hisatgenotype/indicies/ hisat_index
rm -rf hisat_index/hla* hisat_index/hisatgenotype_db
```

* Download genome
``` bash
mkdir hisat_index
cd hisat_index
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/genotype_genome_20180128.tar.gz
tar xvzf genotype_genome_20180128.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
tar xvzf grch38.tar.gz
rm grch38.tar.gz
hisat2-inspect grch38/genome > genome.fa
samtools faidx genome.fa
cd ..
```

## Step2: Prepare Hisat2 HLA data
### Latest version
You can download the latest release from the IMGT/HLA FTP site http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/ (version 3.45.0 as of 2021/08/12).
``` bash
mkdir -p hisat_index/hisatgenotype_db/HLA/
cd hisat_index/hisatgenotype_db/HLA/
wget -nd -mL ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/* -P fasta
wget -nd -mL ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/msf/* -P msf
wget ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla.dat
cd ../../..
```

### Previous version
Older versions are available as branches of the GitHub repo https://github.com/ANHIG/IMGTHLA.
In this example, I want to use `3.42.0`, so I choose the `3420` branch (https://github.com/ANHIG/IMGTHLA/tree/3420).
``` bash
mkdir -p hisat_index/hisatgenotype_db/
cd hisat_index/hisatgenotype_db/
git clone https://github.com/ANHIG/IMGTHLA HLA
cd HLA
git checkout 3420  # change this to the branch of the version you want
git lfs install
git lfs pull
cd ../../..
```

If `git lfs` is not found, install it from https://git-lfs.github.com/

## Step3: Build Index and run
Hisatgenotype builds the index automatically as long as there are no `hla.graph*` files in `hisat_index`.
``` bash
hisat_index="./hisat_index"
hisatgenotype --base hla \
              --threads 30 \
              --keep-alignment -v --keep-extract \
              -z ${hisat_index} \
              -1 ${sample_name}.R1.fq.gz \
              -2 ${sample_name}.R2.fq.gz \
              --out-dir ./tmp_hisat
```

### Index building output
``` txt
No hisat_index/hla_backbone.fa file found
Building hla Database
HLA-A's reference allele is A*03:01:01:01 on '+' strand of chromosome 6
HLA-B's reference allele is B*07:02:01:01 on '-' strand of chromosome 6
HLA-C's reference allele is C*07:02:01:03 on '-' strand of chromosome 6
...
U exon counts: {0: 5}
V exon counts: {0: 3, 1: 3, 2: 3}
W exon counts: {0: 11, 1: 11, 2: 11, 3: 11, 4: 11, 5: 11}
A: number of alleles is 6291.
Number of variants is 3012.
Length of additional sequences for haplotypes: 10718064
B: number of alleles is 7561.
Number of variants is 2973.
Length of additional sequences for haplotypes: 18906000
...
W: number of alleles is 11.
Number of variants is 84.
Length of additional sequences for haplotypes: 32571
Running Extraction for : hla
No hisat_index/hla.graph.1.ht2 file found
Running: hisat2-build -p 30 --snp hisat_index/hla.index.snp --haplotype hisat_index/hla.haplotype hisat_index/hla_backbone.fa hisat_index/hla.graph
```

Once the index has been built successfully, later `hisatgenotype` runs reuse the prebuilt HLA index and are much quicker.

### Directory
```
$ ls -alh hisat_index
total 19G
drwxrwxr-x.  6 linnil1 linnil1 4.0K Aug 14 18:11 .
drwxrwxr-x. 10 linnil1 linnil1 4.0K Aug 16 15:25 ..
-rw-rw-r--.  1 linnil1 linnil1 3.0G Aug 14 14:30 genome.fa
-rw-rw-r--.  1 linnil1 linnil1 6.3K Aug 14 14:31 genome.fa.fai
-rw-r--r--.  1 linnil1 linnil1 2.0G Jan 29  2018 genotype_genome.1.ht2
-rw-r--r--.  1 linnil1 linnil1 794M Jan 29  2018 genotype_genome.2.ht2
-rw-r--r--.  1 linnil1 linnil1  12K Jan 29  2018 genotype_genome.3.ht2
-rw-r--r--.  1 linnil1 linnil1 703M Jan 29  2018 genotype_genome.4.ht2
-rw-r--r--.  1 linnil1 linnil1 2.1G Jan 29  2018 genotype_genome.5.ht2
-rw-r--r--.  1 linnil1 linnil1 751M Jan 29  2018 genotype_genome.6.ht2
-rw-r--r--.  1 linnil1 linnil1 488M Jan 29  2018 genotype_genome.7.ht2
-rw-r--r--.  1 linnil1 linnil1 152M Jan 29  2018 genotype_genome.8.ht2
-rw-r--r--.  1 linnil1 linnil1 225K Jan 29  2018 genotype_genome.allele
-rw-r--r--.  1 linnil1 linnil1    0 Jan 29  2018 genotype_genome.clnsig
-rw-r--r--.  1 linnil1 linnil1 5.6K Jan 29  2018 genotype_genome.coord
-rw-r--r--.  1 linnil1 linnil1 3.0G Jan 29  2018 genotype_genome.fa
-rw-r--r--.  1 linnil1 linnil1 6.3K Jan 29  2018 genotype_genome.fa.fai
-rw-r--r--.  1 linnil1 linnil1 555M Jan 29  2018 genotype_genome.haplotype
-rw-r--r--.  1 linnil1 linnil1 441M Jan 29  2018 genotype_genome.index.snp
-rw-r--r--.  1 linnil1 linnil1 4.1M Jan 29  2018 genotype_genome.link
-rw-r--r--.  1 linnil1 linnil1 4.7K Jan 29  2018 genotype_genome.locus
-rw-r--r--.  1 linnil1 linnil1 189K Jan 29  2018 genotype_genome.partial
-rw-r--r--.  1 linnil1 linnil1 441M Jan 29  2018 genotype_genome.snp
drwxr-xr-x.  2 linnil1 linnil1 4.0K Mar 17  2016 grch38
-rw-rw-r--.  1 linnil1 linnil1 4.0G Aug 14 14:27 grch38.tar.gz
drwxr-xr-x.  3 linnil1 linnil1   16 Aug 14 16:24 hisatgenotype_db
-rw-rw-r--.  1 linnil1 linnil1 279K Aug 14 18:07 hla.allele
-rw-rw-r--.  1 linnil1 linnil1 211K Aug 14 18:07 hla_backbone.fa
drwxrwxr-x.  4 linnil1 linnil1   55 Aug 14 14:02 HLA_backup
-rw-rw-r--.  1 linnil1 linnil1  39M Aug 14 18:10 hla.graph.1.ht2
-rw-rw-r--.  1 linnil1 linnil1  15M Aug 14 18:10 hla.graph.2.ht2
-rw-rw-r--.  1 linnil1 linnil1  314 Aug 14 18:07 hla.graph.3.ht2
-rw-rw-r--.  1 linnil1 linnil1  52K Aug 14 18:07 hla.graph.4.ht2
-rw-rw-r--.  1 linnil1 linnil1 598K Aug 14 18:14 hla.graph.5.ht2
-rw-rw-r--.  1 linnil1 linnil1 140K Aug 14 18:14 hla.graph.6.ht2
-rw-rw-r--.  1 linnil1 linnil1 3.5M Aug 14 18:07 hla.graph.7.ht2
-rw-rw-r--.  1 linnil1 linnil1  98K Aug 14 18:07 hla.graph.8.ht2
-rw-rw-r--.  1 linnil1 linnil1 208M Aug 14 18:14 hla.graph.rf
-rw-rw-r--.  1 linnil1 linnil1 6.2M Aug 14 18:07 hla.haplotype
-rw-rw-r--.  1 linnil1 linnil1 461K Aug 14 18:07 hla.index.snp
-rw-rw-r--.  1 linnil1 linnil1  14M Aug 14 18:07 hla.link
-rw-rw-r--.  1 linnil1 linnil1 3.3K Aug 14 18:07 hla.locus
-rw-rw-r--.  1 linnil1 linnil1 139K Aug 14 18:07 hla.partial
-rw-rw-r--.  1 linnil1 linnil1 153M Aug 14 18:07 hla_sequences.fa
-rw-rw-r--.  1 linnil1 linnil1 720K Aug 14 18:07 hla.snp
-rw-rw-r--.  1 linnil1 linnil1 267K Aug 14 18:07 hla.snp.freq
-rw-rw-r--.  1 linnil1 linnil1   68 Aug 14 17:06 hla.version
```

### Bugs
If you encounter bugs, re-check the previous steps, especially this one:
``` bash
echo '{"sanity_check": false}' > hisatgenotype/devel/settings.json
```
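
When troubleshooting, it can help to confirm two things before re-running Step3: whether `hla.graph*` files already exist in the index directory (in which case the index is not rebuilt), and whether `settings.json` really disables the sanity check. The sketch below is one possible way to check both. The function names `hla_index_status` and `sanity_check_disabled` are hypothetical helpers written for this note, not part of hisat-genotype, and the paths assume the layout used in the steps above.

```bash
# Hypothetical troubleshooting helpers; not part of hisat-genotype.

# Print "prebuilt" if any hla.graph.*.ht2 file exists in the index directory,
# otherwise "missing" (per Step3, hisatgenotype should then build the index).
hla_index_status() {
    local index_dir="$1" f
    for f in "$index_dir"/hla.graph.*.ht2; do
        # An unmatched glob stays literal, so -e is false and we fall through.
        if [ -e "$f" ]; then
            echo "prebuilt"
            return 0
        fi
    done
    echo "missing"
}

# Succeed only if the settings file exists and sets "sanity_check" to false.
sanity_check_disabled() {
    local settings="$1"
    [ -f "$settings" ] &&
        grep -Eq '"sanity_check"[[:space:]]*:[[:space:]]*false' "$settings"
}

# Paths below assume the directory layout from Step0 and Step1.
hla_index_status "${hisat_index:-./hisat_index}"
if sanity_check_disabled hisatgenotype/devel/settings.json; then
    echo "sanity_check is disabled"
else
    echo "sanity_check is not disabled (or settings.json is missing)"
fi
```

If the first line printed is `prebuilt` but you changed the HLA database in Step2, the old `hla.graph*` files are probably still being reused.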