# Create a new version of the HLA reference for HISAT2
###### tags: `c4lab`
HISAT2 and HISAT-genotype references:
Kim, D., Paggi, J.M., Park, C. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019).
* https://daehwankimlab.github.io/hisat2/
* https://daehwankimlab.github.io/hisat-genotype/
* https://github.com/DaehwanKimLab/hisat2
* https://github.com/DaehwanKimLab/hisat-genotype
## Step0
### Set up hisat-genotype
Download the latest version of hisat-genotype, disable its sanity check, and add it to `PATH` and `PYTHONPATH`:
``` bash
git clone https://github.com/DaehwanKimLab/hisat-genotype.git ./hisatgenotype
echo '{"sanity_check": false}' > hisatgenotype/devel/settings.json
export PATH=$PWD/hisatgenotype:$PATH
export PYTHONPATH=$PWD/hisatgenotype/hisatgenotype_modules:$PYTHONPATH
```
### Run HLA typing with default index
Make sure `hisat2` version `2.2.1` is installed; the command below should then work:
``` bash
hisatgenotype --base hla \
--threads 30 \
--keep-alignment -v --keep-extract \
-z ${hisat_index} \
-1 ${sample_name}.R1.fq.gz \
-2 ${sample_name}.R2.fq.gz \
--out-dir ./tmp_hisat
```
Here `hisat_index` is `hisatgenotype/indicies/` by default.
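The version requirement above can be checked before running the command. The following is a minimal sketch, assuming that `hisat2 --version` prints a line containing `version X.Y.Z`; the function names `parse_version` and `check_hisat2_version` are made up for this note.

```bash
# Extract "X.Y.Z" from the first line on stdin that contains "version X.Y.Z".
parse_version() {
    sed -nE 's/.*version ([0-9]+\.[0-9]+\.[0-9]+).*/\1/p' | head -n 1
}

# Return 0 when the hisat2 found on PATH reports the expected version.
# Assumption: the version is printed in the form "... version 2.2.1".
check_hisat2_version() {
    local expected="$1" found
    found="$(hisat2 --version 2>/dev/null | parse_version)"
    [ "$found" = "$expected" ]
}

# Usage (only meaningful where hisat2 is installed):
# check_hisat2_version 2.2.1 || echo "hisat2 2.2.1 not found on PATH" >&2
```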
## Step1: Prepare hisat2 base index
You can either copy the base index from `hisatgenotype/indicies/` or download it yourself.
* Copy
``` bash
cp -r hisatgenotype/indicies/ hisat_index
rm -rf hisat_index/hla* hisat_index/hisatgenotype_db
```
* Download genome
``` bash
mkdir hisat_index
cd hisat_index
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/genotype_genome_20180128.tar.gz
tar xvzf genotype_genome_20180128.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
tar xvzf grch38.tar.gz
rm grch38.tar.gz
hisat2-inspect grch38/genome > genome.fa
samtools faidx genome.fa
cd ..
```
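Whichever option you use, it may help to confirm that the base index is complete before going on. This is a small sketch: the expected file names are an assumption taken from the directory listing at the end of this note, and `missing_base_index_files` is a hypothetical helper.

```bash
# Print the base-index files that are missing from the given directory.
# Assumption: a complete base index has genome.fa, genome.fa.fai and
# genotype_genome.1.ht2 ... genotype_genome.8.ht2 (see the listing below).
missing_base_index_files() {
    local dir="$1" f i
    for f in genome.fa genome.fa.fai; do
        [ -e "$dir/$f" ] || echo "$f"
    done
    for i in 1 2 3 4 5 6 7 8; do
        [ -e "$dir/genotype_genome.$i.ht2" ] || echo "genotype_genome.$i.ht2"
    done
    return 0
}

# Usage: an empty output means nothing is missing.
# missing_base_index_files ./hisat_index
```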
## Step2: Prepare Hisat2 HLA data
### Latest version
You can download the latest release from the IMGT/HLA FTP site:
http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/
(version 3.45.0 as of 2021/08/12)
``` bash
mkdir -p hisat_index/hisatgenotype_db/HLA/
cd hisat_index/hisatgenotype_db/HLA/
wget -nd -mL ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/* -P fasta
wget -nd -mL ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/msf/* -P msf
wget ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla.dat
cd ../../..
```
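After the download, the database directory should contain a `fasta` folder, an `msf` folder and `hla.dat`. The sketch below checks only that layout, as implied by the `wget` commands above; it does not validate the file contents, and `check_hla_db` is a hypothetical name.

```bash
# Return 0 when the HLA database directory has the layout downloaded above.
# Assumption: only fasta/, msf/ and hla.dat are checked, nothing else.
check_hla_db() {
    local db="$1" d status=0
    for d in fasta msf; do
        if [ ! -d "$db/$d" ] || [ -z "$(ls -A "$db/$d")" ]; then
            echo "missing or empty: $db/$d" >&2
            status=1
        fi
    done
    if [ ! -s "$db/hla.dat" ]; then
        echo "missing or empty: $db/hla.dat" >&2
        status=1
    fi
    return "$status"
}

# Usage:
# check_hla_db hisat_index/hisatgenotype_db/HLA && echo "HLA data looks complete"
```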
### Previous version
You can find older versions in the GitHub repo https://github.com/ANHIG/IMGTHLA by browsing its branches.
In this example, I want to use `3.42.0`, so I choose the `3420` branch (https://github.com/ANHIG/IMGTHLA/tree/3420).
``` bash
mkdir -p hisat_index/hisatgenotype_db/
cd hisat_index/hisatgenotype_db/
git clone https://github.com/ANHIG/IMGTHLA HLA
cd HLA
git checkout 3420  # change this to the branch of the version you want
git lfs install
git lfs pull
cd ../../..
```
If `git lfs` is not found, install it from https://git-lfs.github.com/
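The branch name in the example is the release number with the dots removed. A tiny helper for that conversion could look like the sketch below; the rule is inferred only from the `3.42.0` and `3420` pair, so check the branch list of the repo if the checkout fails. `hla_version_to_branch` is a hypothetical name.

```bash
# Convert a release number such as "3.42.0" into the branch name "3420".
# Assumption: the branch name is the release number without dots.
hla_version_to_branch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Usage:
# git checkout "$(hla_version_to_branch 3.42.0)"
```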
## Step3: Build Index and run
hisatgenotype will automatically build the index for you as long as there are no `hla.graph*` files in `hisat_index`.
``` bash
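hisat_index="./hisat_index"  # index directory prepared in Step1 and Step2
sample_name="sample"         # placeholder: prefix of your FASTQ files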
hisatgenotype --base hla \
--threads 30 \
--keep-alignment -v --keep-extract \
-z ${hisat_index} \
-1 ${sample_name}.R1.fq.gz \
-2 ${sample_name}.R2.fq.gz \
--out-dir ./tmp_hisat
```
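The command above expects the reads to be named `${sample_name}.R1.fq.gz` and `${sample_name}.R2.fq.gz`. If you start from a file path instead, the sample name can be derived as in this sketch; `sample_name_from_r1` and the sample `NA12878` are only illustrations, not part of hisat-genotype.

```bash
# Derive the sample name from an R1 file such as "/data/NA12878.R1.fq.gz".
# Assumption: reads follow the <sample>.R1.fq.gz naming used in this note.
# The directory part and the ".R1.fq.gz" suffix are both stripped.
sample_name_from_r1() {
    basename "$1" .R1.fq.gz
}

# Usage:
# sample_name="$(sample_name_from_r1 /data/NA12878.R1.fq.gz)"
# echo "$sample_name"
```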
### Index building output
``` txt
No hisat_index/hla_backbone.fa file found
Building hla Database
HLA-A's reference allele is A*03:01:01:01 on '+' strand of chromosome 6
HLA-B's reference allele is B*07:02:01:01 on '-' strand of chromosome 6
HLA-C's reference allele is C*07:02:01:03 on '-' strand of chromosome 6
...
U exon counts: {0: 5}
V exon counts: {0: 3, 1: 3, 2: 3}
W exon counts: {0: 11, 1: 11, 2: 11, 3: 11, 4: 11, 5: 11}
A: number of alleles is 6291.
Number of variants is 3012.
Length of additional sequences for haplotypes: 10718064
B: number of alleles is 7561.
Number of variants is 2973.
Length of additional sequences for haplotypes: 18906000
...
W: number of alleles is 11.
Number of variants is 84.
Length of additional sequences for haplotypes: 32571
Running Extraction for : hla
No hisat_index/hla.graph.1.ht2 file found
Running: hisat2-build -p 30 --snp hisat_index/hla.index.snp --haplotype hisat_index/hla.haplotype hisat_index/hla_backbone.fa hisat_index/hla.graph
```
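The log suggests two stages, each triggered by a missing file: the database extraction (`hla_backbone.fa`) and the graph build (`hla.graph.1.ht2`). The sketch below predicts which stages the next run will go through. It assumes these two files are the only triggers, which is inferred from the "No ... file found" messages and not from the hisat-genotype source; `hla_build_stages` is a hypothetical helper.

```bash
# Print the build stages expected on the next hisatgenotype run.
# Assumption: each stage runs only when the file named in the log is missing.
hla_build_stages() {
    local dir="$1"
    [ -e "$dir/hla_backbone.fa" ] || echo "database"
    [ -e "$dir/hla.graph.1.ht2" ] || echo "graph"
    return 0
}

# Usage: an empty output means the prebuilt HLA index should be reused.
# hla_build_stages ./hisat_index
```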
After the index has been built successfully, later runs of `hisatgenotype` reuse the prebuilt HLA index and are much quicker.
### Directory
``` txt
$ ls -alh hisat_index
total 19G
drwxrwxr-x. 6 linnil1 linnil1 4.0K Aug 14 18:11 .
drwxrwxr-x. 10 linnil1 linnil1 4.0K Aug 16 15:25 ..
-rw-rw-r--. 1 linnil1 linnil1 3.0G Aug 14 14:30 genome.fa
-rw-rw-r--. 1 linnil1 linnil1 6.3K Aug 14 14:31 genome.fa.fai
-rw-r--r--. 1 linnil1 linnil1 2.0G Jan 29 2018 genotype_genome.1.ht2
-rw-r--r--. 1 linnil1 linnil1 794M Jan 29 2018 genotype_genome.2.ht2
-rw-r--r--. 1 linnil1 linnil1 12K Jan 29 2018 genotype_genome.3.ht2
-rw-r--r--. 1 linnil1 linnil1 703M Jan 29 2018 genotype_genome.4.ht2
-rw-r--r--. 1 linnil1 linnil1 2.1G Jan 29 2018 genotype_genome.5.ht2
-rw-r--r--. 1 linnil1 linnil1 751M Jan 29 2018 genotype_genome.6.ht2
-rw-r--r--. 1 linnil1 linnil1 488M Jan 29 2018 genotype_genome.7.ht2
-rw-r--r--. 1 linnil1 linnil1 152M Jan 29 2018 genotype_genome.8.ht2
-rw-r--r--. 1 linnil1 linnil1 225K Jan 29 2018 genotype_genome.allele
-rw-r--r--. 1 linnil1 linnil1 0 Jan 29 2018 genotype_genome.clnsig
-rw-r--r--. 1 linnil1 linnil1 5.6K Jan 29 2018 genotype_genome.coord
-rw-r--r--. 1 linnil1 linnil1 3.0G Jan 29 2018 genotype_genome.fa
-rw-r--r--. 1 linnil1 linnil1 6.3K Jan 29 2018 genotype_genome.fa.fai
-rw-r--r--. 1 linnil1 linnil1 555M Jan 29 2018 genotype_genome.haplotype
-rw-r--r--. 1 linnil1 linnil1 441M Jan 29 2018 genotype_genome.index.snp
-rw-r--r--. 1 linnil1 linnil1 4.1M Jan 29 2018 genotype_genome.link
-rw-r--r--. 1 linnil1 linnil1 4.7K Jan 29 2018 genotype_genome.locus
-rw-r--r--. 1 linnil1 linnil1 189K Jan 29 2018 genotype_genome.partial
-rw-r--r--. 1 linnil1 linnil1 441M Jan 29 2018 genotype_genome.snp
drwxr-xr-x. 2 linnil1 linnil1 4.0K Mar 17 2016 grch38
-rw-rw-r--. 1 linnil1 linnil1 4.0G Aug 14 14:27 grch38.tar.gz
drwxr-xr-x. 3 linnil1 linnil1 16 Aug 14 16:24 hisatgenotype_db
-rw-rw-r--. 1 linnil1 linnil1 279K Aug 14 18:07 hla.allele
-rw-rw-r--. 1 linnil1 linnil1 211K Aug 14 18:07 hla_backbone.fa
drwxrwxr-x. 4 linnil1 linnil1 55 Aug 14 14:02 HLA_backup
-rw-rw-r--. 1 linnil1 linnil1 39M Aug 14 18:10 hla.graph.1.ht2
-rw-rw-r--. 1 linnil1 linnil1 15M Aug 14 18:10 hla.graph.2.ht2
-rw-rw-r--. 1 linnil1 linnil1 314 Aug 14 18:07 hla.graph.3.ht2
-rw-rw-r--. 1 linnil1 linnil1 52K Aug 14 18:07 hla.graph.4.ht2
-rw-rw-r--. 1 linnil1 linnil1 598K Aug 14 18:14 hla.graph.5.ht2
-rw-rw-r--. 1 linnil1 linnil1 140K Aug 14 18:14 hla.graph.6.ht2
-rw-rw-r--. 1 linnil1 linnil1 3.5M Aug 14 18:07 hla.graph.7.ht2
-rw-rw-r--. 1 linnil1 linnil1 98K Aug 14 18:07 hla.graph.8.ht2
-rw-rw-r--. 1 linnil1 linnil1 208M Aug 14 18:14 hla.graph.rf
-rw-rw-r--. 1 linnil1 linnil1 6.2M Aug 14 18:07 hla.haplotype
-rw-rw-r--. 1 linnil1 linnil1 461K Aug 14 18:07 hla.index.snp
-rw-rw-r--. 1 linnil1 linnil1 14M Aug 14 18:07 hla.link
-rw-rw-r--. 1 linnil1 linnil1 3.3K Aug 14 18:07 hla.locus
-rw-rw-r--. 1 linnil1 linnil1 139K Aug 14 18:07 hla.partial
-rw-rw-r--. 1 linnil1 linnil1 153M Aug 14 18:07 hla_sequences.fa
-rw-rw-r--. 1 linnil1 linnil1 720K Aug 14 18:07 hla.snp
-rw-rw-r--. 1 linnil1 linnil1 267K Aug 14 18:07 hla.snp.freq
-rw-rw-r--. 1 linnil1 linnil1 68 Aug 14 17:06 hla.version
```
### Bugs
If you encounter bugs, check the previous steps, especially this one:
``` bash
echo '{"sanity_check": false}' > hisatgenotype/devel/settings.json
```
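To confirm that the setting is actually in place, a quick text check like the sketch below can be used. It is a plain pattern match and not a JSON parser, so it assumes the file looks like the one written by the `echo` command above; `sanity_check_disabled` is a hypothetical name.

```bash
# Return 0 when the settings file sets "sanity_check" to false.
# Assumption: the file is the small JSON object written by the echo above.
sanity_check_disabled() {
    grep -Eq '"sanity_check"[[:space:]]*:[[:space:]]*false' "$1"
}

# Usage:
# sanity_check_disabled hisatgenotype/devel/settings.json \
#     || echo "sanity_check is not disabled" >&2
```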