[log] Sample + Azure:CPU-D8s-v3

[log] Sample + Azure:CPU-D8s-v3 === ###### tags: `Parabricks-v3.5` ###### tags: `基因體`, `NVIDIA`, `Clara`, `Parabricks`, `二級分析` [TOC] ## 資料集來源 [parabricks sample](/Bx3R_i2wSfmDl1cQE0w3OA#Sample) ## Azure VM 硬體 - D8s v3 系列 ![](https://i.imgur.com/9rfYsCW.png) ### CPU 資訊 - model name: `Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz` - 1(插槽) x 4(核/插槽) x 2(超執行緒/核) = 8條超執行緒 - physical id: 0 - cpu cores: 4 - processor: 0~7 - 超執行緒/核: 2 (每核是雙執行緒) ## CPU 工具 ### GATK - [4.2.0.0](https://github.com/broadinstitute/gatk/releases) #latest 2016/06/25 ```bash= wget https://github.com/broadinstitute/gatk/releases/download/4.2.0.0/gatk-4.2.0.0.zip ``` - 安裝 gatk ```bash= # gatk file: #!/usr/bin/env python sudo apt install -y python2 sudo ln -s /usr/bin/python2.7 /usr/bin/python python --version # sudo apt install unzip unzip gatk-4.2.0.0.zip sudo ln -s <full_path>/gatk-4.2.0.0/gatk /usr/bin/gatk gatk --version ``` ### BWA-MEM - 台大基因案版本 - 0.7.17-r1188 ``` Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.17-r1188 Contact: Heng Li <lh3@sanger.ac.uk> ``` - [[github] lh3 / bwa](https://github.com/lh3/bwa) ### VCF 工具 - bcftools ``` sudo apt install bcftools ``` - vcftools ``` sudo apt install -y vcftools ``` `#vcf-concat` `#vcf-stats` <hr> ## [CPU 指令](https://docs.nvidia.com/clara/parabricks/v3.5/text/germline_pipeline.html#compatible-cpu-based-bwa-mem-gatk4-commands) (VM: D8s-v3 + germline) :::warning :warning: **注意：sample 給的路徑錯誤** - 修改前：S1_1.fastq.gz S1_2.fastq.gz - 修改後：Data/sample_1.fq.gz Data/sample_2.fq.gz ::: ### commands > `nohup sh run_gatk_cpu_with_sample.sh | tee run_gatk_cpu_with_sample.log &` ```shell= start_time=`date +%s` echo start_time=$start_time # 建立暫存工作資料夾 rm -rf tmp-dir mkdir tmp-dir # Run bwa-mem and pipe output to create sorted bam bwa mem -t 32 -K 10000000 -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \ Ref/Homo_sapiens_assembly38.fasta Data/sample_1.fq.gz Data/sample_2.fq.gz | gatk \ SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM 5000000 -I /dev/stdin \ -O cpu.bam --SORT_ORDER coordinate --TMP_DIR tmp-dir # Mark Duplicates gatk MarkDuplicates --java-options -Xmx30g -I cpu.bam -O mark_dups_cpu.bam \ -M metrics.txt --TMP_DIR tmp-dir # Generate BQSR Report gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \ recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \ --reference Ref/Homo_sapiens_assembly38.fasta # Run ApplyBQSR Step gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \ -I mark_dups_cpu.bam --bqsr-recal-file recal_cpu.txt -O cpu_nodups_BQSR.bam #Run Haplotype Caller gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \ result_cpu.vcf --reference Ref/Homo_sapiens_assembly38.fasta \ --native-pair-hmm-threads 16 end_time=`date +%s` echo end_time=$end_time echo $end_time - $start_time | bc ``` ### `gatk-4.1.0.0` 的錯誤訊息 > ``` $ gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \ > recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \ > --reference Ref/Homo_sapiens_assembly38.fasta Using GATK jar /uploads/everythings/misc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar ... ... 10:14:21.533 INFO BaseRecalibrator - Shutting down engine [June 25, 2021 at 10:14:21 AM UTC] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.02 minutes. Runtime.totalMemory()=645922816 Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method 'org.broadinstitute.hellbender.utils.read.GATKRead lambda$identity$d67512bf$1(org.broadinstitute.hellbender.utils.read.GATKRead)' at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef at org.broadinstitute.hellbender.transformers.ReadTransformer.identity(ReadTransformer.java:30) at org.broadinstitute.hellbender.engine.GATKTool.makePreReadFilterTransformer(GATKTool.java:290) at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:319) at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:88) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205) at org.broadinstitute.hellbender.Main.main(Main.java:291) ``` > Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method 'org.broadinstitute.hellbender.utils.read.GATKRead lambda$identity$d67512bf$1(org.broadinstitute.hellbender.utils.read.GATKRead)' at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef ### `gatk-4.2.0.0` 的錯誤訊息不支援 -I=XXX or --input=XXX 等號這種參數指定用法 ``` A USER ERROR has occurred: Can't parse option name containing an embedded '=' (bqsr-recal-file=recal_cpu.txt) ``` ### 注意參數 | 指令 | 暫存目錄 | | --------------------- | ------------------- | | `bwa mem` | `--TMP_DIR tmp-dir` | | `gatk SortSam` | `--TMP_DIR tmp-dir` | | `gatk MarkDuplicates` | `--TMP_DIR tmp-dir` | | `gatk BaseRecalibrator` | `--tmp-dir tmp-dir` | | `gatk ApplyBQSR` | `--tmp-dir tmp-dir` :star: | | `gatk HaplotypeCaller` | `--tmp-dir tmp-dir` | <hr> ## 官方樣本測試時間 ### 沒有 ApplyBQSR，須重測 > ApplyBQSR 的輸入參數錯誤 > - BaseRecalibrator: --output recal_cpu.txt > - ApplyBQSR: --bqsr-recal-file=recal_file.txt > > 沒有對接上 | 回合 | 秒數 | 時間 | | --- | ---- | --- | | 1 | 4225 | 1h 10m 25s | | 2 | 4214 | 1h 10m 14s | | 3 | 4183 | 1h 09m 43s | | 平均 | 4207 | 1h 10m 07s | ### GATK-4.2.0.0 重測「官方測試樣本」結果 | Program | Round1 | Round2 | Round3 | Average | | ---------------- | ---------- | ---------- | ---------- | ---------- | | BWA-MEM+sort | 0h 34m 19s | 0h 34m 32s | 0h 34m 38s | 0h 34m 30s (5.5%) | | MarkDuplicates | 0h 06m 46s | 0h 07m 21s | 0h 07m 17s | 0h 07m 8s (1.1%) | | BaseRecalibrator | 0h 17m 41s | 0h 16m 09s | 0h 16m 48s | 0h 16m 53s (2.7%) | | ApplyBQSR | 0h 16m 22s | 0h 16m 26s | 0h 17m 32s | 0h 16m 47s (2.7%) | | HaplotypeCaller | 8h 54m 22s | 9h 35m 03s | 8h 54m 23s | 9h 07m 56s (87.9%) | | **Total** | **10h 09m 30s** | **10h 49m 31s** | **10h 10m 38s** | **10h 23m 14s** - program 印出的時間，都比夾出的時間少 2~4 秒 - BWA-MEM+sort: 多執行緒 [![](https://i.imgur.com/HpcGH7h.png)](https://i.imgur.com/HpcGH7h.png) - MarkDuplicates: 單執行緒 ![](https://i.imgur.com/WG9Kp6t.png) - HaplotypeCaller 一直都使用單執行緒，所以才會花這麼多時間？ [![](https://i.imgur.com/svgz80Y.png)](https://i.imgur.com/svgz80Y.png) [![](https://i.imgur.com/yaE9Rou.png)](https://i.imgur.com/yaE9Rou.png) [![](https://i.imgur.com/Mdao7Ey.png)](https://i.imgur.com/Mdao7Ey.png) 註記： 1. 發現預設是單執行緒 2. 如果 bam 檔是有 sorted + indexed，則此階段變成多執行緒 ### ApplyBQSR vs NotApplyBQSR | Program | ==GATK (預設) (+ApplyBQSR) average== | GATK -ApplyBQSR +SortSam2 | ==GATK (預設) +SortSam1== | Parabricks NC12s-v2 | GATK (預設) +SortSam1 +SortSam2 | | ---------------- | :--------: | :--------: | :--------: | :--------: | :--------: | | BWA-MEM+sort | 0h 34m 30s | 0h 34m 42s | 0h 34m 42s | 05m 06s | 0h 34m 48s | | SortSam1 | - | - | 0h 07m 40s | - | 0h 07m 37s | | MarkDuplicates | 0h 07m 08s | 0h 07m 27s | 0h 07m 35s | (計到下列) | 0h 06m 58s | | BaseRecalibrator | 0h 16m 53s | 0h 16m 39s | 0h 16m 27s | 00m 57s | 0h 16m 33s | | ApplyBQSR | 0h 16m 47s | - | 0h 16m 59s | - | - | | SortSam2 | - | 0h 07m 50s | - | - | 0h 07m 35s | | HaplotypeCaller | 9h 07m 56s | 2h 14m 34s | 2h 04m 41s | 05m 19s | 2h 14m 12s | | **Total** | **10h 09m 30s** | **3h 21m 12s** | **3h 28m 04s** | **11m 22s** | **3h 27m 43s** | - 備註 - 實驗2: 沒有 ApplyBQSR，需要加上 SortSam2 - 實驗3: MarkDuplicates 是不是要先 sort 才會比較快 - 實驗5: 針對實驗2，先sort，但看起來不會比較快 - 時間效益比 - GATK+A 是 Parabricks 的 53.7 倍 **(太誇張了!! 不太正常...)** - GATK-A 是 Parabricks 的 17.7 倍 - GATK+S1+A 是 Parabricks 的 18.3 倍 :+1: - Parabricks - 跑了 BQSR report，卻沒有套用 BQSR，為何？ ## 不同實驗間的效果比較 > - benchmark(truth): GATK (含 ApplyBQSR) > - #record=127720, #SNPs=110282, #indels=17466 > - GATK (含 ApplyBQSR) 簡記為 `GATK(+ApplyBQSR)` 或 `GATK+A` > - GATK (不含 ApplyBQSR) 簡記為 `GATK(-ApplyBQSR)` 或 `GATK-A` ### GATK(預設) vs GATK(-ApplyBQSR)(+SortSam2)  | type | TP | FP | FN | RECALL | PRECISION | |:-----:|:------:|:----:|:---:|:------:|:---------:| | SNP | 109718 | 5527 | 536 | 0.995 | 0.952 | | INDEL | 17087 | 886 | 379 | 0.978 | 0.951 | - SNP - PRECISION = 109718 / (109718+5527) = 0.9520 - RECALL = 109718 / (109718+536) = 0.9951 - #SNP = 109718 + 536 = 110254 < **110282 (real #SNP)** - INDEL - #INDEL = 17087 + 379 = 17466 = **17466 (real #SNP)** ### GATK(預設) vs Parabricks(-ApplyBQSR)  | type | TP | FP | FN | RECALL | PRECISION | |:-----:|:------:|:---:|:----:|:------:|:---------:| | SNP | 109640 | 286 | 614 | 0.994 | 0.997 | | INDEL | 16444 | 318 | 1022 | 0.941 | 0.981 | - Parabricks 雖然大幅降低了偽陽性，但也遺漏不少真陽案例 - GATK(-ApplyBQSR) 寧可誤抓，也不要有漏網之魚 ### GATK(預設) vs GATK(+SortSam1)  | type | TP | FP | FN | RECALL | PRECISION | |:-----:|:------:|:--:|:--:|:------:|:---------:| | SNP | 110254 | 0 | 0 | 1.0 | 1.0 | | INDEL | 17466 | 0 | 0 | 1.0 | 1.0 | ### VCF 計算方式 - ### [How Can I Count Snps In My Final Vcf Files](https://www.biostars.org/p/332366/) ``` $ bcftools stats result_cpu.vcf | head -n 35 # SN [2]id [3]key [4]value SN 0 number of samples: 1 SN 0 number of records: 127720 SN 0 number of no-ALTs: 0 SN 0 number of SNPs: 110282 SN 0 number of MNPs: 0 SN 0 number of indels: 17466 SN 0 number of others: 0 SN 0 number of multiallelic sites: 535 SN 0 number of multiallelic SNP sites: 28 ``` - #SNPs + #indels = 110282 + 17466 = 127748 > #records 有重疊 ``` $ bcftools query -f '%POS\n' result_cpu.vcf | wc -l 127720 ``` - ### [How Can I Count Snps In My Vcf Files?](https://www.biostars.org/p/49386/) ``` $ grep -v '#' result_cpu.vcf | wc -l 127720 ``` <hr> ## [v1] 帶有時間戳記的完整腳本(含 ApplyBQSR) > `nohup sh run_gatk_cpu_with_sample.sh | tee run_gatk_cpu_with_sample.log &` :::warning ```bash= printElapsedTime(){ label=$1 start_time=$2 end_time=$3 elapsed_time=`echo $end_time-$start_time | bc` hour=`echo $elapsed_time/3600 | bc` minute=`echo "($elapsed_time-$hour*3600)/60" | bc` second=`echo $elapsed_time-$hour*3600-$minute*60 | bc` echo "[$label] elapsed_time: ${hour}h ${minute}m ${second}s (${elapsed_time} sec)" } rm -rf tmp-dir mkdir tmp-dir export start_time0=`date +%s` echo start_time0=$start_time0 # ---------------------------------------- export start_time1=`date +%s` # Run bwa-mem and pipe output to create sorted bam bwa mem -t 32 -K 10000000 -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \ Ref/Homo_sapiens_assembly38.fasta Data/sample_1.fq.gz Data/sample_2.fq.gz | gatk \ SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM 5000000 -I /dev/stdin \ -O cpu.bam --SORT_ORDER coordinate --TMP_DIR tmp-dir export end_time1=`date +%s` printElapsedTime "[task1] bwa-mem" $start_time1 $end_time1 # ---------------------------------------- export start_time2=`date +%s` # Mark Duplicates gatk MarkDuplicates --java-options -Xmx30g -I cpu.bam -O mark_dups_cpu.bam \ -M metrics.txt --TMP_DIR tmp-dir export end_time2=`date +%s` printElapsedTime "[task2] Mark Duplicates" $start_time2 $end_time2 # ---------------------------------------- export start_time3=`date +%s` # Generate BQSR Report gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \ recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \ --reference Ref/Homo_sapiens_assembly38.fasta export end_time3=`date +%s` printElapsedTime "[task3] Generate BQSR Report" $start_time3 $end_time3 # ---------------------------------------- export start_time4=`date +%s` # Run ApplyBQSR Step gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \ -I mark_dups_cpu.bam --bqsr-recal-file recal_cpu.txt -O cpu_nodups_BQSR.bam export end_time4=`date +%s` printElapsedTime "[task4] ApplyBQSR" $start_time4 $end_time4 # ---------------------------------------- export start_time5=`date +%s` #Run Haplotype Caller gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \ result_cpu.vcf --reference Ref/Homo_sapiens_assembly38.fasta \ --native-pair-hmm-threads 16 export end_time5=`date +%s` printElapsedTime "[task5] Haplotype Caller" $start_time5 $end_time5 # ---------------------------------------- export end_time0=`date +%s` echo end_time0=$end_time0 printElapsedTime "Total time" $start_time0 $end_time0 ``` ::: <hr> ## [v2] 腳本(不含 ApplyBQSR) - `mark_dups_cpu.bam` 是沒有排序過的 (sorted) 若直接傳遞到 Haplotype Caller，會有 error ``` *********************************************************************** A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed. Please index all input files: samtools index /uploads/workspace/parabricks_sample/mark_dups_cpu.bam *********************************************************************** ``` - [Default GATK complains that my bam file isn't indexed](http://seqanswers.com/forums/showthread.php?t=14760) ```java= java -jar \ /home/efoss/sequencing/picard-tools-1.52/SortSam.jar \ VALIDATION_STRINGENCY=LENIENT \ INPUT=SRR098359.bam \ OUTPUT=SRR098359_sorted.bam \ SORT_ORDER=coordinate ``` ```java= java -jar SortSam.jar \ SO=coordinate \ INPUT=~/NGS/data/SRR062641.filt.sam \ OUTPUT=~/NGS/data/SRR062641.filt.bam \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true <--- ``` - [SortSam (Picard)](https://gatk.broadinstitute.org/hc/en-us/articles/360036510732-SortSam-Picard-) ``` java -jar picard.jar SortSam \ I=input.bam \ O=sorted.bam \ SORT_ORDER=coordinate ``` - ### 作法：將 `gatk SortSam` 取代 `gatk ApplyBQSR` :::warning ```bash= gatk SortSam \ -I mark_dups_cpu.bam \ -O mark_dups_cpu_sorted.bam \ --SORT_ORDER coordinate \ --CREATE_INDEX true #gatk HaplotypeCaller ... --input cpu_nodups_BQSR.bam gatk HaplotypeCaller ... --input mark_dups_cpu_sorted.bam ``` ::: <hr> ## [v3] 帶有時間戳記的完整腳本(enhanced) > `nohup sh run_gatk_sample.sh | tee run_gatk_sample.log &` :::warning ```bash= printElapsedTime(){ label=$1 start_time=$2 end_time=$3 elapsed_time=`echo $end_time-$start_time | bc` hour=`echo $elapsed_time/3600 | bc` minute=`echo "($elapsed_time-$hour*3600)/60" | bc` second=`echo $elapsed_time-$hour*3600-$minute*60 | bc` echo "[$label] elapsed_time: ${hour}h ${minute}m ${second}s (${elapsed_time} sec)" } rm -rf tmp-dir mkdir tmp-dir export start_time0=`date +%s` echo start_time0=$start_time0 # ---------------------------------------- export start_time1=`date +%s` # Run bwa-mem and pipe output to create sorted bam bwa mem -t 32 -K 10000000 -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \ Ref/Homo_sapiens_assembly38.fasta Data/sample_1.fq.gz Data/sample_2.fq.gz | gatk \ SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM 5000000 -I /dev/stdin \ -O cpu.bam --SORT_ORDER coordinate --CREATE_INDEX true --TMP_DIR tmp-dir export end_time1=`date +%s` printElapsedTime "[task1] bwa-mem" $start_time1 $end_time1 # ---------------------------------------- export start_time2=`date +%s` # SortSam gatk SortSam \ -I cpu.bam \ -O cpu_sorted.bam \ --SORT_ORDER coordinate \ --CREATE_INDEX true \ --TMP_DIR tmp-dir export end_time2=`date +%s` printElapsedTime "[task2] SortSam1" $start_time2 $end_time2 # ---------------------------------------- export start_time3=`date +%s` # Mark Duplicates gatk MarkDuplicates --java-options -Xmx30g -I cpu_sorted.bam -O mark_dups_cpu.bam \ -M metrics.txt --TMP_DIR tmp-dir export end_time3=`date +%s` printElapsedTime "[task3] Mark Duplicates" $start_time3 $end_time3 # ---------------------------------------- export start_time4=`date +%s` # Generate BQSR Report gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \ recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \ --reference Ref/Homo_sapiens_assembly38.fasta \ --tmp-dir tmp-dir export end_time4=`date +%s` printElapsedTime "[task4] Generate BQSR Report" $start_time4 $end_time4 # ---------------------------------------- export start_time5=`date +%s` # Run ApplyBQSR Step gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \ -I mark_dups_cpu.bam --bqsr-recal-file recal_cpu.txt -O cpu_nodups_BQSR.bam \ --tmp-dir tmp-dir export end_time5=`date +%s` printElapsedTime "[task5] ApplyBQSR" $start_time5 $end_time5 # ---------------------------------------- export start_time6=`date +%s` #Run Haplotype Caller gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \ result_cpu.vcf --reference Ref/Homo_sapiens_assembly38.fasta \ --native-pair-hmm-threads 16 \ --tmp-dir tmp-dir # threads: 16 -> 8 (for 8 vCPU) export end_time6=`date +%s` printElapsedTime "[task6] Haplotype Caller" $start_time6 $end_time6 # ---------------------------------------- export end_time0=`date +%s` echo end_time0=$end_time0 printElapsedTime "Total time" $start_time0 $end_time0 ``` :::