Datasets
===
###### tags: `Parabricks`
###### tags: `Genomics`, `NVIDIA`, `Clara`, `Parabricks`, `Secondary analysis`, `WGS`, `WES`, `Somatic`, `pipeline`, `fastq-dump`, `NA12878`, `HG002`, `SRR7890824 + SRR7890827`, `SRR3406492`

<br>

[TOC]

<br>

## References
### b37
> source: https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37;tab=objects?prefix=&pli=1&forceOnObjectsSortingFiltering=false
> 216 files, 53.2 GiB

- ### Install the gsutil package
    ```
    $ pip install gsutil
    $ gsutil version -l
    ...
    compiled crcmod: True
    ...
    ```
    - `$ gsutil version -l`
        :::spoiler
        ```
        $ gsutil version -l
        gsutil version: 5.17
        checksum: PACKAGED_GSUTIL_INSTALLS_DO_NOT_HAVE_CHECKSUMS (!= 73ec35fa8706dcc7850a8fc7c6c2caa3)
        boto version: 2.49.0
        python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
        OS: Linux 4.12.14-041214-generic
        multiprocessing available: True
        using cloud sdk: False
        pass cloud sdk credentials to gsutil: False
        config path(s): No config found
        gsutil path: /usr/local/bin/gsutil
        compiled crcmod: False
        installed via package manager: True
        editable install: False
        shim enabled: False
        ```
        :::

- ### [warning] but your crcmod installation isn't using the module's C extension
    - `gsutil cp -r gs://gatk-legacy-bundles/b37`
        :::spoiler
        ```
        $ gsutil cp -r gs://gatk-legacy-bundles/b37 .
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf...
        ==> NOTE: You are downloading one or more large file(s), which would run significantly faster if you enabled sliced object downloads. This feature is enabled by default but requires that compiled crcmod be installed (see "gsutil help crcmod").
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf.gz.md5...
        \ [3 files][241.2 MiB/241.2 MiB] 15.0 MiB/s
        ==> NOTE: You are performing a sequence of gsutil operations that may run significantly faster if you instead use gsutil -m cp ... Please see the -m section under "gsutil help options" for further information about when gsutil -m can be advantageous.
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf.idx...
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf.idx.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_omni2.5.b37.vcf.idx.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf.idx...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf.idx.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.indels.b37.vcf.idx.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf.idx...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf.idx.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase1.snps.high_confidence.b37.vcf.idx.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase3_v4_20130502.sites.vcf...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase3_v4_20130502.sites.vcf.gz...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase3_v4_20130502.sites.vcf.gz.tbi...
        Copying gs://gatk-legacy-bundles/b37/1000G_phase3_v4_20130502.sites.vcf.idx...
        Copying gs://gatk-legacy-bundles/b37/Broad.human.exome.b37.bed...
        Copying gs://gatk-legacy-bundles/b37/Broad.human.exome.b37.interval_list.gz...
        Copying gs://gatk-legacy-bundles/b37/Broad.human.exome.b37.interval_list.gz.md5...
        Copying gs://gatk-legacy-bundles/b37/ExAC.r0.3.nonTCGA.sites.vep.b37.vcf.gz...
        CommandException: GiB/ 17.5 GiB] 11.1 MiB/s
        Downloading this composite object requires integrity checking with CRC32c, but your crcmod installation isn't using the module's C extension, so the hash computation will likely throttle download performance. For help installing the extension, please see "gsutil help crcmod".
        To download regardless of crcmod performance or to skip slow integrity checks, see the "check_hashes" option in your boto config file.
        NOTE: It is strongly recommended that you not disable integrity checks. Doing so could allow data corruption to go undetected during uploading/downloading.
        ```
        :::
    - Solution
        - [CRC32C and Installing crcmod](https://cloud.google.com/storage/docs/gsutil/addlhelp/CRC32CandInstallingcrcmod)
            ```
            $ gsutil version -l
            ...
            compiled crcmod: True
            ...
            ```
            - If your crcmod library is compiled to a native binary, this value will be `True`.
            - If using the pure-Python version, the value will be `False`.
        - Conclusion: check the crcmod status
            - `compiled crcmod: True` (native binary)
            - `compiled crcmod: False` (pure Python)

        <br>

        - If `compiled crcmod` shows `False`, crcmod has to be reinstalled before it becomes `True`
            ```bash
            $ sudo apt-get install -y gcc python3-dev python3-setuptools
            $ sudo pip3 uninstall -y crcmod
            $ sudo pip3 install --no-cache-dir -U crcmod
            ```
            - After reinstalling the crcmod package, everything worked normally
            - Neither the package version nor the install location changed

- ### Download the data
    ```
    gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
    ```

- ### Source
    - [Create a local copy of the GATK resource bundle](https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/docs/running_on_a_single_machine.md#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)

<br>

### hg38
> Google Cloud: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0
> 76 objects, 32.3 GiB

- Install the gsutil package
    > See the b37 instructions
- Download the data
    ```
    gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
    ```
- Source
    - [Create a local copy of the GATK resource
bundle](https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/docs/running_on_a_single_machine.md#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)

<br>

## NA24631
> [[github][ESR-NZ] human_genomics_pipeline / test / single / fastq /](https://github.com/ESR-NZ/human_genomics_pipeline/tree/master/test/single/fastq)

### Download source
```=
$ wget "https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/test/single/fastq/NA24631_1.fastq.gz?raw=true" -O NA24631_1.fastq.gz
$ wget "https://github.com/ESR-NZ/human_genomics_pipeline/blob/master/test/single/fastq/NA24631_2.fastq.gz?raw=true" -O NA24631_2.fastq.gz
```

<br>

## Parabricks Sample
> [Official dataset](https://docs.nvidia.com/clara/parabricks/v3.5/text/getting_started.html#step-4-example-run)

### Download source
```=
$ wget -O parabricks_sample.tar.gz \
    "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
$ tar -xvzf parabricks_sample.tar.gz
```
- AWS S3 -> Azure: average speed 46.40MB/s :exclamation::exclamation::exclamation: (yes, MB)
    - 9.24GB took 6m24s
- The download link given with the official dataset has expired; use the command above instead. The expired command, for reference:
    ```bash=
    wget -O parabricks_sample.tar.gz \
        "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz?Expires=1613069864&Signature=WxLeyitbvR%2B0rO4MX%2B0GohDw89g%3D&AWSAccessKeyId=AKIAJGDUNN2G2ZAH3Q3A"
    ```

### parabricks_sample.tar.gz
- md5: `a2b7d8f99fcc307f2e63a1699f01dbb9`

### germline
```bash=
pbrun germline \
    --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
    --in-fq parabricks_sample/Data/sample_1.fq.gz parabricks_sample/Data/sample_2.fq.gz \
    --knownSites parabricks_sample/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --out-bam output.bam \
    --out-variants output.vcf \
    --out-recal-file report.txt \
    --x3
```

<br>
<hr>
<br>

## WES
### HG002
- ### Download source
    - [[Google]
brain-genomics-public/research/sequencing/fastq/novaseq/wes_agilent/100x/HG002.novaseq.wes_agilent.100x.R1.fastq.gz](https://console.cloud.google.com/storage/browser/_details/brain-genomics-public/research/sequencing/fastq/novaseq/wes_agilent/100x/HG002.novaseq.wes_agilent.100x.R1.fastq.gz;tab=live_object)
    - [[Google] brain-genomics-public/research/sequencing/fastq/novaseq/wes_agilent/100x/HG002.novaseq.wes_agilent.100x.R2.fastq.gz](https://console.cloud.google.com/storage/browser/_details/brain-genomics-public/research/sequencing/fastq/novaseq/wes_agilent/100x/HG002.novaseq.wes_agilent.100x.R2.fastq.gz;tab=live_object)

<br>
<hr>
<br>

## WGS
> The copy on ESC4000 from the National Taiwan University (NTU) gene sequencing project

### WGS-LIS-AI018A
- ### Download source
    ```
    $ rsync --progress -zh \
        --partial --append \
        diatango_lin@10.78.26.241:/mnt/ssdraid/Gene/GATK/everythings-misc.zip .
    ```
    - `--progress`: show a progress indicator
    - `-z`: compress file data during the transfer
    - `-h`: print file sizes in human-readable form

- ### Download source 2 (copy created from ESC4000)
    > [Azure](https://portal.azure.com/) > [Storage accounts (classic)](https://portal.azure.com/#blade/HubsExtension/BrowseResource/resourceType/Microsoft.ClassicStorage%2FStorageAccounts) > mlstudioteststorage2 > [Container] wgs

    [![](https://i.imgur.com/vUmoe34.png)](https://i.imgur.com/vUmoe34.png)
    (very slow)
    - [~~WGS-LIS-AI018A_R1.fastq.gz~~](https://mlstudioteststorage2.blob.core.windows.net/wgs/WGS-LIS-AI018A_R1.fastq.gz)
    - [~~WGS-LIS-AI018A_R2.fastq.gz~~](https://mlstudioteststorage2.blob.core.windows.net/wgs/WGS-LIS-AI018A_R2.fastq.gz)

- ### WGS-LIS-AI018A_R1.fastq.gz
    - md5: `59444a39ffe845d390e110be16c4430a`

- ### WGS-LIS-AI018A_R2.fastq.gz
    - md5: `a3f1d78c156761f6a27bb74c0a4b87b1`

- ### germline
    ```bash=
    pbrun germline \
        --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
        --in-fq dataset/WGS-LIS-AI018A_R1.fastq.gz \
                dataset/WGS-LIS-AI018A_R2.fastq.gz \
        --knownSites parabricks_sample/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
        --out-bam output.bam \
        --out-variants output.vcf \
        --out-recal-file report.txt \
        --x3
    ```
    - Performance
        - [Azure:GPU-NC12s-v2(P100-16GB x2)](/fS60joh8TNqroKm2L_0WYg)

## NA24385 (HG002)
> https://precision.fda.gov/challenges/truth

- ### Download source
    - [[Google] brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz](https://console.cloud.google.com/storage/browser/_details/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz;tab=live_object)
    - [[Google] brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz](https://console.cloud.google.com/storage/browser/_details/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz;tab=live_object)

- ### Data details
    - [[NIST Reference Materials] Standardization to Ensure Accuracy](https://www.coriell.org/1/NIGMS/Collections/NIST-Reference-Materials)

| Coriell ID | NIST ID | iPSC ID (Parental Cell Type) | RM Number | Description | Gender | Race | Ethnicity | Relationship to proband | Collection |
|------------|---------|------------------------------|-----------------------|-------------|--------|-------|-------------------|-------------------------|-------------------------|
| NA12878 | HG001 | | RM8398 | CEPH/UTAH | Female | White | Utah/Mormon | Mother | Apparently Healthy |
| **NA24385** | **HG002** | GM26105 (LCL) GM27730 (PBMC) | RM8391 RM8392 (trio) | PGP | **Male** | White | Ashkenazim Jewish | Son | Personal Genome Project |
| NA24149 | HG003 | | RM8392 (trio) | PGP | Male | White | Ashkenazim Jewish | Father | Personal Genome Project |
| NA24143 | HG004 | GM26077 (LCL) | RM8392 (trio) | PGP | Female | White | Ashkenazim Jewish | Mother | Personal Genome Project |
| NA24631 | HG005 | GM26107 (LCL) | RM8393 | PGP | Male | Asian | Chinese | Son | Personal Genome Project |
| NA24694 | N/A | | N/A | PGP | Male | Asian | Chinese | Father | Personal Genome Project |
| NA24695 | N/A | | N/A | PGP | Female | Asian | Chinese | Mother | Personal Genome Project |

[![](https://i.imgur.com/ijsYQkh.png)](https://i.imgur.com/ijsYQkh.png)

- [Human HG002 (GM24385) dataset](https://labs.epi2me.io/dataindex/)
    > We are happy to announce the release of a nanopore sequencing dataset of the Genome in a Bottle human genome sample GM24385 (HG002).

<br>
<hr>
<br>

## NA12878 (HG001)
> https://precision.fda.gov/challenges/truth

- ### [Pedigree](https://blog.goldenhelix.com/the-state-of-ngs-variant-calling-dont-panic/utah-pedigree-1463-with-na12878/)
    ![](https://i.imgur.com/2i82OgL.png)

- ### [Sample NA12878](https://www.internationalgenome.org/data-portal/sample/NA12878)
    - Father: [NA12891](https://www.internationalgenome.org/data-portal/sample/NA12891) (SRR622/SRR622458)
    - Mother: [NA12892](https://www.internationalgenome.org/data-portal/sample/NA12892) (SRR622/SRR622459)
    - Child: [NA12878](https://www.internationalgenome.org/data-portal/sample/NA12878) (SRR622/SRR622457)
    > Select **PCR-free high coverage**, then download R1 and R2
    > [![](https://i.imgur.com/m7SoZxf.png)](https://i.imgur.com/m7SoZxf.png)
    > [![](https://i.imgur.com/GwuHTsQ.png)](https://i.imgur.com/GwuHTsQ.png)

- ### Q&A
    - [How to access specifically 30x NA12878 sequencing runs](https://www.biostars.org/p/368335/)

<br>
<hr>
<br>

## Somatic: SRR7890824+SRR7890827
> - [Somatic Pipeline](/JvREarrASMOQUcfeI54BWg)
> - [NCBI SRA](/XHcj7Iy_Rr2JouDENfga1g)

| File Name | Size (bytes) | MD5 |
| --------- | ----- | --- |
| `SRR7890824_1.fastq.gz` | 47972307728<br>(45G) (44.68GiB) | `66b4d19b071ecd6e7adcaff3972032ca` |
| `SRR7890824_2.fastq.gz` | 53556417175<br>(50G) (49.88GiB) | `0d61e81efec5f697068bf54678ecb37f` |
| `SRR7890827_1.fastq.gz` | 52052708052<br>(48G) (48.48GiB) | `d71237644ffe1b9229a7ab064d4b6d70` |
| `SRR7890827_2.fastq.gz` | 57180426937<br>(53G) (53.25GiB) | `35cb18bd1f532b3a4bfb5717ca0f1c4c` |
| **Total** | 210761859892<br>(196G) (196.2GiB) | |

- ### Network transfer: 17h 49m 21s in total
    - **[@ESC4000]** cp A B: 4 files totaling 196G take about 49 min (HDD, non-SSD)
        - cp HD:A SSD:B: 24m59s
    - **[@ESC4000]** split: 8 files take about 9m 23s
    - **[@ESC4000]** docker build: 8 files take about 49m 08s
    - **[@ESC4000]**
docker push ociscloud/datasets:* : about 5h 22m 22s (10.4MiB/s)
    - **[@work01]** docker pull ociscloud/datasets:* : about 5h 45m 35s (9.68MiB/s)
        - includes one retry: 76m
    - **[@work01]** docker push bioaimaker-registry.nhri.edu.tw/... : about 5h 17m 54s (10.52MiB/s)

- ### **[@NHRI prod]** image loading:

    | image | size | loading time |
    | ----- | ---- | ------------ |
    | SRR7890824-1-a | 22.19GB | 10m |
    | SRR7890824-1-b | 22.21GB | |
    | SRR7890824-2-a | 24.88GB | |
    | SRR7890824-2-b | 24.88GB | 11m 54s |
    | SRR7890827-1-a | 24.08GB | 1h 10m 50s |
    | SRR7890827-1-b | 24.09GB | |
    | SRR7890827-2-a | 26.52GB | |
    | SRR7890827-2-b | 26.57GB | |

<br>
<hr>
<br>

## SRR3406492
- ### Download source
    ```
    fastq-dump -X 5000 --split-files SRR3406492
    ```
    - `-X 5000` limits the dump to the first 5000 spots
    - Output
        - `SRR3406492_1.fastq`
        - `SRR3406492_2.fastq`
    - Source
        - [RNA-Sick@Day9 > 斷開序列,斷開一切的牽連|把品質不佳的序列剔除掉 feat. Trimmomatic](https://chenhsieh.com/post/bioinfo/09-trimmomatic/) (in Chinese; removing low-quality reads with Trimmomatic)

<br>
<hr>
<br>

## Uploading data
### scp
```bash=
$ scp -i parabricks-test_key.pem \
    parabricks.tar.gz \
    azureuser@70.37.107.238:/mnt/parabricks
```

<br>

### rsync (survives interruption, resumable)
```bash=
$ rsync -e 'ssh -i parabricks-test_key.pem' \
    --progress -zh --partial --append \
    WGS-LIS-AI018A_R* \
    azureuser@70.37.107.238:/mnt/parabricks
```
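
<br>

### md5sum (check a file after transfer)
Several files in this note have a recorded MD5, so a transferred copy can be compared against it on the receiving machine. The following is only a minimal sketch: it assumes GNU `md5sum` and `awk` are installed, and `verify_md5` is a hypothetical helper name that does not come from Parabricks, gsutil, or any other tool mentioned here.

```bash
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper (an assumption for illustration); requires GNU md5sum.
verify_md5() {
    vm_file="$1"
    vm_expected="$2"
    # md5sum prints "<hash>  <file>"; keep only the hash column
    vm_actual="$(md5sum "$vm_file" | awk '{print $1}')"
    if [ "$vm_actual" = "$vm_expected" ]; then
        echo "OK: $vm_file"
    else
        echo "MISMATCH: $vm_file (expected $vm_expected, got $vm_actual)" >&2
        return 1
    fi
}

# Example call, using an MD5 recorded in this note:
# verify_md5 parabricks_sample.tar.gz a2b7d8f99fcc307f2e63a1699f01dbb9
```

The function returns a non-zero status on a mismatch, so it can gate a later step, for example `verify_md5 FILE MD5 && tar -xvzf FILE`.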