Genomics / File Formats
===
###### tags: `生物資訊`, `基因體`, `檔案格式 file format`, `bed`, `fasta`, `fastq`, `scaffold`, `contig`

[TOC]

## Files at each stage
1. Sequencer (samples are pooled and sequenced together)
2. → Produces BCL files (the raw base-call data that FASTQ, and later SAM, files are derived from)
3. → Compressed into BGZF files
4. →→→ Uploaded to the cloud
    ![](https://i.imgur.com/LpuZDRq.png)
5. → Decompressed back into BCL files
6. → The BCL files are analyzed and split into one FASTQ file per sample (the BCL analysis runs while the upload is still in progress)

    **Secondary analysis begins**
7. → The DNA reads in the FASTQ files are aligned to the whole-genome reference sequence, producing a SAM file
    - Nvidia: FASTQ (sequencing reads)
    [![](https://i.imgur.com/MsFRoeu.png)](https://drive.google.com/drive/folders/1wdHDLIiHThIyu4FXNVVyU61X3qQqxzja)
8. → The SAM file (human-readable) is converted to a BAM file (fast for machines to read)
    - Nvidia: BAM (mapped reads)
    [![](https://i.imgur.com/MsFRoeu.png)](https://drive.google.com/drive/folders/1wdHDLIiHThIyu4FXNVVyU61X3qQqxzja)
9. → Final base correction (base quality recalibration) is applied to the aligned reads to settle the final DNA sequence; in common pipelines such as GATK this step still outputs a BAM file, and a finalized sequence, when one is generated, is stored as FASTA
10. → Variant calling compares the corrected reads with the reference FASTA and produces a VCF file

- **Supplement**
    - fastq _vs._ sam _vs._ fasta
        - fastq: DNA fragments (reads) that have not been mapped yet
        - sam: DNA fragments that have been mapped, but without a final base determination
        - fasta: a finalized DNA sequence (for example the reference genome)
    - vcf
        - The differences between an individual's genome sequence and the whole-genome reference sequence
    - [Illumina 定序資料分析](https://www.cdc.gov.tw/En/File/Get/sqrAKrJg_Uq8Ki5B0HtO3g)
        > Raw sequencing files (reads data, FASTQ format) are quality-trimmed to remove adapters and low-quality or too-short sequences, then aligned (mapping) to the M. tuberculosis H37Rv reference sequence, which yields a sorted reads file (e.g. .bam). From the sorted reads, the list of SNPs and indels that differ from the reference is computed (file format: variant call format, vcf).

## ==All Formats==
- ### [[UCSC] Data File Formats](http://genome.ucsc.edu/FAQ/FAQformat.html)
    ![](https://i.imgur.com/yyOuYZq.png)
- ### [生信分析過程中這些常見文件（fastq/bed/gtf/sam/bam/wig）的格式以及查看方式你都知道嗎？](https://www.twblogs.net/a/5c7c2780bd9eee31cea5fdff)
    ![](https://i.imgur.com/UrPkRAI.png)

## ==BAM (Binary SAM)==
- [[UCSC] Data File Formats](http://genome.ucsc.edu/FAQ/FAQformat.html)
    - [BAM Track Format](http://genome.ucsc.edu/goldenPath/help/bam.html)
    - Sort and create an index for the BAM:
        ```
        # samtools 1.3 and later take the output file via -o
        samtools sort -o my.sorted.bam my.bam
        samtools index my.sorted.bam
        ```

## ==BCL (Binary Base Call)==
- [[illumina] Sequencing file formats that can be used for a variety of data analysis options](https://www.illumina.com/informatics/sequencing-data-analysis/sequence-file-formats.html)
- [[illumina] API Data Model Overview](https://developer.basespace.illumina.com/docs/content/documentation/rest-api/data-model-overview)

## ==BCF (Binary variant Call Format)==
- [VCF AND BCF](https://evomics.org/vcf-and-bcf/)
    - VCF is a text file
    - BCF is a binary file
    - The relationship mirrors that of SAM and BAM
- tools
    - [bcftools — utilities for variant calling and manipulating VCFs and BCFs.](http://samtools.github.io/bcftools/bcftools.html)
    - [基因组变异的表示形式](https://www.plob.org/article/11569.html)
        - BCFtools is a set of tools for working with the VCF and BCF formats. It provides many subcommands for different tasks; the ones the article's author uses most are:
            - mpileup + call: find variant sites against a reference genome
        - VCFtools: descriptive statistics, computation, filtering, and format conversion.

## ==BED (Browser Extensible Data)==
- [[UCSC] BED format](http://genome.ucsc.edu/FAQ/FAQformat.html#format1)
    ![](https://i.imgur.com/QLYLtLj.png)
- [根據bed檔案從fasta檔案中獲取基因序列](https://www.itread01.com/content/1542712282.html)
    > BED file: stores gene annotation information
    1. chrom - chromosome name
    2. chromStart - start position of the feature on the chromosome (numbering starts at 0)
    3. chromEnd - end position of the feature on the chromosome (this position itself is not included)
    4.
       name - gene name
- [BED (文件格式)](https://zh.m.wikipedia.org/zh-tw/BED_(%E6%96%87%E4%BB%B6%E6%A0%BC%E5%BC%8F))
- ### [怎麼把這樣的操作轉換成生物意義來解釋呢？](https://ithelp.ithome.com.tw/articles/10268645)
    ![](https://i.imgur.com/A25vhxQ.png)
    - We want to filter a variant database down to the variants that fall within 100 gene regions
    - We want to select, across the 23 pairs of chromosomes, only the variants that occur in exon regions
    - We want to find the variants located in transcription factors that regulate BRCA
    - We want to see which genes can be sequenced within the regions covered by vendor A's library-prep reagent
    - We have a gene associated with COVID infection and want to check whether a given patient's variant region happens to fall within it
- ### Information in each BED column
    - The 3 required columns:
        1. chrom: name of the chromosome or scaffold (e.g. chr3, chrY, chr2_random, or scaffold10671).
        2. chromStart: start position of the feature on the chromosome or scaffold (0-based).
        3. chromEnd: end position of the feature on the chromosome or scaffold (half-open interval: the start is included, the end is not).
    - The 9 optional columns:
        4. name: name of the BED line.
        5. score: score of the feature in the annotation data set (0-1000). The Genome Browser shades each region according to its score (the higher the score, the darker the gray).
        6. strand: strand direction, `+`, `-`, or `.` (`.` means the strand is undetermined).
        7. thickStart: start position of the CDS (coding region), i.e. the position of the start codon.
        8. thickEnd: The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
        9. itemRgb: RGB color value (e.g. 255,0,0), used for display in the Genome Browser.
        10. blockCount: number of exons (blocks) in the BED line.
        11. blockSizes: comma-separated list with blockCount entries; each entry is the number of bases in the corresponding exon.
        12. blockStarts: comma-separated list with blockCount entries; each entry is the start position of the corresponding exon, measured relative to chromStart.
- ### Example download
    - [TruSeq DNA Exome Targeted Regions Manifest v1.2 (BED Format)](https://support.illumina.com/sequencing/sequencing_kits/truseq-dna-exome/downloads.html)
      The archive is 3 MB; after extraction the content looks like this (214126 lines in total):
      ```
      chr1 12098 12258 CEX-chr1-12099-12258
      chr1 12553 12721 CEX-chr1-12554-12721
      chr1 13331 13701 CEX-chr1-13332-13701
      chr1 30334 30503 CEX-chr1-30335-30503
      chr1 35045 35544 CEX-chr1-35046-35544
      chr1 35618 35778 CEX-chr1-35619-35778
      chr1 69077 70017 CEX-chr1-69078-70017
      ...
      chrM 3306 4262 CEX-chrM-3307-4262
      chrM 4469 5511 CEX-chrM-4470-5511
      chrM 5903 7445 CEX-chrM-5904-7445
      chrM 7585 8266 CEX-chrM-7586-8266
      chrM 8365 9204 CEX-chrM-8366-9204
      ...
      chrX 200852 200983 CEX-chrX-200853-200983
      chrX 205397 205538 CEX-chrX-205398-205538
      chrX 207312 207445 CEX-chrX-207313-207445
      chrX 208163 208323 CEX-chrX-208164-208323
      chrX 209699 209887 CEX-chrX-209700-209887
      ...
      chrY 150853 150981 CEX-chrY-150854-150981
      chrY 155398 155536 CEX-chrY-155399-155536
      chrY 157313 157443 CEX-chrY-157314-157443
      chrY 158164 158321 CEX-chrY-158165-158321
      chrY 159700 159885 CEX-chrY-159701-159885
      ...
      ```

## ==BGEN (Binary GEN file)==
- [The BGEN format](https://www.well.ox.ac.uk/~gav/bgen_format/)
- [PDF](https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/bgen12formats.pdf)

## ==FASTA==
- Description
    > A confirmed / finalized gene sequence
- Other descriptions
    - [FASTA @ 有勁的基因資訊:: 痞客邦::](https://yourgene.pixnet.net/blog/post/83580005-fasta)
    - [[Wiki] FASTA](https://zh.wikipedia.org/wiki/FASTA%E6%A0%BC%E5%BC%8F)
    - [[Wiki][en] FASTA](https://en.wikipedia.org/wiki/FASTA)
        > stands for "FAST-All", because it works with any alphabet, an extension of the original "FAST-P" (protein) and "FAST-N" (nucleotide) alignment tools.
- File extensions
    - .fasta (fast-all)
    - .faa (fasta amino acid) ([Convert .fna file from NCBI to .fa or .fasta file](https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/52942))
    - .fna (fasta nucleic acid) ([Convert .fna file from NCBI to .fa or .fasta file](https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/52942))
- ### Reference Sequence Types
    - https://varnomen.hgvs.org/bg-material/refseq/
    - #### Prefix
        - #### DNA
            - g. = linear genomic reference sequence
            - o. = circular genomic reference sequence
            - m. = mitochondrial reference (special case of a circular genomic reference sequence)
            - c. = coding DNA reference sequence (based on a protein coding transcript)
            - n. = non-coding DNA reference sequence (based on a transcript not coding for a protein)
        - #### RNA
            - r. = RNA reference sequence
        - #### protein
            - p.
= protein reference sequence
- ### RefSeq categories
    - https://en.wikipedia.org/wiki/RefSeq

    | Category | Description |
    | -------- | ----------- |
    | NC | Complete genomic molecules |
    | NG | Incomplete genomic region |
    | NM | mRNA |
    | NR | ncRNA |
    | NP | Protein |
    | XM | predicted mRNA model |
    | XR | predicted ncRNA model |
    | XP | predicted Protein model (eukaryotic sequences) |
    | WP | predicted Protein model (prokaryotic sequences) |

## ==FASTQ==
- ### [[UCSC] Data File Formats](http://genome.ucsc.edu/FAQ/FAQformat.html)
    - [FASTQ Format Specification](http://maq.sourceforge.net/fastq.shtml)
- ### [FASTQ格式簡介@ 有勁的基因資訊:: 痞客邦::](https://yourgene.pixnet.net/blog/post/84563506-fastq%E6%A0%BC%E5%BC%8F%E7%B0%A1%E4%BB%8B)
    > FASTA format + quality values
- ### [[NGS]NGS的產物-FASTQ格式介紹 @ 威健生技Welgene](https://welgene.blogspot.com/2012/05/ngsngs-fastq.html)
- ### [[Wiki] FASTQ格式](https://zh.wikipedia.org/wiki/FASTQ%E6%A0%BC%E5%BC%8F)
- ### [了解fastq文件](https://cloud.tencent.com/developer/article/1922544)
    - N stands for a base that could not be identified during sequencing
- ### [DNA/RNA 序列中的「N」代表什麼呢? - 有勁的基因資訊](https://yourgene.pixnet.net/blog/post/82260503)
    #ATCG+N
    ![](https://i.imgur.com/KipZuKM.png)
- ### [[ASUS] NVIDIA Clara Parabricks 3.7 - UXQ](https://docs.google.com/presentation/d/1SyazL8odDrnYuoCJ5sM5WYMBQngFxlhfRg257Q_EVII/edit#slide=id.g128ddc5b935_0_7)
    - Each sequencing read consists of 4 lines
        - Line 1: annotation, starting with @, describing the instrument name, flowcell information, sample information, and so on
        - Line 2: the sequenced bases, made up of the five letters A, T, C, G and N (N: a base that could not be identified)
        - Line 3: just a separator (usually the single character '+', optionally followed by the same information as line 1); it carries no meaning
        - Line 4: the quality scores for line 2 (sequencing quality varies because it depends on chemistry and image processing)
- Editing FASTQ
    - [RNA-Sick@Day9 > 斷開序列，斷開一切的牽連｜把品質不佳的序列剔除掉 feat. Trimmomatic](https://ithelp.ithome.com.tw/articles/10219913)
- Illumina FASTQ file standard
    ![](https://i.imgur.com/CFuQES6.png)
    ([newer information](https://support.illumina.com/help/BaseSpace_Sequence_Hub/Source/Informatics/BS/UploadFastqReq_swBS.htm?Highlight=fastq))
- [基因组的那些事儿--基础](https://www.jianshu.com/p/bf871522ea20)
    ![](https://i.imgur.com/x1Tx90o.png)

## ==GVCF (Genomic Variant Call Format)==
- [[GATK] General comparison of VCF vs.
GVCF](https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format)

## ==GVF (Genome Variation Format)==
- [Genome Variation Format 1.10](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md)
- [RSAT - convert-variations manual](http://rsat.eead.csic.es/plants/help.convert-variations.html#Genome-Variant-Format-GVF)
    ![](https://i.imgur.com/QvjpsvM.png)
    Three formats for polymorphic variations:
    - Genome Variant Format (GVF)
    - Variant Call Format (VCF)
    - RSAT variation format (varBed)
    - Judging from the tool above, these appear to be convertible into one another

## ==PAF==
- [PAF: Pairwise mApping Format](https://github.com/lh3/miniasm/blob/master/PAF.md)
    - #### Overview
        - A format for the results of comparing sequences pair by pair
        - "Pairwise" means each record describes one pair: a source (query) sequence mapped to a target sequence. The sequences are handled one after another, as in `for (seq in seqList)`
        - The idea resembles "bitwise", where bits are processed one after another, as in `for (bit in bits)`
    - #### What PAF is for
        - Describing the approximate mapping positions between two sets of sequences
    - #### PAF format
        - Plain text; the fields of each line are separated by tabs
        - Meaning of each field

        | Field | Meaning |
        |-----|-----------|
        | 1 | Source (query) sequence: name |
        | 2 | Source (query) sequence: length |
        | 3 | Source (query) sequence: start position (0-based) |
        | 4 | Source (query) sequence: end position (0-based, exclusive) |
        | 5 | Relative strand: `+` if the source and target sequences run in the same direction (both 5'->3'), `-` if they run in opposite directions (one 5'->3', the other 3'->5') |
        | 6 | Target sequence: name |
        | 7 | Target sequence: length |
        | 8 | Target sequence: start position (0-based) |
        | 9 | Target sequence: end position (0-based, exclusive) |
        | 10 | Number of matching bases (residues) |
        | 11 | Alignment block length: total number of matching bases, mismatching bases, and gaps |
        | 12 | Mapping quality (0-255; 255 means missing) |
- [Manual Reference Pages - minimap2 (1)](https://lh3.github.io/minimap2/minimap2.html)

## ==QC report==
- [RNAseq’s 資料前處理:Quality Control](https://weitinglin.com/2016/01/17/rnaseqs-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86quality-control/)
- [Incorrect encoding of Phred scores](https://sequencing.qcfail.com/articles/incorrect-encoding-of-phred-scores/)
    - Illumina 1.5 (or lower): Phred+64 encoding
    - Illumina 1.9/Sanger: Phred+33 encoding
    - [Illustration](https://drive5.com/usearch/manual/quality_score.html)
      ![](https://i.imgur.com/19G3HR5.png)

## ==SAM / BAM / BGZF==
- ### SAM
    > After aligning NGS data (reads data) to a reference sequence by whatever method, how do we represent the alignment result? The answer is the familiar SAM file. SAM stands for Sequence Alignment/Map
    - [SAM
file format @ 有勁的基因資訊 :: 痞客邦 ::](https://yourgene.pixnet.net/blog/post/93887292-sam-file-format)
    - [[我們的基因體時代 OUR "GENE"RATION] 常見的Alignment Genomic Data Archive Format](https://weitinglin.com/2016/01/27/sambam-and-cram/)
        ![](https://i.imgur.com/VcYnzht.png)
        - SAM (Sequence Alignment/Map): human-readable
        - BAM (Binary Alignment/Map): compressed into binary
        - CRAM (compressed reference-oriented alignment map)
            - CRAM is a newer alternative to BAM that stores only the differences from the reference (similar in concept to git diff)
    - How to read the format: https://www.itread01.com/content/1534522933.html
    - SAM format spec: https://samtools.github.io/hts-specs/SAMv1.pdf
- ### BGZF
    - [1.3 BAM的压缩和索引算法](https://core.ac.uk/download/pdf/41458599.pdf)
        - BAM is compressed with the BGZF library, which is implemented so that it stays compatible with the gzip/zlib compression standard. BGZF aims to provide a good compression ratio while still allowing fast random access to BAM.
        - The BGZF used by BAM is a general-purpose, indexable compression format. A BGZF file is divided into gzip/zlib blocks of up to 64 KB each, and these blocks are concatenated to form the final file.
    - [如何优雅的随机读 gzip 压缩文件](https://zhuanlan.zhihu.com/p/62302150)
        > There is now a format called Block GZip Format (BGZF for short)

## ==VCF==
- [[UCSC] Data File Formats](http://genome.ucsc.edu/FAQ/FAQformat.html)
    - [VCF+tabix Track Format](http://genome.ucsc.edu/goldenPath/help/vcf.html)
- [[HackMD] VCF Format](/6rATKTvURVSKia8K_9kBeQ)
- tools
    - [基因组变异的表示形式](https://www.plob.org/article/11569.html)
        - VCFtools: descriptive statistics, computation, filtering, and format conversion.

# [UCSC - Data File Formats :+1:](http://genome.ucsc.edu/FAQ/FAQformat.html#format1)
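
The FASTQ section describes each read as four lines, and the QC report section lists the Phred+33 and Phred+64 quality encodings. The Python sketch below is one possible way to read such records and decode line 4 into numeric scores. It is only an illustration: `FastqRecord`, `parse_fastq`, and the example read are hypothetical names and data, not part of any tool mentioned in this note, and real files usually call for an established parser.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str             # line 1 without the leading '@'
    sequence: str         # line 2: bases (A/T/C/G/N)
    qualities: List[int]  # line 4 decoded into Phred scores


def parse_fastq(handle, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses exactly 4 lines per read.

    offset=33 assumes Phred+33 (Illumina 1.9/Sanger);
    pass offset=64 for older Phred+64 files.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line also stops this sketch)
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one score: ASCII code minus offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical single-read example, made up for illustration.
EXAMPLE = "@read1 hypothetical\nACGTN\n+\nIIII#\n"
records = list(parse_fastq(io.StringIO(EXAMPLE)))
print(records[0].qualities)  # [40, 40, 40, 40, 2]
```

With Phred+33, `I` (ASCII 73) decodes to 40 and `#` (ASCII 35) decodes to 2, which is why the unidentified base `N` in the example carries the lowest score. Decoding a Phred+64 file with the default offset would shift every score by 31, which is the kind of encoding mix-up the QC report links warn about.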