品質控制(QC)

品質控制(QC) === ###### tags: `基因體/一級分析` ###### tags: `生物資訊`, `基因體`, `一級分析`, `品質控制`, `Quality Control`, `FastQC`, `定序深度`, `fastq`, `Trimmomatic` [TOC] ## Pre-processing of sequencing reads - ### [Pre-processing of sequencing reads](https://biocorecrg.github.io/RNAseq_course_2019/qc.html) [![](https://hackmd.io/_uploads/BJA80O8O3.png)](https://hackmd.io/_uploads/BJA80O8O3.png) ## FastQC ### 官方資料 - ### doc https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ - QC: Quality Control - FastQC: A quality control tool for high throughput sequence data. - [Downalod](https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc) - [Installation and setup instructions](https://raw.githubusercontent.com/s-andrews/FastQC/master/INSTALL.txt) - [[zip] FastQC v0.12.1 (Win/Linux zip file)](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip) - ### github https://github.com/s-andrews/FastQC ### 安裝方式 - [How To Install fastqc on Ubuntu 20.04](https://installati.one/install-fastqc-ubuntu-20-04/) ```bash sudo apt-get update sudo apt-get -y install fastqc ``` - 需要 timezone ``` ... debconf: falling back to frontend: Readline Configuring tzdata ------------------ Please select the geographic area in which you live. Subsequent configuration questions will narrow this down by presenting a list of cities, representing the time zones in which they are located. 1. Africa 2. America 3. Antarctica 4. Australia 5. Arctic 6. Asia 7. Atlantic 8. Europe 9. Indian 10. Pacific 11. SystemV 12. US 13. Etc Geographic area: ``` - [手動安裝FastQC-0.11.9到指定目錄](https://2formosa.blogspot.com/2021/12/Install-FastQC-on-Linux-manually.html) ### 執行方式 ```bash $ fastqc --version FastQC v0.11.9 # 每一個 fq.gz 對應一個 report $ fastqc file1.fastq.gz file2.fastq.gz file3.fastq.gz ``` ### 指令說明 - [FastQC - high throughput sequence QC analysis tool](https://manpages.debian.org/stretch/fastqc/fastqc.1) - -t --threads Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine 指定可以同時處理的檔案數量。每個執行緒將被分配250MB的記憶體，因此您不應該執行超過可用記憶體容量的執行緒數量，而在32位機器上不應該超過6個執行緒。 ### 教學 - ### [fastq期末報告說明.pptx](https://docs.google.com/presentation/d/1rC2wkAS6gwLAvqPFjA3ythWhRirl5uZBpw9cnin0D7s/) - ### [pE.DNA 定序](https://oj.nerde.pw/static2/nehs-toi-2018/NGS.pdf) > 國⽴科學⼯業園區實驗⾼級中學 2018 資奧⼊營考校內排名賽 pE.DNA 定序 > (提昇 QC 的規則) > - Phred quality score 是⽤來標⽰定序出來每個鹼基對的「品質」 - 過濾方法 - adapter filter - polyN filter - s35 filter - Q20 filter ## Trimmomatic ### 官網 - [Trimmomatic: A flexible read trimming tool for Illumina NGS data](http://www.usadellab.org/cms/?page=trimmomatic) - 安裝方式 - Version 0.39: [binary](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip) ![](https://hackmd.io/_uploads/r1AtWZVv3.png) - trimmomatic-0.39.jar - Quick start - Paired End `java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36` - 功能 - **ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.** 從讀取序列中剪切適配器和其他 Illumina 特定序列。 - **SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.** 執行滑動窗口修剪，當窗口內的平均品質低於閾值時進行剪切。 - **LEADING: Cut bases off the start of a read, if below a threshold quality** 如果序列開頭的品質低於閾值，則剪切序列開頭的鹼基。 - **TRAILING: Cut bases off the end of a read, if below a threshold quality** 如果序列末尾的品質低於閾值，則剪切序列末尾的鹼基。 - **CROP: Cut the read to a specified length** 將序列剪切為指定的長度。 - **HEADCROP: Cut the specified number of bases from the start of the read** 從序列開頭剪切指定數量的鹼基。 - **MINLEN: Drop the read if it is below a specified length** 如果序列長度低於指定的長度，則丟棄該序列。 - **TOPHRED33: Convert quality scores to Phred-33** 將品質分數轉換為 Phred-33 格式。 - **TOPHRED64: Convert quality scores to Phred-64** 將品質分數轉換為 Phred-64 格式。 - Manual - [Trimmomatic Manual: V0.32](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf) - For input and output files adding .gz/.bz2 to an extension tells Trimmomatic that the file is provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively. ### github (old) > [timflutre](https://github.com/timflutre) / [trimmomatic](https://github.com/timflutre/trimmomatic) - license GNU GENERAL PUBLIC LICENSE (在 zip 檔裡找到) - adapters - NexteraPE-PE.fa - TruSeq2-PE.fa - TruSeq2-SE.fa - TruSeq3-PE-2.fa - [TruSeq3-PE.fa](https://github.com/timflutre/trimmomatic/blob/master/adapters/TruSeq3-PE.fa) - TruSeq3-SE.fa ### github (new) - https://github.com/usadellab/Trimmomatic - ### build ```xml=33 <target name="compile" depends="import"> <javac srcdir="${src}" destdir="${dist_build}" debug="true" source="1.5" target="1.5" includeantruntime="false"> <classpath> <fileset dir="${dist_lib}"> <include name="*.jar" /> </fileset> </classpath> </javac> ... </target> ``` - source="1.5" target="1.5" - 要根據環境將 1.5 改成 1.8 (比如 241) - $ javac -version - $ java -version - [Error "Source option 5 is no longer supported. Use 6 or later" on Maven compile](https://stackoverflow.com/questions/53034953/) - ### OOM: java.lang.OutOfMemoryError: Java heap space ``` ILLUMINACLIP: Using 1 prefix pairs, 5 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences [28.174s][warning][gc,alloc] pool-1-thread-1: Retried waiting for GCLocker too often allocating 1048578 words Exception in thread "main" java.lang.Exception: Trim Stats Collector Exception at org.usadellab.trimmomatic.threading.trimstats.SelfThreadedTrimStatsCollector.run(SelfThreadedTrimStatsCollector.java:75) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.usadellab.trimmomatic.threading.trimstats.SelfThreadedTrimStatsCollector.run(SelfThreadedTrimStatsCollector.java:68) ... 1 more Caused by: java.lang.OutOfMemoryError: Java heap space at java.base/java.util.Arrays.copyOf(Arrays.java:3745) at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120) at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95) at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156) at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233) at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:303) at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:281) at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) at java.base/java.io.OutputStreamWriter.write(OutputStreamWriter.java:208) at java.base/java.io.BufferedWriter.flushBuffer(BufferedWriter.java:120) at java.base/java.io.BufferedWriter.write(BufferedWriter.java:233) at java.base/java.io.Writer.write(Writer.java:249) at org.usadellab.trimmomatic.fastq.FastqSerializer.writeRecord(FastqSerializer.java:58) at org.usadellab.trimmomatic.util.compression.UncompressedBlockData.<init>(UncompressedBlockData.java:26) at org.usadellab.trimmomatic.threading.BlockOfWork.processPE(BlockOfWork.java:154) at org.usadellab.trimmomatic.threading.BlockOfWork.process(BlockOfWork.java:235) at org.usadellab.trimmomatic.threading.BlockOfWork.call(BlockOfWork.java:247) at org.usadellab.trimmomatic.threading.BlockOfWork.call(BlockOfWork.java:16) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more ``` ### 範例 - ### case1 `$ java -jar trimmomatic-0.39.jar PE -phred33 SRR3406492_1.fastq SRR3406492_2.fastq SRR3406492_1_paired.fastq SRR3406492_1_unpaired.fastq SRR3406492_2_paired.fastq SRR3406492_2_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10` TrimmomaticPE: Started with arguments: -phred33 SRR3406492_1.fastq SRR3406492_2.fastq SRR3406492_1_paired.fastq SRR3406492_1_unpaired.fastq SRR3406492_2_paired.fastq SRR3406492_2_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 Multiple cores found: Using 2 threads Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 5000 Both Surviving: 4990 (99.80%) Forward Only Surviving: 10 (0.20%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%) TrimmomaticPE: Completed successfully - ### case 2 `-baseout test1/SRR3406492` 自動產生： ![](https://hackmd.io/_uploads/SkNKgewPh.png) - test1/SRR3406492_1P - test1/SRR3406492_1U - test1/SRR3406492_2P - test1/SRR3406492_2U - 1P, 1U, 2P, 2U 都是明碼，未加密過 - 即便是 SRR3406492_1.fastq.gz，所產生的 1P, 1U 也是明碼 ### 教學 - [RNA-Sick@Day9 > 斷開序列，斷開一切的牽連｜把品質不佳的序列剔除掉 feat. Trimmomatic](https://chenhsieh.com/post/bioinfo/09-trimmomatic/) - [[ithome] RNA-Sick@Day9 > 斷開序列，斷開一切的牽連｜把品質不佳的序列剔除掉 feat. Trimmomatic](https://ithelp.ithome.com.tw/articles/10219913) - 原理 ![](https://hackmd.io/_uploads/rkVL7WNP3.png) - [Trimmomatic 裁切功能簡介以及比對方式的疑問](https://summer-inequality.blogspot.com/2022/11/trimmomatic-adapter-and-quality-trimming.html) - Trimmomatic 是專門用於 Illumina 定序資料的品管軟體 - [Trimming and Filtering](https://carpentries-lab.github.io/metagenomics-analysis/03-trimming-filtering/index.html) ![](https://hackmd.io/_uploads/BkhXkPi9n.png) ``` $ trimmomatic PE JP4D_R1.fastq.gz JP4D_R2.fastq.gz \ JP4D_R1.trim.fastq.gz JP4D_R1un.trim.fastq.gz \ JP4D_R2.trim.fastq.gz JP4D_R2un.trim.fastq.gz \ SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15 ``` ### Q&A - 疑惑 - 什麼情況 read 變成 unpaired 好好的 read 、有品質的 read 為何變成 unpaired？ - 當 read 變成 unpaired，就放在 unpaired 檔案裡 - 接著，unpaired 的 read 還能用？為何不能用？ - paired-end [![](https://hackmd.io/_uploads/r1NuRcLw3.png)](https://hackmd.io/_uploads/r1NuRcLw3.png) - sliding window 情況，編號會做排序嗎？ - 僅是用於 Illumina 定序? - ### [What's the unpaired reads in the raw-data](https://www.biostars.org/p/192788/) #unpaired reads - When a trimming program trims data one of the reads (assuming you are using paired-end data) may become short and fail a criteria you have set (e.g. minimum length 25 bp). At the point that read is removed by the trimming program. Let us say that was from R2 file. A trimming program should remove the corresponding R1 read from R1-file (even though that may have passed) when R2 read is dropped. If it does not do that you are left with an unpaired read in R1 file. - Aligners expect reads to be in the same order in R1/R2 files. If they are not then you can get strange results (e.g. discordant mapping). Presence of unpaired reads in main sequence files (As @mastal points out trimmomatic should collect them in separate files) generally signifies improper use of a trimming program (or using a trimmer that is not paired-end aware). If that happens repair.sh tool from BBMap suite can be used to remove those unpaired reads and bring R1/R2 files back in sync. - ### [Trimmomatic: What is the difference between paired and unpaired output files in paired-end mode](https://www.biostars.org/p/287477/) - After trimming with trimmomatic, some reads can be discared, and their mate become "unpaired". Usually, only a small fraction of the reads become unpaired, and they can be ignored. Alternatively, you can map them as single reads along with the remaining paired reads. - ### [ChatGPT] R1 & R2 的序列差異在那？ R1和R2是成對定序中的兩個序列，它們在以下方面有所差異： - **序列的順序**：R1和R2的序列順序不同。R1代表第一條序列，通常是在5'端的序列，而R2代表第二條序列，通常是在3'端的序列。 - **定序方向**：R1和R2的序列可能具有相同或相反的定序方向，具體取決於定序平台和實驗設計。例如，在某些平台上，R1和R2的序列可能都是從5'端到3'端的正向序列，而在其他平台上，R1和R2的序列可能分別是正向和反向序列。 - **配對關係**：R1和R2的序列是成對出現的，它們具有相同的樣本標識符（通常是序列標識符的一部分），並且它們通過定序過程中的插入片段相互配對。通過這種配對關係，可以對成對序列進行比對、組裝和分析。總之，R1和R2是成對定序中兩個不同的序列，它們在順序、定序方向和配對關係等方面有所差異，但通常通過樣本標識符和插入片段相互關聯。 #forward, reverse, orientation, position - ### [R1 and R2 in sequencing data](https://www.biostars.org/p/214944/) - Q: I find myself a little confused about the diractionalityof R1 and R2 reads. The R2 is always the opposite direction of R1, but R1 can be either reverse or forward strand? - A: All sequencing always occurs in 5'-->3' orientation. Keep that in mind as you think about R1/R2 orientation. - ### [Forward And Reverse Strand Conventions](https://www.biostars.org/p/3423/) - A gene can live on a DNA strand in one of two orientations. The gene is said to have a coding strand (also known as its sense strand), and a template strand (also known as its antisense strand). For 50% of genes, its coding strand will correspond to the chromosome's forward strand, and for the other 50% it will correspond to the reverse strand. - https://en.wikipedia.org/wiki/File:Simple_transcription_elongation1.svg ![](https://hackmd.io/_uploads/Hyf5zhHP3.png) - ### [Illumina paired-end reads R1 and R2 mixed together?](https://www.biostars.org/p/244732/) - ### [Orientation of reads in .fastq files from RNA-Seq](https://www.biostars.org/p/343245/) > Since the RNA-Seq was done using stranded library kit, it was under my assumption that R1.fastq contains forward reads and R2.fastq contains reverse reads. > That assumption was wrong, as you can see with your own data. In Truseq standed kits, read 1 is in the reverse orientation. - ### [Forward and reverse reads in paired-end sequencing](https://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/) - ### [Mixed orientation of reads in R1 and R2 files #5](https://github.com/claraqin/neonMicrobe/issues/5) ## Phred Quality Score $Q = −10\log_{10}{P}$ $P = 10^{\frac{-Q}{10}}$ | ASCII | ASCII Simbol | Phred Quality Score (=Q) | Error Rate(=P) | accuracy | |-------|--------------| ------------------- | ---------- | -------- | | 33 | `!` | $$0 = -10\times0$$ | $$10^0 = 1$$ | 0% | | 34 | `"` | $$1 = -10\times(-0.1)$$ | $$10^{-0.1} = 0.7943$$ | 20.57% | | 35 | `#` | $$2 = -10\times(-0.2)$$ | $$10^{-0.2} = 0.6310$$ | 36.90% | | 36 | `$` | $$3 = -10\times(-0.3)$$ | $$10^{-0.3} = 0.5012$$ | 49.88% | | 37 | `%` | $$4 = -10\times(-0.4)$$ | $$10^{-0.4} = 0.3981$$ | 60.19% | | 38 | `&` | $$5 = -10\times(-0.5)$$ | $$10^{-0.5} = 0.3162$$ | 68.38% | | 43 | `+` | $$10 = -10\times(-1)$$ | $$10^{-1} = 0.1$$ | 90% | | 53 | `5` | $$20 = -10\times(-2)$$ | $$10^{-2} = 0.01$$ | 99% | | 73 | `I` | $$40 = -10\times(-4)$$ | $$10^{-4} = 0.0001$$ | 99.99% | | 126 | `~` | $$93 = -10\times(-9.3)$$ | $$10^{-9.3} = 5.0119\times10^{-10}$$ = 0.00000000050119 | 99.99999995% | - [Quality score encoding](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides-plain.html) [![](https://hackmd.io/_uploads/SkdfdiLu3.png)](https://hackmd.io/_uploads/SkdfdiLu3.png) - Speaker Notes - Earlier saw sequence + quality were same length of text - Each score encoded as a single character - Uses ASCII (common computer encoding, maps numbers to characters) - Each technology has a different mapping - Mostly see illumina 1.8 these days, others only if historical ## [Illumina Pattern Flow Cell 平台定序注意事項](https://yourgene.pixnet.net/blog/post/119028609) > 所謂「鹼基比例分布極端不平衡」指的是四種鹼基的其中一種含量比例過多或過少；特別若是在前25個鹼基的定序過程中出現了這樣的序列特徵時，假如不為樣品另行加入「平衡文庫」，定序品質會變得非常糟糕（見圖二）。 > > ![](https://i.imgur.com/K9nkn6F.png) > - 長序列溢出到其他定序晶片(flowcell) - 短序列擴增快，短序列數量比長序列多，導致長序列資料量過低 - 四種鹼基的其中一種，含量比例過多或過少，會導致定序品質不佳 ## [什麼是GC含量？](http://www.detaibio.com/tools/gc-content.html) - ### 什麼是GC含量？ GC含量又稱為G+C比值或GC比值， - ### GC含量計算公式為： ```[（G+C的總數量）/（A+T+C+G的總數量）] * 100%``` ## 議題探討 - 不同平台的定序結果與品質，可能不同 ![](https://i1.read01.com/SIG=3v4ebn9/30497556626e78544f69.jpg) ([圖片來源：因PCR擴增引入的假陽性突變](https://read01.com/zh-tw/E8EyRa6.html#.XTUMSXUzbCL)) - [降低定序錯誤率的實驗方法](https://yourgene.pixnet.net/blog/post/114235940) - 目標：正確地偵測到頻率很低的變異點位(variant) - 會在樣品文庫製備的步驟。使用PCR polymerase所造成的錯誤或是鹼基受到損害，如氧化或是脫胺基作用。 - 2012年： - Illumina 定序平台之隨機 **錯誤率** 基準值在 0.1~0.5% 之間 - 其他定序平台則在 >1% - 透過化學配方與反應酵素，來降低錯誤率 - 最好的定序技術：Duplex Sequencing - 錯誤頻率小於 5x10E-8 - 以人類 31 億個鹼基對來說，錯誤個數才 150 個 ![](https://pic.pimg.tw/yourgene/1458629162-3519233994_n.png) ## 定序深度 > 平均覆蓋深度(average coverage), 飽和度測試、飽和度分析 - ### [基因總長度：GRCh38.p13](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39) - Total sequence length: 3,099,706,404 - Total ungapped length: 2,948,583,725 - ### [定序深度](http://toolsbiotech.blog.fc2.com/blog-entry-19.html) (圖爾思生物科技 / 次世代定序知識櫥窗) - 定義：定序得到的鹼基總量(bp)與基因組大小(genome)的比值 - 範例： - 對一個物種的全基因組，進行重定序(Re-sequencing) - 基因組大小約為5G，定序獲得100G的數據量 - 則定序深度為20x - 定序品質：==臨床等級(clinical grade)建議使用30x以上== > When sequencing is 30x, this means it is clinical-grade. The term 'clinical-grade' indicates that the genome data is very reliable and, because of this, can be used for health-related purposes. ([see more](https://sequencing.com/what-30x-and-04x-whole-genome-sequencing)) - ### [華大基因：生命大數據「算力引擎」再發力 30X WGS信息分析僅需15分鐘](https://kknews.cc/tech/ljb3y5b.html) > 一個人類全基因組有30億鹼基，==一個30X的WGS測序數據量大約在100G==(???)。 - ### [基因组的那些事儿--基础](https://www.jianshu.com/p/bf871522ea20) [![](https://i.imgur.com/wnKiE0F.png)](https://i.imgur.com/wnKiE0F.png) [![](https://i.imgur.com/bBTlD0U.png)](https://i.imgur.com/bBTlD0U.png) ### 覆蓋度30X > ==**這個30的標準怎麼定的？為什麼不是20X或者更高的40、50X？**== > > 有研究做過飽和度分析～看看5~60X的模擬梯度對尋找遺傳變異的能力差異大小，結果發現平均深度達到30X的時候，可以覆蓋基因組的95%；另外測序深度越高，價格越貴，30X的高性價比足夠挖掘到一定量的遺傳變異Sequencing depth and coverage: key considerations in genomic analyses - Nature Reviews (2014) > ### 測序策略PE150: > 也就是標準的雙端測序模式（Paired End），目前雙端比單端價格還要便宜，而且一條序列這邊測一次，另一邊測一次，更準確。所以一般分析基本也沒有用單端的了。 150就是這邊測150bp，那邊測150bp。 > > 【當然打斷的片段一般是大於300bp的，所以每個reads中間會有一部分測不到，這就對了！畢竟reads是隨機打斷，也就是打斷的位置不同。 > > 雖然這一條reads的中間部分區域測不到，但是另外的reads就能測到。 > > 如果說，一條reads長度200卻採用雙端150bp，那麼中間就會有重疊區域，被測了兩遍，這在高通量測序中是非常浪費資源的，每次測都是要花錢的啊！】 - ### [基因檢測解讀秘籍十二：看了這麼多報告，你關注過數據質量嗎？](https://kknews.cc/news/a34r4lg.html) - 需要關注的主要指標有： - ### 測序數據量 & 平均測序深度 ![](https://i.imgur.com/DIS9cOO.png) - 人類基因組為3G大小，這時需要約100Gb數據 - ### 捕獲效率 ![](https://i.imgur.com/rOVe0Hm.png) - ### PCR重複率 ![](https://i.imgur.com/a4Kv7xn.png) - ### \> 20X reads比例 ![](https://i.imgur.com/246eG6O.png) - ### 測序質量 - ### [How to calculate the coverage for a NGS experiment](https://www.ecseq.com/support/ngs/how-to-calculate-the-coverage-for-a-sequencing-experiment) > 如何計算平均深度 > ![](https://i.imgur.com/vqpgYcK.png) > 六條總長度 = 28+32+36+30+31+31 = 188 > 平均覆蓋深度 = 188/112 = 1.68 > > ![](https://i.imgur.com/qwzCTKE.png) - ### [人体30亿个碱基对的基因组，容量有多少兆？](https://www.zhihu.com/question/21672220) > 3 ==我们一般说的人类基因组大小3G，是指对每个染色体长度总和的描述==，注意是长度；这也就解释了为什么你说的是60亿个碱基，双链， 3G X 2 = 6G。 - ### Mb (megabits? mega bases?), Gb (gigabits? giga bases?) 分不清？ - Mb: mega bases - Gb: giga bases - 範例 - https://www.illumina.com/techniques/sequencing/dna-sequencing/whole-genome-sequencing.html [![](https://i.imgur.com/SV137WR.png)](https://i.imgur.com/SV137WR.png) - **原生 fastq 檔案會有多大？**(詳細的分析過程) ``` --------------------- 全基因體有 30 億個鹼基 --------------------- 生物資訊在描述基因體時，會用 3G 描述全基因體也就是指 30 億個鹼基用的 G 是指 10^9 (科學記號 1e9) 有時也會用 Gb 並不是指 G bits(位元) 也不是指 G byte(位元組) 而是 G base (鹼基) 所以，看到 G 的用法都是指「鹼基的數量」 ------------------------------------------ 原生 fastq 檔案會有多大？ (不壓縮) - Part1 ------------------------------------------ fastq 第一行：定序資訊 fastq 第二行：序列ACTG符號 fastq 第三行：+ (加號) fastq 第四行：對應序列的品質分數以 WGS-LIS-AI018A 為例子 fastq 第一行： "@A00355:26:H3WFJDSXX:1:1101:1081:1047 2:N:0:GAATTCGT+ATAGAGGC" 長度為 61 字元 fastq 第二行：151 字元 fastq 第三行：1 字元，只有加號一個字元 fastq 第四行：151 字元（同第二行） 151 鹼基，實際上會耗掉 61+151+1+151+4(換行字元) = 368 下指令可實際測得： $ head -n 4 samples/WGS-LIS-AI018A/WGS-LIS-AI018A_R1.fastq | wc 4 5 368 150 或 151 通常是指 PE150 定序策略 (忘了 1 鹼基的特殊用途) 實際上是定序 150 個鹼基然後需要 368 字元儲存這 150 個鹼基 ------------------------------------------ 原生 fastq 檔案會有多大？ (不壓縮) - Part2 ------------------------------------------ 全基因體長度 L=3,099,706,404 已知每條序列，定序到長度為 150 bp 這邊是說，超音波打散的序列，假設可能是 252bp, 301bp, 350bp, 367bp, 400bp (長度不定) 單邊可測得 150bp (中間無法測得) 30x 表示平均定序深度是 30 倍每個鹼基，平均可以測得 30 次 L x 30倍 / 150 = 3099706404 x 30 / 150 = 619941280.8 ≒ 619941281 (有 6 億多條序列資訊) 619941281 x 368 bytes / 1024 / 1024 / 1024 = 212.47 G bytes 30 倍定序深度的檔案大小理想會是 212.47 G bytes 若換算成定序深度為 44.48 倍 212.47 / 30 x 44.48 = 315.02 G bytes ``` - **原生 fastq 檔案會有多大？**(快速算法) ``` 查看 fastq 的第一條序列資訊的全部大小得知：每 151 鹼基，需要用 368 字元儲存原生 fastq 檔案大小： 3G bp x 30倍 / 151bp x 368 bytes = 219.3 G bytes fastq.gz 檔案大小：一般都是用 2.7 倍評估 219.3/2.7 = 81.2 G bytes ``` - [總結] ==fastq 的檔案大小，受到下列的因素影響：== - head info (fastq 的每段序列資訊的第一行) - 定序儀機台的名稱、樣本的名稱 - 定序策略 (fastq 的每段序列資訊的第2和4行)： - 第二代定序 PE150 (150b) - 第三代定序(1kb~10kb) - 參考基因體的大小 - 粒線體([NC_012920](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_012920))已含在內 - 有無含外部序列： - [NC_007605](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_007605) (人疱疹病毒的序列) (大小：171823 bp = 171.8 Kbp) - hs37d5 (decoy序列) - [hs37d5 (GRCh37/b37 + decoy sequences)](https://www.pnas.org/content/pnas/early/2017/10/23/1707609114.full.pdf) - b37 有含 NC_007605 和 hs37d5