Atomic Notes

@atomic-notes

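This is a small space that Eric Juo (小卓) started on a whim. Its purpose is to share knowledge and experience picked up in everyday life. It is a beginner-friendly space: notes do not have to be complete. The only hope is that what has been learned gets saved somewhere and can help others learn too.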

Public team

Joined on May 27, 2020

  • RNA-Seq analysis Prepared by: Eric Juo Date: 2022/04/15 slide: https://hackmd.io/@atomic-notes/Sk-TQULN9 What is RNA-Seq? RNA sequencing especially mRNA
  • RNA-Seq stands for RNA sequencing, a technology for studying RNA species. In a cell, the majority of RNA is ribosomal RNA (rRNA), which accounts for about 90% of total RNA. Conversely, messenger RNA (mRNA) accounts for only 1-2% of total RNA. The central dogma of biology tells us that DNA is transcribed into mRNA, and mRNA is then translated into protein to carry out biological functions. mRNA is the only RNA species known to encode protein sequences. Measuring mRNA levels lets us understand gene expression levels and infer protein expression levels. Therefore, most of the time we are only interested in mRNA levels and want to deplete rRNA from the total RNA.
Preprocessing RNA samples before sequencing includes total RNA extraction, mRNA enrichment, fragmentation, first-strand cDNA synthesis, RNA strand digestion, second-strand cDNA synthesis, 3' end repair, adenylation and adapter ligation. In the total RNA extraction step, the extracted RNA should show no signs of degradation. In mRNA enrichment, poly-T beads or poly-T columns are used to separate mRNA from other RNA species, since only mRNA carries a poly-A tail. mRNAs are fragmented chemically or by sonication into a size range that fits the read-length capacity of the sequencer. The mRNA fragments are primed with random hexamers and reverse-transcribed into first-strand cDNA, followed by RNA strand digestion and second-strand cDNA synthesis. To make the library strand-specific, dTTP is replaced with dUTP in the second-strand cDNA synthesis step, so that the incorporated uracils mark the second strand. Because of how the polymerase synthesizes the strands, the double-stranded cDNA product can be missing several bases at the 3' end of each strand. The 3' ends are repaired by a DNA end-repair enzyme, and a single adenine is then added to each 3' end. Y-shaped adapters with a 3'-T overhang are ligated to the 3'-A overhang of the double-stranded cDNA. To reduce the complexity of short-read assembly, the dUTP-marked cDNA strands are digested using uracil-N-glycosylase (UNG).
The single-stranded cDNAs are then subjected to sequencing. In the sequencing step, a primer targeting the adapter sequence anneals to one end of the cDNA. Fluorescently labelled nucleotides (A, C, G and T) are incorporated one base per cycle, and each cycle is photographed. The stack of images is analyzed by computer to determine which nucleotide was added to the growing strand in each cycle. For paired-end sequencing, a second primer targeting the adapter on the other end is added and the fragment is sequenced again. The result is two separate files, one of forward reads and one of reverse reads.
Raw reads obtained from the sequencer should undergo quality control before downstream analysis. Raw reads are normally reported in FASTQ format, which stores 4 lines of information for each read. The first line is the read name, the second line is the sequence, the third line is a "+" sign, and the fourth line is a string of ASCII characters encoding the quality scores. For Illumina sequencers, the quality score is reported in the Phred+33 scheme. To obtain the numerical quality score, each ASCII character is converted to its character code and 33 is subtracted. The quality score represents how confident we can be that a base was called correctly. A quality score of 40 means there is a 1 in 10,000 chance that the base was called incorrectly. Thus, a higher quality score indicates a higher chance that the base was called correctly. Usually, we want to trim off bases with a quality score lower than 20. Low-quality base calls usually occur at the 3' end of reads, since the polymerase binds the template less reliably in later cycles and is more prone to add the wrong nucleotide.
FastQC is a commonly used tool that gives an overview of read quality. It reports per base quality, per sequence quality, per base nucleotide content, per sequence GC content, number of duplicated sequences, overrepresented sequences and adapter content metrics. The per base quality metric shows the distribution of quality scores as a box plot for each base position. The upper boundary of the yellow box is the third quartile of the quality scores, and the lower boundary is the first quartile. The red line in the middle of the box is the median quality score. The per sequence quality metric averages the base qualities of each read and shows the distribution of per-read quality. For good-quality reads, the distribution should be skewed toward the high-quality side. The per base nucleotide content metric shows the frequency of each nucleotide at each base position. In principle, the RNA has been randomly fragmented, so the frequency of each nucleotide should be even across the whole read length. RNA-Seq data is an exception, however: research has found that random hexamer priming in the cDNA synthesis step is somewhat selective for certain fragment sequences, which results in non-random nucleotide content at the 5' end of reads. These bases are still real bases from the mRNAs, not artifacts, so it is fine to include them in short-read assembly.
The per sequence GC content metric calculates the GC content of each read and reports the distribution of per-read GC content. GC content is like a fingerprint of a species: it is roughly constant for a given species. This also applies to the GC content of reads, so the distribution of per-read GC content should peak at a value equal to that species' GC content. If the peak shifts away from the expected GC content, this can stem from contamination of the read dataset by foreign species. The number of duplicated sequences metric shows how many reads are exact matches of each other. In RNA-Seq datasets, highly expressed transcripts are usually sequenced repeatedly, resulting in many duplicated reads. The overrepresented sequences metric reports sequences that appear repeatedly and account for more than 0.1% of total reads. Adapters are the sequences most likely to be captured by this metric. The adapter content metric searches the reads for known adapter sequences and reports the hits.
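The Phred+33 decoding described in this note can be sketched in Python. This is only a minimal illustration and not part of the original note: the function names `decode_phred33` and `error_probability` and the example quality string are assumptions made up for demonstration.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) into a list of integer scores."""
    # Each character's ASCII code minus 33 gives the quality score
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base with quality score q was called incorrectly."""
    return 10 ** (-q / 10)


# Hypothetical quality line (the 4th line of a FASTQ record)
quality_line = "II5+!"
scores = decode_phred33(quality_line)
print(scores)

# Flag bases below the commonly used trimming threshold of 20
for position, q in enumerate(scores, start=1):
    flag = "low" if q < 20 else "ok"
    print(position, q, error_probability(q), flag)
```

In practice a dedicated trimming tool would do this work; the sketch only shows where the numbers come from.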
  • Fundamental Trigonometric Identities Power Reducing Formulas Fundamental Trigonometric Identities Reciprocal identities $\sec\theta=\frac{1}{\cos\theta}$ $\csc\theta=\frac{1}{\sin\theta}$ $\cot\theta=\frac{1}{\tan\theta}$ $\cos\theta=\frac{1}{\sec\theta}$
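  • The FASTQ format is a data format for storing nucleic acid sequences and their corresponding base qualities. Sequence quality Sequence quality is a value used to assess the reliability of next-generation sequencing (NGS) results. It is given by: $$Q = -10\log_{10}P$$ where Q is the sequence (base) quality and P is the probability that the base call is wrong.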
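As a worked example of the quality formula, taking the logarithm as base 10 (the usual Phred convention) and an error probability of 0.001 chosen only for illustration:

$$Q = -10\log_{10}(0.001) = -10 \times (-3) = 30$$

Rearranging gives $P = 10^{-Q/10}$, so a quality of $Q = 20$ corresponds to $P = 10^{-2} = 0.01$, that is, one expected error per 100 base calls.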
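  • Speaker: Associate Professor 林宜靜 Characteristics of in vitro diagnostic medical devices Performance evaluation of in vitro diagnostic (IVD) medical devices falls into two main categories, each subdivided into individual test items: Clinical performance Sensitivity Specificity Positive predictive value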
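  • Organizer: 財團法人醫藥工業技術發展中心 (Medical and Pharmaceutical Industry Technology and Development Center) Date: 2020.07.28 Disclaimer: The answers below are simply what I considered most suitable after taking the course; they are not guaranteed to be entirely correct. Q1. Positive predictive value is not affected by prevalence. (A) True (B) False Ans. (B) False. The positive predictive value (PPV) is affected by prevalence. PPV is defined as the probability that a person who tests positive actually has the disease, expressed as the following formula: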
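How PPV depends on prevalence can be illustrated with a small Python sketch based on Bayes' theorem. It is not taken from the course material: the function name `ppv` and the sensitivity, specificity and prevalence values are hypothetical, chosen only to show the trend.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(disease | positive test)."""
    true_pos = sensitivity * prevalence               # diseased and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    return true_pos / (true_pos + false_pos)


# The same hypothetical test (95% sensitivity, 95% specificity)
# evaluated at different prevalences
for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(f"prevalence={prevalence:<6} PPV={ppv(0.95, 0.95, prevalence):.3f}")
```

With the test characteristics held fixed, the printed PPV rises as prevalence rises, which is why answer (B) holds.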
  • Exercise 1 First download herpesvirus_genome.json A) Find the frequency of each amino acid in the herpesvirus's proteome. import os import json
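One possible approach to the counting part of this exercise is sketched below. The structure of herpesvirus_genome.json is not shown in this preview, so the sketch does not read the file; it assumes the protein sequences have already been extracted into a list of strings. The function name `amino_acid_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter


def amino_acid_frequencies(protein_seqs):
    """Return the relative frequency of each amino acid across all sequences."""
    counts = Counter()
    for seq in protein_seqs:
        counts.update(seq)  # count every residue in this protein
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical toy proteome, NOT taken from herpesvirus_genome.json
toy_proteome = ["MKV", "MAK"]
freqs = amino_acid_frequencies(toy_proteome)
for aa, f in sorted(freqs.items()):
    print(aa, round(f, 3))
```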
  • Exercise 1 The file orf_exons_chr17.txt contains a list of genes on chromosome 17 and their ORF exon sequences. ORF exons are the parts of a gene's exons that contribute to its ORF/CDS (i.e. without the UTRs). This means that exons contained entirely in the UTRs (which therefore contribute nothing to the ORF) are not included in the list. A) Parse the file into the following data structure: { 'gene_symbol1': ['orf_exon_seq1', 'orf_exon_seq2', 'orf_exon_seq3', …], 'gene_symbol2': ['orf_exon_seq1', 'orf_exon_seq2', 'orf_exon_seq3', …], 'gene_symbol3': ['orf_exon_seq1', 'orf_exon_seq2', 'orf_exon_seq3', …], … } import os
  • Exercise 1 Calculate the GC content of the following DNA sequence: ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGTAGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAACCCTCTATTGTGTGCATCAAAGGATAGAGATAAAAGACACCAAGGAAGCT dna_seq = """ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGTAGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAACCCTCTATTGTGTGCATCAAAGGATAGAGATAAAAGACACCAAGGAAGCT""" gc = 0 for nt in dna_seq.strip(): if nt == 'G' or nt == 'C': gc += 1
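The counting loop in this preview is cut off before the final division. A self-contained variant might look like the sketch below. It is not the note's own solution: the function name `gc_content` and the short test sequence are assumptions for illustration, and the result is returned as a fraction rather than a percentage.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty sequence)."""
    seq = seq.strip().upper()
    if not seq:
        return 0.0
    gc = sum(1 for nt in seq if nt in "GC")  # count G and C bases
    return gc / len(seq)


# Short hypothetical sequence for illustration
print(gc_content("ATGGGTGCGA"))
```

Calling `gc_content(dna_seq)` on the exercise sequence would work the same way.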
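  • Since late April 2020, decapod iridescent virus has been appearing in shrimp farms across China's coastal provinces, causing mass deaths of shrimp and crabs. In June 2020, the virus was detected in shrimp farms in Hsinchu County, New Taipei City, Yilan County and Yunlin County in Taiwan[^2]. The outbreak can be traced back to 2014. Researchers found, and testing confirmed, that it is a highly contagious virus that kills not only common shrimp species but also lobsters and crabs. Virus Virus classification Family Iridoviridae Genus Decapodiridovirus Decapod iridescent virus 1 (DIV1)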
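  • This is a free online course offered by MIT OpenCourseWare. Click here for the course link! Schrödinger Equation In 1926, Erwin Schrödinger wrote down the following equation to describe the behavior of particles in the microscopic world. Schrödinger equation: $$\hat{H}\psi = E\psi$$ $\hat{H}$ = Hamiltonian operator, the expression that sums the particle's kinetic energy (K.E.) and potential energy (P.E.). $\psi$ = wavefunction, which describes the quantum state of the particle. $E$ = total energy of the particle. For an electron, it represents the binding energy between the electron and the proton.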
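  • In HackMD notes, MathJax syntax can be used to add mathematical expressions. How to write Inline math expressions Place one $ sign before and one after a math expression to use it within a paragraph of text. The ideal gas equation can be written as $PV=nRT$; from the equation, gas pressure is proportional to the number of particles and to the temperature.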
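  • This is a free online course offered by MIT OpenCourseWare. Click here for the course link! Ever since graduating from a life sciences department and entering the workforce, I have had a feeling of powerlessness: at developing diagnostic reagents I cannot match chemistry graduates, at developing drugs I cannot match pharmacy graduates, and at developing software I cannot match computer science graduates. You only regret not having studied more when you actually need the knowledge... so I have no choice but to go back and cram basic chemistry. Atoms JJ Thomson's Plum Pudding Model In 1897, J.J. Thomson used cathode ray experiments to discover the electron and measure its charge-to-mass ratio. The experimental setup is as follows: ![cathode rays](https://i.imgur.com/P4KS6dq.png =500x180)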
  • matplotlib is a widely used Python library It can produce bar charts, line charts and scatter plots Install matplotlib $ pip install matplotlib Draw a line chart from matplotlib import pyplot as plt # [1950, 1960,..., 2010]
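  • Classes Like a template Used to create objects of the same kind Related methods can be reused Python conventionally names classes in CamelCase Create a class class Animal: # Use the __init__ function to set what happens at the moment the class creates an object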
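The preview stops right after the `class Animal:` line, so the following is only a sketch of how such a class might continue, not the note's actual code. The attributes `name` and `sound` and the method `speak` are hypothetical.

```python
class Animal:
    def __init__(self, name, sound):
        # __init__ runs at the moment an object is created from the class
        self.name = name
        self.sound = sound

    def speak(self):
        """Return a sentence describing the sound this animal makes."""
        return f"{self.name} says {self.sound}"


# Two objects created from the same template reuse the same method
dog = Animal("Dog", "woof")
cat = Animal("Cat", "meow")
print(dog.speak())
print(cat.speak())
```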
  • Bank of Taiwan exchange rates First, in the terminal, run the command that installs the pyquery module on your computer !pip install pyquery Simple web scraper code from pyquery import PyQuery url = 'https://rate.bot.com.tw/xrt?Lang=zh-TW'
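  • Besides the built-in functions, frequently used functionality can also be wrapped into a function of your own Compute the area of a rectangle def rect(width, height): ''' This function will return the area of a rectangle ''' area = width * height return area area = rect(2, 3)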
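  • if-elif-else Three kinds of conditions # Import the random module to pick a weather type at random from random import choice weathers = ["sunny", "cloudy", "drizzle", "gale", "rainstorm", "snow", "thunder and lightning"] w = choice(weathers) if w == 'sunny' or w == 'cloudy': print('Go for a run outdoors')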
  • for loop Declare the loop type (for) Declare an iterator variable (it can be named freely) Declare the sequence the loop runs over (list, dict...) weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] for wk in weekdays: # Print one day at a time print(wk)
  • List A group of data items Ordered nums = [2, 4, 5, 8, 10] # A list can also store data of different types tw_building = [101, 'Taiwan'] print(type(tw_building)) # list