$ pip install matplotlib
from matplotlib import pyplot as plt
# [1950, 1960, ..., 2010]
years = range(1950, 2011, 10)
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# Create a line chart with years on the x-axis and GDP on the y-axis
# Shorthand also works: plt.plot(years, gdp, 'b-', marker='o')
plt.plot(years, gdp, color='blue', marker='o', linestyle='solid')
# Add a chart title
plt.title('Nominal GDP')
# Label the x-axis
plt.xlabel('Year')
# Label the y-axis
plt.ylabel('Billions of $')
# You can also save the chart to an image file with plt.savefig(fname)
plt.show()
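Saving the chart to a file instead of showing it might look like the sketch below. The `Agg` backend, the output path, and the `dpi` and `bbox_inches` settings are illustrative choices that are not part of the original example.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display window is needed
from matplotlib import pyplot as plt

years = range(1950, 2011, 10)
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

plt.plot(years, gdp, color='blue', marker='o', linestyle='solid')
plt.title('Nominal GDP')
plt.xlabel('Year')
plt.ylabel('Billions of $')

# Hypothetical output path; any writable location works.
out_path = os.path.join(tempfile.gettempdir(), 'nominal_gdp.png')
# The file extension selects the image format; dpi and bbox_inches are optional.
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close()  # release the figure once it has been written
```

Call `plt.savefig` before `plt.show()` if you use both, since the figure may be cleared once the window is closed.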
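Chinese strings passed to matplotlib render as garbled characters because the module ships with no font that supports Chinese. For a fix, see this article: https://medium.com/marketingdatascience/解決python-3-matplotlib與seaborn視覺化套件中文顯示問題-f7b3773a889b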
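One common workaround is to point matplotlib at a font that is already installed on the system. This is a minimal sketch and not necessarily the approach the linked article takes. The font names are assumptions, so replace them with CJK-capable fonts that exist on your machine.

```python
from matplotlib import pyplot as plt

# Assumed font names: matplotlib tries them in order and uses the first one it finds.
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei', 'Noto Sans CJK TC', 'DejaVu Sans']
# Keep the minus sign rendering correctly after switching fonts.
plt.rcParams['axes.unicode_minus'] = False
```

Set these before creating any text elements so that titles and axis labels pick up the new font.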
from matplotlib import pyplot as plt
# [1, 2, 4, 8, 16, 32, 64, 128, 256]
variance = [2**n for n in range(9)]
# Bias falls as variance rises: [256, 128, ..., 2, 1]
bias_squared = sorted(variance, reverse=True)
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = range(len(variance))
# Call plt.plot several times to draw several series on the same line chart
plt.plot(xs, variance, 'g-', label='variance') # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias_squared') # red dash-dot line
plt.plot(xs, total_error, 'b:', label='total error') # blue dotted line
# Place the legend at the top center of the chart (loc=9)
plt.legend(loc=9)
plt.xlabel('model complexity')
# Remove the tick marks from the x-axis
plt.xticks([])
plt.title('The Bias-Variance Tradeoff')
plt.show()
from matplotlib import pyplot as plt
movies = ['Star Wars', 'The Avenger', 'Kingdom', 'Lord of the rings', 'Game of Thrones']
rates = [8, 7, 4, 8, 9]
plt.bar(movies, rates, color='blue')
plt.title('My Favorite Movies')
plt.xlabel('Movie')
plt.ylabel('Rate')
plt.show()
from matplotlib import pyplot as plt
from collections import Counter
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
# Bucket the grades by decile: 83 -> 80, 67 -> 60
# 100 belongs in the 90 bucket, hence min(n // 10 * 10, 90)
# Counter counts how often each value occurs and returns a dict-like object
# histogram = {0: 2, 60: 1, ....}
histogram = Counter(min(n // 10 * 10, 90) for n in grades)
# Shift the x coordinates by 5 so each bar sits in its interval 0~10, 10~20, ...
# plt.bar uses a default bar width of 0.8; widen it to 10 to span the whole interval
# Give the bars a black edge (edgecolor='black') so neighbors are clearly separated
plt.bar([x + 5 for x in histogram.keys()], list(histogram.values()), width=10, edgecolor='black')
# Label the x-axis ticks 0, 10, 20, ..., 100
plt.xticks([10 * n for n in range(11)])
# Set the x-axis range to -5 ~ 105 and the y-axis range to 0 ~ 5
plt.axis([-5, 105, 0, 5])
plt.xlabel('Decile')
plt.ylabel('Number of Students')
plt.title('Distribution of Exam 1 Grades')
plt.show()
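The bucketing step can be checked on its own, without plotting. The sketch below reuses the grades from the example; the helper name `decile` and the text rendering are illustrative additions.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]


def decile(grade):
    """Map a grade to the lower edge of its 10-point bucket, capping 100 at 90."""
    return min(grade // 10 * 10, 90)


histogram = Counter(decile(g) for g in grades)

# Print a rough text histogram, one row per non-empty bucket.
for bucket in sorted(histogram):
    print(f'{bucket:>3}-{bucket + 10:<3} {"#" * histogram[bucket]}')
```

Buckets with no grades never appear in the `Counter`, which is why the bar chart shows gaps between 10 and 60.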
from matplotlib import pyplot as plt
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# Write a label next to each data point
for label, f, m in zip(labels, friends, minutes):
    # By default xytext equals xy, so the label would sit on top of the point;
    # here it is set to (5, -5) instead
    # textcoords='offset points' treats xytext as an offset from the data point, measured in points
    plt.annotate(label, xy=(f, m), xytext=(5, -5), textcoords='offset points')
plt.title('Daily Minutes vs. Number of Friends')
plt.xlabel('Number of Friends')
plt.ylabel('Daily minutes spent on the site')
plt.show()
Lectures and hands-on tutorial
Apr 24, 2022
RNA-Seq stands for RNA sequencing, a technology for studying RNA species. In a cell, the majority of RNA is ribosomal RNA (rRNA), which accounts for about 90% of total RNA. Conversely, messenger RNA (mRNA) accounts for only 1-2% of total RNA. The central dogma of biology tells us that DNA is transcribed into mRNA, and mRNA is then translated into protein to carry out biological functions. mRNA is the only RNA species known to encode protein sequences. Studying mRNA levels lets us measure gene expression and infer protein expression. Therefore, most of the time we are interested only in mRNA and want to deplete rRNA from the total RNA.
Preparing RNA samples for sequencing includes total RNA extraction, mRNA enrichment, fragmentation, first-strand cDNA synthesis, RNA strand digestion, second-strand cDNA synthesis, 3' end repair, adenylation, and adapter ligation. In the total RNA extraction step, the extracted RNA should show no signs of degradation. In mRNA enrichment, poly-T beads or poly-T columns separate mRNA from other RNA species, since only mRNA carries poly-A tails. mRNAs are fragmented chemically or by sonication into a size range that suits the sequencer. The fragments are primed with random hexamers and reverse-transcribed into first-strand cDNA, followed by RNA strand digestion and second-strand cDNA synthesis. For strand-specific RNA-Seq, dTTP is replaced with dUTP in the second-strand synthesis step, which marks the second strand. Owing to the nature of polymerase synthesis, the double-stranded cDNA product loses several bases at the 3' end of each strand. The 3' ends are repaired by a DNA end-repair enzyme, and an adenine is then added. Y-shaped adapters with a 3'-T overhang are ligated to the 3'-A overhang of the double-stranded cDNA. To reduce complexity in short-read assembly, the dUTP-marked cDNA strands are digested with uracil-N-glycosylase (UNG).
The single-stranded cDNAs are then sequenced. In the sequencing step, a primer targeting the adapter sequence anneals to one end of the cDNA. Fluorescence-labelled ATPs, CTPs, GTPs, or TTPs are added one at a time and photographed. The layers of photographs are analyzed by computer to determine which nucleotide was added to the growing strand in each reaction. For paired-end sequencing, a second primer targeting the adapter on the other end is added and the fragment is sequenced again. The result is two separate files, one from the forward strand and one from the reverse strand.
Raw reads from the sequencer should undergo quality control before downstream analysis. Raw read data are normally reported in FASTQ format, which stores four lines of information for each read: the first line is the read name, the second is the sequence, the third is a "+" sign, and the fourth is a string of ASCII characters encoding the quality scores. For Illumina sequencers, the quality score is reported in the Phred+33 scheme: to obtain the numerical quality score, convert each ASCII character to its character code and subtract 33. The quality score represents how confident we are that a base was called correctly. A quality score of 40 means a 1 in 10,000 chance that the base was called incorrectly, so a higher score indicates a higher chance that the base is correct. Usually we trim off bases with a quality score below 20. Low-quality base calls usually occur at the 3' end of reads, since the polymerase attaches to the template less firmly there and is prone to adding the wrong nucleotide.
FastQC is a commonly used tool that gives an overview of read quality. It reports per base quality, per sequence quality, per base nucleotide content, per sequence GC content, number of duplicated sequences, overrepresented sequences, and adapter content metrics. The per base quality metric shows the distribution of quality scores as a box plot for each base position. The upper boundary of the yellow box is the third quartile of the quality scores, and the lower boundary is the first quartile. The red line in the middle of the box is the median quality score. The per sequence quality metric averages the base qualities of each read and shows the distribution of per-read quality. For good-quality reads, this distribution should skew toward the high-quality side.
The per base nucleotide content metric shows how often each nucleotide appears at each base position. In principle, the RNA has been randomly fragmented, so each nucleotide should be evenly distributed along the whole read length. RNA-Seq data is an exception: one study found that priming with random hexamers in the cDNA synthesis step selects for certain fragment sequences, which makes the nucleotide content at the 5' end of reads less random. These bases are still real bases from mRNAs and not artifacts, so it is fine to include them in short-read assembly. The per sequence GC content metric calculates the GC content of each read and reports the distribution of per-read GC content. GC content is like a fingerprint of a species: it has a constant value for a given species, and the same applies to the GC content of reads. The distribution of per-read GC content should peak at the value equal to that species' GC content. If the peak shifts away from the expected GC content, the cause may be contamination of the read dataset by a foreign species.
The number of duplicated sequences metric shows how many reads are exact matches of each other. In RNA-Seq datasets, highly expressed transcripts are usually sequenced repeatedly, which results in many duplicated reads. The overrepresented sequences metric reports reads that appear repeatedly and account for more than 0.1% of total reads. Adapters are the sequences most likely to be captured by this metric. The adapter content metric aligns reads against publicly known adapters and reports the alignment hits.
Apr 15, 2022
Fundamental Trigonometric Identities
Power Reducing Formulas
Fundamental Trigonometric Identities
Reciprocal identities
$\sec\theta=\frac{1}{\cos\theta}$
$\csc\theta=\frac{1}{\sin\theta}$
$\cot\theta=\frac{1}{\tan\theta}$
$\cos\theta=\frac{1}{\sec\theta}$
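Oct 11, 2020
The FASTQ format stores nucleotide sequences together with the quality of each base.
Sequence quality
Sequence quality is a value used to assess how reliable next-generation sequencing (NGS) results are. It is defined as:
$$Q = -10\log_{10}P$$
where Q is the sequence (base) quality and P is the probability that the base call is wrong.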
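The Phred+33 decoding and the formula $Q = -10\log_{10}P$ can be sketched in a few lines of Python. The function names and the sample read are hypothetical. The trimming rule simply cuts low-scoring bases from the 3' end and is not the algorithm used by any particular trimming tool.

```python
def phred33_scores(quality_line):
    """Convert a FASTQ quality string (Phred+33) into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the probability that a base call is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_line, min_q=20):
    """Drop bases from the 3' end until one with a score of at least min_q is reached."""
    scores = phred33_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_line[:end]


# Hypothetical read, made up for illustration.
sequence = 'ACGTA'
quality_line = 'II5+!'

scores = phred33_scores(quality_line)
print(scores)
print([error_probability(q) for q in scores])
print(trim_3prime(sequence, quality_line))
```

The default threshold of 20 corresponds to an error probability of 1 in 100.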
Aug 23, 2020