生物序列分析演算法期中考Hit

* # 生物序列分析演算法期中考Hit 考古題來源:https://www.ptt.cc/man/NTU-Exam/DE0A/D245/D298/DE44/D64F/D48/index.html ## •Huffman code construction (The occurrence probabilities of nucleotides A (adenine), C (cytosine), G (guanine), T (thymine) are given.) >https://docs.google.com/presentation/d/1fSvSm5l2-X-Mtaem0BBTS5q7nQ0D_Hg0vL9OpWmtcLA/edit#slide=id.g163f743414f_0_584 > - 99.1: Suppose we are given a very long DNA sequence where the occurrence probabilities of nucleotides A (adenine), C (cytosine), G (guanine), T (thymine) are 0.1, 0.3, 0.4, and 0.2, respectively. 1. (10%): Construct a Huffman code for them. You should work out the binary tree constructionas well as the code assignment. >![](https://i.imgur.com/X9F9K7G.png) 2. (5%): By the above Huffman coding scheme, what is the binary string for a 10-nucleotide DNA sequence “GGGCTTCACG.” >A=11,C=100,G=0,T=101, >So the DNA sequence “GGGCTTCACG” =0 0 0 100 101 101 100 11 100 0 - 99.2: Suppose we are given a very long DNA sequence where the occurrence probabilities of nucleotides A, C, G, T are 0.3, 0.1, 0.2. and 0.4 respectively. Construct a Huffman code for them. You should work out the binary three construction as well as the code assignment. >![](https://i.imgur.com/3tDImHx.png) ## •Longest increasing subsequence (Show how the $O(n \log n)$-time algorithm works with a given test example.) Dynamic Programming - 99.1: In class, we introduced an O(n log n)-time algorithm for finding a longest increasing subsequence. Use h8; 2; 6; 4; 5; 7; 3; 1; 12; 9; 10i to explain how the algorithm works. > - 99.2: In class, we introduced an O(nlog n)-time algorithm fore finding a longest increasing subsequence of an input sequence of length n. Use <5,2,4,1,6,8,9,3,7> to explain how the algorithm works. > - step1: initialize, A[1]=5, creat and input 5 > 5 > - step2: A[2]=2, copy, extend and discard > 5 discarded > 2 > - step3: A[3]=4, copy and extend > 2 > 2, 4 > - step4: A[4]=1, copy, extend and discard > 1 > 2 discarded > 2, 4 > - step5: A[5]=6, copy, extend and discard > 1 > 6 discarded > 1, 6 discarded > 2, 4 > 2, 4, 6 > - step6: A[6]=8, copy, extend and discard > 1 > 8 discarded > 1, 8 discarded > 2, 4 > 2, 4, 8 discarded > 2, 4, 6 > 2, 4, 6, 8 > - step7: A[7]=9, copy, extend and discard > 1 > 9 discarded > 1, 9 discarded > 2, 4 > 2, 4, 9 discarded > 2, 4, 6 > 2, 4, 6, 9 discarded > 2, 4, 6, 8 > 2, 4, 6, 8, 9 > - step8: A[8]=3, copy, extend and discard > 1 > 3 discarded > 1, 3 > 2, 4 discarded > 2, 4, 6 > 2, 4, 6, 8 > 2, 4, 6, 8, 9 > - step9: A[9]=7, copy, extend and discard > 1 > 7 discarded > 1, 7 discarded > 1, 3 > 1, 3, 7 discarded > 2, 4, 6 > 2, 4, 6, 7 > 2, 4, 6, 8 discarded > 2, 4, 6, 8, 9 <-- LIS List ## •Two algorithms for the maximum-sum segment problem (Show how they work with a given test example.) - 99.1: Problem 3 (10%): Given a sequence of real numbers A = ha1; a2; : : : ; ani, the maximum-sum segment problem is to find a consecutive subsequence, i.e., a substring or segment, in A with the maximum sum. Let prefix sum P[i] = Pi j=1 aj be the sum of the first i elements. Explain how to use the prefix sum to deliver the maximum-sum segment in O(n) time. In the following, we are given two sequences A = ha1; a2; : : : ; ami and B = hb1; b2; : : : ; bni. An alignment of A and B is obtained by introducing dashes into the two sequences such that the lengths of the two resulting sequences are identical and no column contains two dashes. Let § denote the input symbol alphabet. A score ¾(a; b) is defined for each (a; b) 2 § ￡ §. The score of an alignment is the sum of ¾ scores of all columns with no dashes minus the penalties of the gaps. > - 99.2: We assume the following simple scoring scheme. A match is given a bonus score 8, a mismatch is penalized by assigning score -5, and the gap penalty for each gap symbol is -3. > ![](https://i.imgur.com/hg6Pwam.png) 1. Write down the dynamic-programming matrix for computing an optimal global alignment of the sequences ATGGC and AGGTC. Print the global alignment and its score > ![](https://i.imgur.com/xtBvGZY.png) > A - G G T C > A T G G - C > 8-3+8+8-3+8=26 2. Write down the dynamic-programming matrix for computing an optimal local alignment of the sequences GATGGCT and TAGGTCG. Print the local alignment and its score. > ![](https://i.imgur.com/q7RQgAz.png) > A T G G - C > A - G G T C > 8-3+8+8-3+8=26 > ![](https://i.imgur.com/6QtS2O5.png) > T - G G - C > T A G G T C > 8-3+8+8-3+8=26 > ![](https://i.imgur.com/5treK7z.png) > T - G G - C > T A G G T C > 8-3+8+8-3+8=26 > ![](https://i.imgur.com/9zWq0TZ.png) > A T G G - C > A - G G T C > 8-3+8+8-3+8=26 3. Assume further that opening up a gap is charged an additional cost -8. Write down the dynamic-programming matrix for computing an optimal global alignment of the sequences ATGGC and AGGTC. Print the global alignment and its score. > A - G G T C > A T G G - C > ^ ^ > 8-3+8+8-3+8=26 > Alignment score = 26-8-8 =10 ## •Pairwise alignment (Show how to compute $S^{-}[i, j]$, $S^{+}[i, j]$, local alignment, and affine gap penalties for two given test sequences.) ![](https://i.imgur.com/MfRex7j.png) > ![](https://i.imgur.com/AJjAiZ4.png) > > https://www.csie.ntu.edu.tw/~kmchao/bioinformatics14spr/various_schemes.pdf ## •Some details about Hirschberg's linear-space > http://www.cs.nthu.edu.tw/~wkhon/algo09/tutorials/tutorial-hirschberg.pdf > https://www.cs.cmu.edu/~ckingsf/class/02-714/Lec07-linspace.pdf > https://blog.csdn.net/weixin_43934036/article/details/105437343 > ## •Algorithms for the problem of computing all $\Delta$-points > ![](https://i.imgur.com/J6INkjq.png) > ![](https://i.imgur.com/TD1AeJI.png) > Same with global alignment > ![](https://i.imgur.com/SSVPgZN.png) > ![](https://i.imgur.com/Qo4ftQd.png) > 綠色部分總和皆為14 > ![](https://i.imgur.com/edYwIr1.png) > ![](https://i.imgur.com/kdZ4tRP.png) ## •Scoring schemes for multiple sequence alignment (gap-start, gap-end, and quasi-gaps) ![](https://i.imgur.com/4x9K3QJ.png) > quasi-gap應該是T型的gap(猜的) 生醫資訊導論期中考評分標準說明 > https://www.csie.ntu.edu.tw/~kmchao/bioinformatics14spr/mid1_grading_criteria.pdf