BWT 程式碼片段解釋

# BWT 程式碼片段解釋 ### fmindex.py #### 1. FMSimpleIndex._occ(): ```python= def _occ(self, qc): """ get the first occurance of letter qc in left-column""" occ = 0 #-------Your code should start here--------- occ = self.occ.get(qc) #--------Your code should end here---------- return occ ``` 程式碼解說： ``` line 1: 表示 _occ function 接收一參數 qc。 line 3: 先將 occ 歸零。 line 6: 利用 self.occ.get() 取得 qc 的 occurance value。其中，self.occ 是由 bwt.calc_first_occ(self.data) 計算出。 self.occ 是 dictionary 型態，因此透過 .get() 內建函示即可取得 qc 的 occ value。 line 8: 回傳 occ 值。 ``` ----- #### 2. FMSimpleIndex._count(): ```python= def _count(self, idx, qc): """ count the occurances of letter qc (rank of qc) upto position idx """ if not qc in self.occ.keys(): return 0 count = 0 #-------Your code should start here--------- count = self.data[:idx].count(qc) #--------Your code should end here---------- return count ``` 程式碼解說： ``` line 1: 表示 _count function 接收兩參數 idx 與 qc。 line 5: 先檢查 qc 是否存在 occ dictionary 清單內，若無，則 return 0。結束 _count() line 9: 計算 count 值 self.data[:idx]: 將字串 transformedBWT 進行修剪。讓字串只取到 idx 位置，後面接刪除。接著使用 python 內建函示 .count() 計算剩餘字串中，有幾個 qc。 line 11: 回傳 count 值。 ``` ----- #### 3. FMSimpleIndex._lf(): ```python= def _lf(self, idx, qc): #-------Your code should start here--------- """ return the nearset lf mapping for letter qc at position idx """ lf_result = self._occ(qc) + self._count(idx, qc) return lf_result #--------Your code should end here---------- ``` 程式碼解說： ``` line 1: 表示 _lf function 接收兩參數 idx 與 qc。 line 5: 計算 lf 值。計算方法為呼叫兩 fmindex.py 內建函示 _occ() 與 _count() 分別計算出 occ 值與 count 值後，加總，即為 lf 值。 line 6: 回傳 lf 值。 ``` ----- #### 4. FMSimpleIndex._bounds(): ```python= def bounds(self, q): """ find the first and last suffix positions for query q """ """These are positions in the BWT string""" """This is the meat of the FM search algorithm""" top = 0 bot = len(self.data) #-------Your code should start here--------- #iterate over letters in query string q in reverse #reverse will change q, so use [::-1] re_q = q[::-1] for qc in re_q: # calculate top (which maps the position # in the last column to the position in the first column) top = self._lf(top, qc) # calculate bottom bot = self._lf(bot, qc) #--------Your code should end here---------- if top == bot: return (-1,-1) #since bottom is non-inclusive, #top==bot implies that the string was not found. return (top,bot) ``` 程式碼解說： ``` line 1: 表示 bounds function 接收一參數 q。 line 5: 先將 top bound 設定為 0。 line 6: 再將 bottom bound 設定為 transformedBWT 長度。 line 11: 將 q，即欲尋找之字串，反轉。 line 13: 用 for 迴圈，逐一取出字元。 line 16: 透過 fmindex.py 內建函示 _lf 進行 top bound 夾擊，產生新 top bound。 line 17: 透過 fmindex.py 內建函示 _lf 進行 bottom bound 夾擊，產生新 bottom bound。 line 21: 若 top bound 與 bottom bound 相等，便回傳 (-1, -1)，亦即找不到字串。 line 22: 反之，若 top/bottom bound 不相等，則回傳 (top, bot)。 ``` ----- ### bwt.py #### 5. BurrowsWheeler.transform() ```python= def transform(self, s): """ Simplest Burrows-Wheeler transform implementation, O(n^2) respective to the length of the text. """ assert self.EOS not in s, "Input string cannot contain null character (%s)" % self.EOS # add end of text marker # please note that although this code # uses the null marker to mark the end of the string, # i've replaced it with "$" just for the output # to help you debug your code -BZ s += self.EOS bwtTransformedString='' # s is the input string, # with the end-of-string character added for you #-------Your code should start here--------- # generate a list of rotated input strings inputstring = s * 2 after_rotation = [] for i in range(0, len(s)): do_rotation = inputstring[i:i+len(s)] after_rotation.append(do_rotation) # sort the table of rotations sort_result = sorted(after_rotation) # extract the last characters of each row for line in sort_result: result = line[-1] '''convert the characters to a string, and store them in the variable bwtTransformedString ''' bwtTransformedString = bwtTransformedString + result #--------Your code should end here---------- return bwtTransformedString ``` 程式碼解說： ``` line 1: 表示 transform function 接收一參數 s。 line 4: 排除字串中有 null character。 line 12: 將 null character 塞到 s 最後面，當 end of the string。 line 13: 將 bwtTransformedString 設為空字串。 line 19: 為方便 rotation，字串位移，我們先將 s * 2。 line 20: 宣告 after_rotation 為空 list。 line 23: 透過 for 迴圈，搭配 python string slicing 功能，進行字串 rotation。 line 24: 將 rotation 結果，append 進 after_rotation list 中。 line 27: 使用 python 內建函式 sorted()，將 after_rotation 進行排序，並將結果傳給 sort_result。 line 30: 使用 for 迴圈，將 sort_result list 逐一取出。 line 31: 將 line 30 取出之結果，透過 python string 操弄，取出最後一個字元，並將結果傳給 result。 line 36: 將 line 31 結果逐一 append 到 bwtTransformedString 字串上。 line 40: 回傳 bwtTransformedString。 ``` ----- #### 6. calc_checkpoints() ```python= def calc_checkpoints(s, step): #returns the list of checkpoints, seperated by distance "step", each of which is a dictionary of the #number of occurences of each character up to that point (noninclusive) -BZ """ count the number of letters for each step and return list of the counts""" A = {} # letter count C = [] # checkpoints, a list of letter count dictionaries #-------Your code should start here--------- for ckp_index, q in enumerate(s): point = ckp_index % step if point == 0: C.append(A.copy()) if A.get(q): A[q] = A[q] + 1 else: A[q] = 1 #--------Your code should end here---------- return C ``` 程式碼解說： ``` 此函式之目的為加速 count 值之計算。在 index () 挑選 FMCheckpointing() 才會啟用。 line 1: 表示 transform function 接收一參數 s。 line 6: 給定一空 dictionary，存放 letter count，該 dict 命名為 A。 line 7: 給定一空 list，存放，各 check point 當時的 letter count。 line 11: 利用 enumerate() 搭配 for 迴圈，將 string 逐一取出，並授與編號。 line 12: 計算 checkpoint 檢查條件。 line 13: 透過檢查條件，判斷是否為 checkpoint。若 index number % step 之餘數為零則為 checkpoint。 line 14: 若為 checkpoint，則將當時的 letter count 狀態 A，存入 list C 中。 line 16: 若可在 letter count dictionary A 中，尋找到 q 字元，則將 letter count + 1。 line 19: 若無法在 letter count dictionary A 中，尋找到 q 字元，則表示該字元為第一次出現，該 letter count 值 = 1。 line 22: 回傳 C (各 checkpoint 的 letter count 狀態)。ｓ ``` ----- # -*- coding: utf-8 -*- print("讓事情變不可能的，是「人」，不是「事」。")