Alignment - HackMD

# Alignment Nguồn: tham khảo tại các tài liệu về [Sequence Alignment](https://www.ncbi.nlm.nih.gov/books/NBK464187/#ch20_1.sec20.1.8) Rating: 1800. ## Đề bài Trong Tin Sinh học, bài toán ***Short-read Alignment*** là bài toán tìm vị trí của một đoạn trình tự ngắn (short read) trong một trình tự dài (reference sequence). Việc tìm vị trí của các short read trong trình tự tham chiếu là một bài toán quan trọng có nhiều ứng dụng tin sinh học, bao gồm giải mã hệ gen, khám phá các biến thể di truyền, phân loại sinh vật, nghiên cứu phát sinh loài,... từ đó hé lộ những thông tin quan trọng về nguồn gốc tiến hoá hoặc chức năng cấu trúc chung. #### Yêu cầu: Cho một chuỗi DNA $S$ độ dài $n$ và $m$ chuỗi DNA $T_1, T_2,... T_m$. Tìm vị trí xuất hiện của $T_i$ ($1 \le i \le m$) trong chuỗi $S$. ## Input: - Dòng $1$ gồm 2 số nguyên dương $n,m$ ($1 \le n \le 10^6, 1 \le m \le 10^6$). - Dòng $2$ chứa xâu $S$ độ dài $n$. - $m$ dòng tiếp theo, dòng thứ $i+2$ chứa xâu $T_i$ ($1 \le i \le m$). Gọi tổng độ dài $m$ xâu này là $x$ thì $x \le 10^6$. ## Output - In ra $m$ dòng, dòng thứ $i$ bao gồm các vị trí đầu mà $T_i$ xuất hiện trong $S$ theo thứ tự tăng dần. Nếu $T_i$ không xuất hiện trong $S$, dòng $i$ in ra kết quả $-1$. kết quả đảm bảo số kí tự in ra không quá $10^7$. ## Sample ## Giới hạn - Subtask 1:($20\%$) $1 \le n \le 10^3$. - Subtask 2:($20\%$) $1 \le n \times m \le 10^6$. - Subtask 3:($30\%$) $1 \le n \le 10^5$. - Subtask 4:($30\%$) Không giới hạn gì thêm. ## Lời giải: <details> <summary>Lời giải</summary> </details> **1. Subtask 1:** $1 \le n \le 10^3$. Ý tưởng: sủ dụng Prefix Trie. ĐPT: $O(n^2)$. **2. Subtask 2:** $1 \le n \times m \le 10^6$. Ý tưởng: sủ dụng [KMP](https://vnoi.info/wiki/algo/string/kmp.md)/[Rabin-Karp](https://vnoi.info/wiki/algo/string/hash.md). ĐPT: $O(n \times m)$. **3. Subtask 3:** $1 \le n \le 10^5$. Ý tưởng: sử dụng [Suffix Array](https://vnoi.info/wiki/algo/data-structures/suffix-array), sau đó chặt nhị phân để tìm vị trí. ĐPT: $O((n+m)log(n))$. <details> <summary>Code </summary> ```cpp= #include <bits/stdc++.h> using namespace std; int N, M; string S, T[1000006]; int hehe[1000006]; struct suffix { int index; // to store original index int ranks[2]; // to store ranks and next rank pair }; bool cmp(suffix a, suffix b) { if (a.ranks[0] == b.ranks[0]) return a.ranks[1] < b.ranks[1]; return a.ranks[0] < b.ranks[0]; } int *buildSuffixArray(string txt, int n) { suffix suf[n]; for (int i = 0; i < n; i++) { suf[i].index = i; suf[i].ranks[0] = txt[i] - 'a'; suf[i].ranks[1] = ((i + 1) < n)? (txt[i + 1] - 'a'): -1; } sort(suf, suf + n, cmp); int ind[n]; for (int k = 4; k < 2*n; k*=2) { // Assigning rank and index values to first suffix int rak = 0; int prev_rank = suf[0].ranks[0]; suf[0].ranks[0] = rak; ind[suf[0].index] = 0; // Assigning rank to suffixes for (int i = 1; i < n; i++) { prev_rank = suf[i].ranks[0]; if (suf[i].ranks[0] == prev_rank && suf[i].ranks[1] == suf[i - 1].ranks[1]) suf[i].ranks[0] = rak; else suf[i].ranks[0] = ++rak; ind[suf[i].index] = i; } // Assign next rank to every suffix for (int i = 0; i < n; i++) { int nextIndex = suf[i].index + k/2; suf[i].ranks[1] = (nextIndex < n)? (suf[ind[nextIndex]].ranks[0]): -1; } ///radixSort(suf, n); sort(suf, suf + n, cmp); } int *sufArr = new int[n]; for (int i = 0; i < n; i++) sufArr[i] = suf[i].index; return sufArr; } int main() { ios_base::sync_with_stdio(0); cin.tie(0); cout.tie(0); cin >> N >> M; cin >> S; for (int i = 1; i <= M; i++) cin >> T[i]; int *SA = buildSuffixArray(S, N); for (int i = 1; i <= M; i++) { int m = T[i].size(); int l = 0, r = N-1; int maxi = -1; int mini = -1; while (l <= r) { int mid = (l + r) / 2; int res = 0; for (int j = 0; j < m; j++) { if (j + SA[mid] >= S.size() || T[i][j] > S[j + SA[mid]]) { res = 1; // t[i] > sx break; } else if (T[i][j] < S[j + SA[mid]]) { res = -1; break; } } if (res == 1) l = mid + 1; else { maxi = mid; r = mid - 1; } } l = 0; r = N-1; while (l <= r) { int mid = (l + r) / 2; int res = 0; for (int j = 0; j < m; j++) { if (j + SA[mid] >= S.size() || T[i][j] > S[j + SA[mid]]) { res = 1; // t[i] > sx break; } else if (T[i][j] < S[j + SA[mid]]) { res = -1; break; } } if (res == -1) r = mid - 1; else { mini = mid; l = mid + 1; } } swap(mini, maxi); if (maxi == -1 || mini == -1 || mini > maxi) { cout << -1 << '\n'; continue; } for (int j = mini; j<= maxi; j++) hehe[j-mini] = SA[j]; sort(hehe, hehe + (maxi - mini+1)); for (int j = 0; j <= maxi - mini; j++) cout << hehe[j] << ' ' ; cout << '\n'; } return 0; } ``` </details> **4. Subtask 4:** $1 \le n \le 10^6$. Ý tưởng: sử dụng [Burrows–Wheeler Transform](https://web.stanford.edu/class/cs262/notes/lecture4.pdf) (BWT) ```cpp= //Index: P[1..n] pair<int, int> ExactMatch(string P) { int n = P.size(); char c = P[n]; int low = C(c) + 1; int up = C(c+1); //next char int pos = n-1; while (low <= up && pos >= 1) { c = P[pos]; int i = O(c, low-1); int j = O(c, up); low = C(c) + i + 1; up = C(c) + j; pos = pos - 1; } return {low, up}; } ```