110 選手班 - 字串

tags: `宜中資訊` `CP`

Ccucumber12
2021.08.24

Outline

Trie
KMP
Z-algorithm
Rolling hash

Trie

Problem

給定 \(N\) 個字串
請支援以下詢問 \(Q\) 次
- 給定字串 \(S\)，請問 \(N\) 個字串中出現幾個 \(S\)
\(N,Q \leq 10^5，|S| \leq L = 100\)

Naive

儲存所有字串
每次詢問就遍歷一次
Time: \(\mathcal{O}(NQL)\)

Binary Search

儲存所有字串並排序
對於每個詢問二分搜
Time: \(\mathcal{O}(NL\log N + QL\log N)\)

Better？

\(\mathcal{O}(NL + QL)\)？

Trie

命名：Retrieval
發音：
- /tri/: 發明者(tree)
- /trai/: 其他人(try)
字典樹，字首樹

Key

英文字串字母數量有限 (26個)
將共同前綴合併
- 節省空間
- 加速尋找

每個節點(分岔)代表一個字母
紀錄每個字串的結尾位置
每個從根結點出發的路徑代表一個字串

node




struct node {
    int count ;
    node *nxt[26] ;
};


































struct node {
    int cnt ;
    node *nxt[26] ;

    node () {
        cnt = 0 ;
        for(int i=0; i<26; ++i)
            nxt[i] = nullptr ;
    }
} ;

node *root = new node() ;

void modify(string s, int val) {
    node *tmp = root ;
    for(auto i:s) {
        i -= 'a' ;
        if(tmp -> nxt[i] == nullptr)
            tmp -> nxt[i] = new node() ;
        tmp = tmp -> nxt[i] ;
    }
    tmp -> cnt += val ;
}

int count(string s) {
    node *tmp = root ;
    for(auto i:s) {
        i -= 'a' ;
        if(tmp -> nxt[i] == nullptr)
            return 0 ;
        tmp = tmp -> nxt[i] ;
    }
    return tmp -> cnt ;
}

Time Complexity

Insert: \(\mathcal{O}(L)\)
Query: \(\mathcal{O}(L)\)
Total: \(\mathcal{O}(NL + QL)\)

Sorting

依照字母順序 DFS

Binary

處理二進位字串
二元樹
二分搜

KMP

Problem

給定字串 \(S\) 與字串 \(T\)
請問 \(S\) 在 \(T\) 中出現幾次？
\(|S| \leq |T| \leq 5\times 10^5\)

在 文本(Text) 中尋找欲求的的 段落(Pattern)

Naive

枚舉每個 \(T_i\) 當作開頭
往右檢查是否與 \(S\) 相同
Time: \(\mathcal{O}(TS)\)

KMP

Knuth–Morris–Pratt algorithm
三個人共同發表
Morris-Pratt Algorithm (MP) 優化而成

減少比對次數
如果比對完 \(S[0:9]\) 發現第九格不對
前面 \(S[0:8]\) 有沒有辦法提供甚麼資訊？

Failure Function

\(F[i] :=\) 當成功配對 \(s[0, i-1]\) 直到 \(s[i]\) 失敗時，將 \(F[s[i]]\) 對齊原本 \(s[i]\) 的位置

a	b	a	b	b	a	a	b	a	b	b	a	b	a	a
-1	-1	0	1	-1	0	0	1	2	3	4	5	1	2	0

\(F[i] = k\)
\(k\) 為滿足 \(s[0:k]=s[i-k:i]\) 的最大值
若不存在，則 \(k = -1\)
對於每個前綴，尋找其後綴與整個字串最大相同長度

Naive

根據定義，對於每個位置枚舉所有 \(k\) 檢查
枚舉 \(k=[0,i]\)
Time: \(\mathcal{O}(|S|^3)\)

Optimization 1

相鄰的 failure function 至多相差 \(1\)
若 \(s[i+1]=s[F[i]+1]\)，則\(F[i+1]=F[i]+1\)
若不相同，則枚舉 \(k=[0,F[i]]\)
Time: \(\mathcal{O}(|S|^2)\)

Optimization 2

當 \(s[i+1] \neq s[F[i]+1]\) 時
希望找到次長 \(k'\) 使得 \(s[0:k']=s[i-k':i]\)
若 \(s[i+1] = s[k'+1]\)，則 \(F[i+1]=k'+1\)
若不同，則找第三長 \(k''\dots\)

Time Complexity

增長：\(O(1)\)
減少：\(O(1)\)
每格最多增長 \(1\)，最多減少 \(|S|\) 次
Total：\(\mathcal{O}(|S|)\)









void build(string s) {
    f[0] = -1 ;
    for(int i=1; i<s.size(); ++i) {
        f[i] = f[i-1] ;
        while(f[i] > -1 && s[i] != s[f[i] + 1])
            f[i] = f[f[i]] ;
        f[i] += (s[i] == s[f[i] + 1]) ;
    }
}

KMP

拿 \(T\) 跟 \(S\) 和 Failure Function 比較
若成功配對則往前一格
若配對師被則用 \(F[i]\) 跳轉
當成功長度達 \(|S|\) 即找到一個出現在 \(T\) 中的 \(S\)
本質上跟尋找 Failure Function 一樣

Time Complexity

Failure Function：\(\mathcal{O}(|S|)\)
尋找配對：\(\mathcal{O}(|T|)\)
Total：\(\mathcal{O}(|S|+|T|)\)











int kmp(string s, string t) {
    int k = -1, ret = 0 ;
    for(int i=0; i<t.size(); ++i) {
        while(k > -1 && t[i] != s[k + 1])
            k = f[k] ;
        k += (t[i] == s[k + 1]) ;
        if(k == s.size() - 1) 
            k = f[k], ++ret ;
    }
    return ret ;
}

Z-algorithm

Problem

給定字串 \(S\) 與字串 \(T\)
請問 \(S\) 在 \(T\) 中出現幾次？
\(|S| \leq |T| \leq 5\times 10^5\)

Z-Value
Z 函數 / Z 演算法
擴展 KMP (中國)

Z Value

\(z[i] = \max(p: s[0:p) = s[i:i+p))\)
對於每個後綴，尋找其前綴與整個字串最大相同長度

a	b	a	b	b	a	a	b	a	b	b	a	b	a	a
0	0	2	0	0	1	6	0	2	0	0	3	0	1	1

WHY

將段落(pattern) \(S\) 與文本(text) \(T\) 連接在一起
中間加上特殊字元 (@)
S @ T

\(a = S + @ + T\)
\(z[i] \leq |S|\) (有 @ 卡住)
當 \(z[i] = |S| = k \implies a[0:k)=a[i:i+k)\)
\(\implies S = a[i:i+k)\)
\(\implies\) \(S\) 在 \(i\) 這個位置出現

維護區間 \([L,R]\)，代表從 \(L\) 之後與整個字串的最長相同前綴
若完全不相等，則 \(L=R\)

對於步驟 \(i\)
\(i > R\)：重置 \(L,R = i\)，計算最長的 \(R\)
\(i \leq R\)：令 \(k = i-L\)，\(z[i] \geq \min(z[k], R-i+1)\)
- \(z[k] < R-i+1\)：無法再延長了，\(z[i]=z[k]\)
- \(z[k] \geq R-i+1\)：有可能延長，設 \(L=i\)，將 \(R\) 往後繼續比對，\(z[i]=R-L+1\)

模擬動畫
http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm

Time Complexity

雙指標：\(\mathcal{O}(|S| + |T|)\)

























vector<int> zAlgo(string s) {
    int l = 0, r = 0;
    vector<int> z(s.size()) ;

    for(int i = 1, n = s.size(); i < n; ++i) {
        if(i > r) {
            l = r = i ;
            while(r < n && s[r - l] == s[r])
                r++ ;
            z[i] = r - l ;
            r-- ;
        } else {
            int k = i - l ;
            if(z[k] < r - i + 1) z[i] = z[k] ;
            else {
                l = i ;
                while(r < n && s[r - l] == s[r])
                    r++ ;
                z[i] = r - l ;
                r-- ;
            }
        }
    }
    return z ;
}

Rolling Hash

Problem

字串比對
Time: \(\mathcal{O}(|S|)\)
把字串 hash 成一個整數以達到 \(\mathcal{O}(1)\) 比對

Hash

雜湊 / 哈希(中國)
hash值不同 \(\implies\) 原始值必定不同
hash值相同 \(\implies\) 原始值不一定相同
碰撞：兩個不同的原始值 hash 成相同的值
預期碰撞機率 \(\frac{n}{C}\)

Rolling Hash

多項式 hash 法
\(f(s) = \sum\limits_{i=1}^{l} s[i] \times b^{l-i} \pmod M\)
ex. \(s="xyz"，f(s)=xb^2+yb+z\)

\(M\)：大質數
\(b\)：\([1,M)\) 間任意正整數
\(f(s)\) 與 \(f(t)\) 碰撞機率：\(\frac{l-1}{M}\)

Time Complexity

預處理：\(\mathcal{O}(|S|)\)
比較：\(\mathcal{O}(1)\)













const int N = 500010 ;
const int M = 1e9 + 7 ;
const int b = 313 ;

long long hs[N];

void init(string s) {
    int n = s.size();
    bp[0] = 1 ;
    for(int i=1; i<=n; ++i) {
        hs[i] = (hs[i-1] * b + s[i-1]) % M ;
    }
}

取得子字串

想知道 \(s[l,r]\) 的 hash 值
\(f(s[l,r]) = f(s_r) - f(s_{l-1})\times b^{r-l+1}\)
預處理 \(b^i\)


















const int N = 500010 ;
const int M = 1e9 + 7 ;
const int b = 313 ;

long long hs[N], bp[N] ;

void init(string s) {
    int n = s.size();
    bp[0] = 1 ;
    for(int i=1; i<=n; ++i) {
        hs[i] = (hs[i-1] * b + s[i-1]) % M ;
        bp[i] = bp[i-1] * b % M ;
    }
}

long long query(int l, int r) {
    return (hs[r+1] - hs[l] * bp[r-l+1] % M + M) % M ;
}

KMP or Z-Algorithm？

\(S = T[i, i+|S|)\)
預處理：\(\mathcal{O}(|S| + |T|)\)
尋找比對：\(\mathcal{O}(|T|)\)

Other

增刪字元：\(\mathcal{O}(1)\)
拼接字串：\(\mathcal{O}(1)\)

最長回文

枚舉中間點
二分搜長度
\(\mathcal{O}(N\log N)\)

Best \(M\) and \(b\)？
https://codeforces.com/blog/entry/60445

Advance

AC Automaton (KMP + Trie)
Suffix Array
Manacher
Suffix Automaton

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.

110 選手班 - 字串

tags: 宜中資訊 CP

Outline

Trie

Problem

Naive

Binary Search

Trie

Key

node

Time Complexity

Sorting

Binary

KMP

Problem

Naive

KMP

Failure Function

Naive

Optimization 1

Optimization 2

Time Complexity

KMP

Time Complexity

Z-algorithm

Problem

Z Value

Time Complexity

Rolling Hash

Problem

Hash

Rolling Hash

Time Complexity

取得子字串

Other

最長回文

Advance

Credit

tags: `宜中資訊` `CP`