###### tags: `school meterial`, `IR`,`public` # Tolerancw Retrideval ## Dictionary - hash table - look up time O(1) 快 - 但無容錯性 - tree - O(log M) (balance tree) - save prefix problem ### Wildcard Query 1. mon* - B tree - mon<=w<moo retrieve這個範圍之間所有words 2. \*mon - maintain an backward B-tree - retrieve range: mon<=u< non 3. mon\*mom - B-treere : trieve mon<=w< moo - backward B-tree : mon<=u< non #### Bigram(K-Gram) Index - 把句子切成k個為一組的sequence k-grams(bigrams) - 其中使用$去分開每個詞 - maintain **a second inverted** index from bigrams to dictionary terms that martch each bigram > kitten: \$ki kit itt tte en\$ > \$ki \rightarrow (kinkiten, kitchen, kitten, ...)\$$ >  1. 建所有gram的invertedindex, 再將查詢改成算式 2. Wildcard query: \$kit*en\$ : \$ki AND kit AND en\$ 3. 以算式\$ki AND kit AND en\$去查表再做operations - 錯誤範例  - 選擇k大小 | | K 大 | K 小 | | -------- | -------- | -------- | | pro | 精準,少podcast | 表豐富。多機會命狀 | -------- | -------- | -------- | | con | 表沒啥可放 | 需多postprocessing | ### Misspelled Correction **when : not in vocab** - appraoch: 1. find fimilar term 2. caculate their similarity to the query term 3. choose the most frequent ones - distance - Edit distance (Levenshtein distance) - Weighted edit distance - n-gram overlap #### Edit distance 把一個word轉成另一的worf所需要的最小步驟數量 E.g., the edit distance from dof to dog is 1 From cat to act is 2 from cat to dog is 3. #### Weighted edit distance Edit distance加上keyboard去討論每個字打成另一個字的可能性 - requires weight matrix as input | Column 1 | Column 2 | Column 3 | | -------- | -------- | -------- | | Text | Text | Text | #### **n-gram overlap** ##### problem : overmapping時會偏向長words solve: **Jaccard coefficient** $$\frac{交集}{連集}$$ edit distance $m\times n\times v$次 bi-gram : $(m + n)\times v$次 - v: vocab ##### wrong word cus: - Misspelled - 單詞本身的拼寫錯誤 - Context-sensitive - 指單詞本身的拼寫是正確的,但是在特定上下文中卻出現了錯誤 - try all - - hit-based spelling correction - 將錯誤的單詞替換為在文本中出現頻率較高的正確單詞 ## Dictionary Compression
×
Sign in
Email
Password
Forgot password
or
By clicking below, you agree to our
terms of service
.
Sign in via Facebook
Sign in via Twitter
Sign in via GitHub
Sign in via Dropbox
Sign in with Wallet
Wallet (
)
Connect another wallet
New to HackMD?
Sign up