AC with SIMD - HackMD

###### tags: `pattern-matching` # AC with SIMD https://www.scpe.org/index.php/scpe/article/view/1572 論文整理 by Yoona ## Part 1 關於各個字串匹配演算法的特性分析 & 最後為啥是選擇 AC 作為其套用 SIMD 優化的算法 ### 綜觀分析 pattern matching 的 algorithmn complexity is measured by 1. running time of the algo 2. space memory required by the computation 3. set up time before the algo can be search thus total running time : 1+3 = a + bn, a : preprocessing time b : a constant indicating the number of comparisons made for “each character”, and n is the number of characters in the text being searched. The effectiveness of the algorithms is measured according to their “ability to lower the value of b.” Effective ones perform in less than one comparison per character searched. Many solutions for exact string-matching of multiple patterns have been developed. The most commonly used solutions are Aho-Corasick (AC), Wu-Manber (WM) and Commentz-Walter (CW) algorithms. However, most of them have been designed for ==moderately sized== pattern sets, and therefore they do not fit well with larger sets of patterns. -> set 大的話就不適合了 --- ### AC 的介紹利用有限狀態機的轉換 & 查看是否為 output 去實作，time 的部分 : pre part : is proportional to the total length of the patterns searching part : only proportional to the size of the text being processed major drawback : Its major drawback is the space memory required to store the automaton states, which increases with their number. ★ 這篇 paper 是透過對 processing step 的結構改變 -> state table 改成 next state function 的狀態機及更簡易的查表流程去減少 memory 的使用，但並未真正的對 memory 的使用或存放方式再進行根本上的改變已達到更大的優化效果，所以 if 可以套用老師上次說的學習模型去存放那應該會是別的向度的突破至於 processing time 的部分老師也提到了平行運算的優化部分，這方面也是第二個方向的突破 ### WM 特性 BM : 字符串匹配的Boyer-Moore算法 - 阮一峰的网络日志 (ruanyifeng.com) -> 從最後一個開始找是 BM 的變形，text is treated as blocks, preprocessing phase 要 bulid 3 tables (數量跟 AC 好像一樣，但查找過程依我觀察更複雜了XD) In practice, the algorithm is more suitable when dealing with patterns of similar length, and its running time does not increase in proportion to the size of the pattern set. But, in case of short patterns, the scanning time increases due to the decrease in characters shifting [5]. -> 當字串長度差不多長時，WM 的優勢是查找時間不會因為 set 變大而變長但是! another paper says : 樣本群中最短樣本的長度對於Wu-Manber演算法有決定性的影響，當最短樣本的長度太短，會使其演算法效能過慢，另外同字首的樣本數量過多也會拖累比對的速度而使得攻擊者可經由==設計封包內容大量使用此字首拖累比對的速度==，此稱為演算法攻擊(Algorithmic Attackes)。 btw, 其實一般網頁查找功能(比對單一字串)，比起 KMP，更常用的是 BM, 但是若是多個 pattern 的比較，AC 下去優化出來的方式可能會比 WM 厲害，且 WM 有"可被敵人設計"的弱點 ### CW exact multiple string matching solution that combines the concepts of Boyer-Moore and AC algorithms. 運作原理不多加贅述 XD，基本上就是比對方式是從後面開始(BM 的巧妙精神) BUT! With small numbers of patterns to look for, Boyer-Moore aspects of the CW algorithm can make it faster than the AC algorithm, but with larger numbers of patterns, the AC algorithm has a slight advantage [7, 8]. -> pattern set 若小，CW 快，若大，AC 勝! (阿現在的 pattern set 怎麼可能小，所以作者偏向用 AC --- 結合上面一些特性描述，我們可以看出各個演算法有各自的缺陷，就看怎麼去突破那針對這篇論文作者著重的重點 & SIMD 實作需要的特性，作者認為優化 AC 是他們的選擇，maybe 因為 WM 的 memory 看起來不會比 AC 的少 XD, 甚至 operate 過程比 AC 複雜(但人家就是利用比較麻煩的 pre & 當下的 operate 過程去搶到更多的時間的嗎) 而且可以被設計攻擊 QQ CW 若 pattern set 大，會輸 AC，而且 operate 過程跟 WM 一樣應該都比 AC 複雜，不知道能不能透過分析 or 改變結構去套用 SIMD AC 最大的缺陷是 memory 的問題，operate 過程看起來最有可能被套用 SIMD，看要怎麼去改變過程以優化 memory 同時讓其可套用 SIMD --- ## Part 2 SIMD 那說來說去到底套用 SIMD 需要哪些特質哩~ 首先 SIMD 是啥? SIMD : Single Instruction Multiple Data 原理 : What is SIMD ? - YouTube ![](https://i.imgur.com/iKVw9R5.png) ![](https://i.imgur.com/BaXBNdC.png) 就是將資料 vectorize (塞進 vector), 並可以用單一指令對 vector 內的 "每一格" 的值進行"同步運算" -> 交給會同步運算的硬體去做也就是如果原本的操作過程是 sequential implementation，但我們可以把他的 data vectorize, 那就可以用 SIMD 去變成平行運算並將時間大幅縮短 SIMD 的要求 : 中間的那個指令只可以是單一化的 data 是要可以塞進 vector 的 ( 若 data 是混和，互相牽連，或長相不齊的話，那 vectorize 會有困難 ) Programs can benefit from SIMD processing ==if they have highly repetitive loops, and use intensively instructions that operate on independent values in parallel.== The practical speed-up with vectorization comes from the ==efficient data movement== and the identification of vectorization possibilities in the program itself [17]. Vectorization ==fails if the structure of the program is not suitable for parallelisation and its data types are either mixed, dependent or not aligned.== --- ## AC 要如何優化&分析 - next state version of AC 目的 : 1. 減少 memory 2. 套進 SIMD / vectorize ### next state function version of AC algo : 將 AC 的 goto, fail function 兩個 function 結合成一個 function : next state function To eliminate unnecessary failure transitions in the previous version of the AC algorithm (the standard version), ==the next-move function of a deterministic finite automaton (DFA) can be used in place of the goto and failure functions.== Converting the algorithm to a DFA allows to ==indicate for each pair (given state, given character) the next state to move to, which is always a valid state. -> 該有限狀態機可以變成有了現在的 state & 現在來的 characcter 就可以知道下個 state 要去哪，而且 next state 永遠是 valid 的 (減少 invalid state 的轉換過程) -> 需要先將 finite state machine 建立好以方便使用，但建立時的規則會更加複雜== Theoretically, ==the DFA can reduce up to 50% of state transitions==, which would reduce extra comparisons and accelerate significantly the matching process. In practice such accelerations cannot be reached since the pattern-matching machine spends a lot of its time in state 0, from which there are no failure transitions [3] as shown in our experience results reported in Tables 5.1, 5.2 and 5.3. The searching process is easy once these tables are available. ==This method of encoding state transitions is more economical than storing them as a two-dimensional array.== ### result : <前> 實作時要存一個很大的 state table 去查表什麼的 ![](https://i.imgur.com/Sj8CsaT.png) <後> ![](https://i.imgur.com/0j0MSGg.png) ![](https://i.imgur.com/ooMg0Iv.png) 螢光筆的部分可以 vectorize 並同時查找，另外這種紀錄方式也可以減少所需記憶體 ![](https://i.imgur.com/BGHt9A1.png) ## 一些結論的部分 ![](https://i.imgur.com/CwtOYS5.png) Conclusion. In this paper, we have presented the vectorized Aho-Corasick algorithm in attempting to speed up the string-matching process. ==Experimental results clearly show that the vectorized version of the AhoCorasick algorithm, using the next-move function, yields better performance in terms of speed whose running time was up to five (×5) of its original counterpart (not vectorized).== The speed-up gained depends on the number of the patterns to search for and their length but not on the file size. 還有能多快也跟 programmer 怎麼去 vectorize & code structure 很相關，並且要根據硬體為其寫出適配的 code structure [351601.pdf (nctu.edu.tw)](https://ir.nctu.edu.tw/bitstream/11536/80777/1/351601.pdf) [259301.pdf (nctu.edu.tw)](https://ir.nctu.edu.tw/bitstream/11536/41910/1/259301.pdf)