Python Regular Expression

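---
tags: Python, IND
---

# Python Regular Expression

[TOC]

## What is regex?

A regular expression is, literally, "an expression that describes some rule." That sounds like it says nothing at all, right XD

Regex was originally used to classify formal languages, that is, to sort all kinds of oddly shaped expressions using rules that are as uniform as possible. Today it is also used in compilers: the input string is parsed into a form that is easy to analyze, and the corresponding instructions are then executed.

## Why we need regex?

~~Because it is handy (X~~

With regex we can turn strings of many different shapes into expressions that follow a fixed rule, and we can find the parts of a string that match a rule. Even for a string that looks complicated, a well-written pattern may be enough to pull out the part we need.

~~But re is also a mysterious thing OuO, it often gives me a headache~~

## Basic syntax

* `|` means `or` and has the lowest precedence. For example, `"is|are"` matches "is" or "are".

(The quantifiers `+`, `?`, and `*` below apply to a single character, or to a parenthesized group.)

* `+` means the preceding item must appear at least once ==(1 ~ n times)==
* `?` means the preceding item may appear at most once ==(0 or 1 time)==
* `*` means the preceding item may be absent, or appear once or many times ==(0 ~ n times)==
* `.` matches any character except a newline
* `^` matches the start of the string; with the `re.M` flag it also matches right after every newline

(The following syntax is commonly used.)

* `[]` is a char set
* `[^charset]` matches anything outside the char set, e.g. `[^6]` matches any character except '6'
* `[0-9]` or `\d` matches any digit (for `str` patterns, `\d` also matches other Unicode decimal digits unless `re.A` is set)
* `\D` matches any non-digit character; with `re.A` it is equivalent to `[^0-9]`
* `[a-z]` matches any lowercase English letter
* `[A-Z]` matches any uppercase English letter
* `[0-9a-zA-Z_]` or `\w` matches **decimal digits**, **upper- and lowercase English letters**, and the **underscore** (for `str` patterns, `\w` also matches other Unicode word characters unless `re.A` is set)
* `\W` matches any character that `\w` does not
* `()` is a group; whatever the pattern inside it matches is treated as one group

## Usage

Always import first.

```python
import re
```

If `string` contains a match for `pattern` anywhere, return a `Match` object for the first match only; otherwise return `None`.

```python
re.search(pattern, string[, flags])
```

Matching starts at the first character: if the beginning of `string` matches `pattern`, return a `Match` object, otherwise return `None`. To require that the whole string matches, end the pattern with `$` (or use `re.fullmatch`).

```python
re.match(pattern, string[, flags])
```

Return a `list` of all substrings that match `pattern`. If the pattern contains groups, the list holds the group contents instead (a tuple per match when there is more than one group).

```python
re.findall(pattern, string[, flags])
```

Same as above, but return an iterator in which every item is a `Match` object.

```python
re.finditer(pattern, string[, flags])
```

`re.compile(pattern)` turns the pattern into an object first, so it can be reused.

```python
compiled_pattern = re.compile(pattern)
compiled_pattern.match(string)
```

## Wait, what is a flag?

* `re.A` ASCII-only matching
* `re.I` ignore case
* `re.L` makes `\w`, `\W`, `\b`, `\B` and case-insensitive matching depend on the current locale (bytes patterns only)
* `re.M` multi-line mode: `^` and `$` also match at the start and end of every line
* `re.S` makes `'.'` match any character including a newline (by default `'.'` does not match a newline)
* `re.U` Unicode matching; this is already the default for `str` patterns in Python 3
* `re.X` ignores whitespace in the pattern and comments after `'#'`, for better readability

[Reference][re flag]

## Escape Character

When writing a pattern, some symbols clash with their own literal meaning. For example, a `'` or `"` that appears directly in a pattern is likely to be taken as the start or end of a Python string, so in Python they are written as `\'` and `\"`.

Similarly, `\n` stands for a newline, `\t` stands for a tab, and so on.

The observation is that whenever Python meets a `\`, it first looks at the next character and then decides which character the pair stands for.

### Speaking of which, what about `\` itself?

If you write a lone `\`, Python looks at what follows to decide whether it is an escape sequence. So to get a literal `\` we have to type `\\`, and to get two `\` we have to type `\\\\`. The number of backslashes doubles!

Another way is to put an `r` in front of the pattern to make it a raw string; then every `\` only needs to be typed once.

Examples:

* `print('\"')` $\rightarrow$ `"`
* `print(r'\"')` $\rightarrow$ `\"`

[Reference](https://en.wikipedia.org/wiki/Escape_character)

## Match vs Search

+ `re.match()` and `re.search()`
    + `re.match()` must match from the beginning of the string
    + `re.search()` does not have to

## It works in VSCode too!

+ ![](https://i.imgur.com/NeagfI1.png)

## Use cases

### Filename example

+ Sample input
    + `[123] Hello.xml`
    + `[456] World.xml`
+ Regex: `re.compile('\\[[0-9]*\\] (.*)\\.xml')`

### Finding files of a given type

Regex also works together with grep! Suppose there is a files.txt that looks like this:

```shell
$ cat files.txt
foo.txt
bar.txt
foo1.txt
bar1.doc
foobar.txt
foo.doc
bar.doc
dataset.txt
purchase.db
purchase1.db
purchase2.db
purchase3.db
purchase.idx
foo2.txt
bar.txt
$
```

To find the .txt files, you can happily do this:

```shell
$ cat files.txt | grep "\.txt"
```

and you get output like this:

![](https://i.imgur.com/sTrg1zA.png)

[Reference][files.txt]

### Dates and years

+ Regex

```python
import re
date_patterns = [
    re.compile('[0-9]+月[0-9]+日'),  # month and day, e.g. 3月14日
    re.compile('(前)?[0-9]+(年)?'),  # year, optionally BC (前); greedy + keeps the whole number
    re.compile('[0-9]+(世紀)?'),     # century, e.g. 21世紀
]
```

### Very simple English tokenization

+ Sample input
    + `Hello, I am Donald Trump. What is your name?`
+ Regex: `re.compile(r'[\w]+')`
+ Output (from `findall`)
    + `['Hello', 'I', 'am', 'Donald', 'Trump', 'What', 'is', 'your', 'name']`

### Extracting URLs

+ Regex: `re.compile('https?://[^ ]+')`
+ Question
    + There are other protocols too, such as FTP and TELNET. Can they all be written in a single expression?
+ Demo

```python
import re

slist = [
    'My Website http://abc.com/',
    'Why not google https://google.com/',
    'Go to telnet://ptt.cc/',
    'My VM SSH Remote ssh://localhost:2222/',
]

protocols = ['http', 'https', 'telnet', 'ssh']

# Join the protocol names with | so one pattern covers all of them
p = re.compile('(%s)://[^ ]+' % '|'.join(protocols))

for s in slist:
    print(p.search(s).group(0))
```

### Greedy Problem

+ When several matches are possible, quantifiers are greedy by default and take the longest string they can

```python
import re

p = re.compile(r'\(.*\)')

ss = 'ABC (AABBCC)'
print(p.search(ss).group(0))  # Output: (AABBCC), correct

ss = 'ABC (AABBCC) DEF (DDEEFF)'
print(p.search(ss).group(0))  # Output: (AABBCC) DEF (DDEEFF), wrong!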
```

+ Solutions
    + Use the non-greedy form
        + `re.search(r'\(.*?\)', s)`
    + Use another, trickier form
        + `re.search(r'\([^()]*\)', s)`

### POS tagging

+ Example sentence: `你(Nh) 是(SHI) 屬於(VG) 領導(VC) 者(Na) 型(Na) 。(PERIODCATEGORY)`
+ Regex: `re.compile(r'([^(]*)\(([^)]*)\) ?')`
+ Demo

```python
import re

# Group 1 is the word, group 2 is the tag inside the parentheses
p = re.compile(r'([^(]*)\(([^)]*)\) ?')
txt = '你(Nh) 是(SHI) 屬於(VG) 領導(VC) 者(Na) 型(Na) 。(PERIODCATEGORY)'
for pp in p.findall(txt):
    print(pp)
```

+ Output

```
('你', 'Nh')
('是', 'SHI')
('屬於', 'VG')
('領導', 'VC')
('者', 'Na')
('型', 'Na')
('。', 'PERIODCATEGORY')
```

### Wikitext cleanup

+ Goal
    + Strip the wiki markup and keep only the plain text
+ Code

{%gist /penut85420/c8dfce718eb37297f6e611714a59956e%}

## Reference

[re flag]: https://www.ibm.com/developerworks/cn/opensource/os-cn-pythonre/index.html
[files.txt]: https://www.cyberciti.biz/faq/grep-regular-expressions/
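
## Appendix: match, search and flags in action

The rules from the Usage and Match vs Search sections can be checked with a small sketch. The sample string `text` and every variable name here are made up for illustration; only the `re` functions and the `re.M` flag come from the standard library.

```python
import re

# Hypothetical two-line sample string, chosen only for this illustration
text = 'first line\nsecond line'

# re.match only tries at position 0; re.search scans the whole string
m_match = re.match(r'second', text)
m_search = re.search(r'second', text)

# re.fullmatch succeeds only if the pattern covers the entire string
full_ok = re.fullmatch(r'[0-9]+', '2024')
full_bad = re.fullmatch(r'[0-9]+', '2024 AD')

# Without re.M, ^ anchors only at the very start of the string;
# with re.M it also anchors right after every newline
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, flags=re.M)

print(m_match)             # None
print(m_search.start())    # 11
print(full_ok.group(0))    # 2024
print(full_bad)            # None
print(starts_default)      # ['first']
print(starts_multiline)    # ['first', 'second']
```

`re.fullmatch` is used here instead of ending the pattern with `$` because `$` also matches just before a trailing newline, which can be surprising when validating input.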