
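---
title: regex
tags: learning notes, regex, Python
---

# re - Regular expression operations

* src: (https://docs.python.org/3/library/re.html)
* The `re` module provides pattern matching similar to Perl's.
* Both the pattern and the string being searched can be Unicode strings ([str](https://docs.python.org/3/library/stdtypes.html#str)) or 8-bit strings ([bytes](https://docs.python.org/3/library/stdtypes.html#bytes)), but the two types cannot be mixed in one operation.
* Regular expressions and Python string literals both use the backslash ("\\"), and the two usages collide.
    * Python uses `\` to introduce escape sequences in string literals, while an RE uses `\` either to introduce a special form or to strip a character of its special meaning.
    >> Regular expressions use the backslash character ('\\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This **collides with** Python’s usage of the same character for the same purpose in string literals
    * The fix is to prefix the pattern string with "r" (raw string), which turns off Python's own escape processing.
* An important point: most regular expressions can be compiled first, which speeds up matching when the pattern is reused.

## Regular Expression Syntax

* The functions in the `re` module let you check whether a particular string matches a given RE.
* REs can be concatenated to form new, more complex REs.
>> This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references.
* An RE can contain both ordinary and special characters.
    * Characters such as "|" and "(" are special: they either stand for classes of ordinary characters or change how the REs around them are interpreted.

### Special characters

|Pattern|Match (default)|Fine-tuning|Match (after)|
|:-:|-|:-:|-|
|`.`|Any character except a newline|re.S, re.DOTALL, (?s)|Any character, including a newline|
|`^`|Start of the string|re.M, re.MULTILINE, (?m)|Start of the string and of each line|
|`$`|End of the string, or just before a newline at the end of the string|re.M, re.MULTILINE, (?m)|End of each line (just before its newline) as well as the end of the string|
|`*`|Quantifier: 0 or more repetitions, as many as possible (greedy)|
|`+`|Quantifier: 1 or more repetitions, as many as possible (greedy)|
|`?`|Quantifier: 0 or 1 repetition, as many as possible (greedy)|
|`*?`, `+?`, `??`|Quantifiers that match as few repetitions as possible (non-greedy/minimal fashion)|
|`{m}`|Exactly m consecutive repetitions|
|`{m,n}`|From m to n consecutive repetitions (greedy). Omitting m means 0; omitting n means no upper bound. Write it without a space after the comma|
|`{m,n}?`|Non-greedy version of `{m,n}`|
|`\`|Escapes special characters (`*`, `+`, `?`...); introduces special sequences (`\d`, `\w`, `\s`...)|
|`[]`|A set of characters. Characters can be listed individually, e.g. `[amk]`; a range can be given with "-", e.g. `[0-9]`; special characters lose their special meaning inside a set; special sequences such as `\w` are accepted, although what they match still differs between Unicode str and 8-bit patterns; `^` as the first character complements the set, and has no special meaning anywhere else|
|`a\|b`|Matches a or b. Alternatives are tried from left to right, and once one matches the rest are not tried; can be used inside groups; never greedy|
|`(...)`|Matches whatever RE is inside the parentheses and marks the start and end of a group. The group's contents can be retrieved after the match and matched again later in the pattern with `\number`|
|`(?...)`|Extension notation: the first character after the `?` determines the meaning and syntax of the construct. Extensions usually do not create a new group; `(?P<name>...)` is the exception. The supported extensions follow ⬇︎|
|`(?aiLmsux)`|One or more letters, each setting the corresponding flag: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), re.X (verbose)|
|`(?:...)`|Non-capturing version of regular parentheses: matches whatever RE is inside, but the matched substring cannot be retrieved afterwards|
|`(?aiLmsux-imsx:...)`|Zero or more letters from aiLmsux, optionally followed by "-" and one or more letters from imsx; sets or removes those flags for this part of the expression only|
|`(?P<name>...)`|Like regular parentheses, but the group's contents can also be retrieved by name. Named groups can be referenced in three contexts: ☞ in the same pattern itself ☞ when processing a match object ☞ in the *repl* string argument of re.sub()|
|`(?P=name)`|Backreference to a named group: matches whatever text the earlier group called name matched|
|`(?#...)`|A comment; the contents are ignored|
|`prev(?=next)`|Matches prev only if it is followed by next (lookahead assertion)|
|`prev(?!next)`|Matches prev only if it is not followed by next (negative lookahead assertion)|
|`(?<=prev)next`|Matches next only if it is preceded by prev (lookbehind assertion)|
|`(?<!prev)next`|Matches next only if it is not preceded by prev (negative lookbehind assertion)|
|`(?(id/name)yes-pattern\|no-pattern)`|Tries yes-pattern if the group with the given id or name matched, otherwise no-pattern (no-pattern is optional)|

### Special sequences

| Pattern | Match |
| :--------: | -------- |
| `\d` | A digit; same as `[0-9]` in ASCII mode |
| `\D` | A non-digit; same as `[^0-9]` in ASCII mode |
| `\w` | A word character; same as `[a-zA-Z0-9_]` in ASCII mode |
| `\W` | A non-word character; same as `[^a-zA-Z0-9_]` in ASCII mode |
| `\s` | A whitespace character; same as `[ \t\n\r\f\v]` in ASCII mode |
| `\S` | A non-whitespace character; same as `[^ \t\n\r\f\v]` in ASCII mode |
| `\b` | A word boundary (the empty string at the start or end of a word) |
| `\B` | Not a word boundary |
| `\Z` | The end of the string only, not before a trailing newline |

## Module Contents

* The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full-featured methods for compiled REs, provided for convenience.
    - re.compile(pattern, flags=0)
        - **Compiles an RE pattern into an RE object, which is more efficient when the pattern is reused in a program**
        >>> flags values can be any of the following variables, combined using bitwise OR (the | operator). [re.A, re.DEBUG, re.X...]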
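        ```
        prog = re.compile(pattern)
        result = prog.match(string)
        ```
        is equivalent to
        ```
        result = re.match(pattern, string)
        ```
    - re.search(pattern, string, flags=0)
        - **Scans through the whole string, finds the 'first' location where the pattern matches, and returns a match object; returns `None` if there is none**
    - re.match(pattern, string, flags=0)
        - **If the 'beginning' of the string matches the pattern, returns a match object; otherwise returns `None`**
    - re.fullmatch(pattern, string, flags=0)
        - **If the 'whole' string matches the pattern, returns a match object; otherwise returns `None`**
    - re.split(pattern, string, maxsplit=0, flags=0)
        - **Splits the string by occurrences of the pattern. If capturing parentheses are used in the pattern, the text matched by the groups is also returned in the list.**
    - re.findall(pattern, string, flags=0)
        - **Returns all non-overlapping matches as a 'list of strings'; returns an empty list if there are none. Empty matches are included in the result**
    - re.finditer(pattern, string, flags=0)
        - **Returns an iterator that yields a 'match object' for every non-overlapping match; it yields nothing if there are none. Empty matches are included in the result**
    - re.sub(pattern, repl, string, count=0, flags=0)
        - **Much like the string method replace(); repl can be a string or a function**
    - re.escape(pattern)
        - **Escapes the special characters in pattern; handy when the literal string you want to match contains metacharacters**

## Regular Expression Objects

* Compiled RE objects support the following methods and attributes
    - Pattern.search(string[, pos[, endpos]])
        - **Scans through the whole string, finds the 'first' location where the pattern matches, and returns a match object; returns `None` if there is none**

## Match Objects

* A match object always has the boolean value `True`, while a failed match returns `None`, so an if statement can test whether there was a match
```
match = re.search(pattern, string)
if match:
    process(match)
```
* Match objects support the following methods and attributes
    - Match.expand(template)
        - Returns the string obtained by doing backslash substitution on template, as sub() does; `\1`, `\g<1>` and `\g<name>` are replaced by the contents of the corresponding group
    - Match.group([group1,...])
        - Returns one or more subgroups of the match
        - With a single argument, returns one string; with several arguments, returns a tuple with one item per argument
        - If a group number is negative or larger than the number of groups in the pattern, IndexError is raised
        - If the RE uses the (?P<name>...) syntax, the arguments may also be strings that identify groups by name
          `m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "HungYi Yu")`
          `m.group('first_name')`
        - If a group matches several times, only the last match is accessible
    - Match.\_\_getitem__(g)
        - Same as m.group(g), but easier to write
          `m = re.match(r"(\w+) (\w+)", "HungYi Yu")`
          `m[0]`
    - Match.groups(default=None)
        - Returns all subgroups as a tuple
    - Match.groupdict(default=None)
        - Returns all named subgroups as a dictionary
    - Match.start([group]) / Match.end([group])
        - Returns the indices of the start and end of the substring matched by group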
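
The match object methods listed above can be tried together in one short sketch. The pattern and the sample name are only illustrative, and the variable names are made up for this note.

```python
import re

# Illustrative pattern with two named groups; the input string is a made-up sample.
pattern = re.compile(r"(?P<first_name>\w+) (?P<last_name>\w+)")
m = pattern.match("HungYi Yu")

whole = m.group(0)                  # group 0 is the entire match
first = m.group("first_name")       # look a group up by name
pair = m.group(1, 2)                # several arguments give a tuple
last = m["last_name"]               # __getitem__, same as m.group("last_name")
all_groups = m.groups()             # every subgroup as a tuple
named = m.groupdict()               # named subgroups as a dict
span_last = (m.start("last_name"), m.end("last_name"))
swapped = m.expand(r"\g<last_name>, \g<first_name>")  # template substitution
```

Here `span_last` is `(7, 9)` because "Yu" starts at index 7, and `swapped` is `"Yu, HungYi"`.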
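
The functions from the Module Contents section behave differently on the same input, which is easier to see side by side. This is a minimal sketch, assuming a made-up sample string and an illustrative pattern for dash-separated numbers; it is not the only way to write such a pattern.

```python
import re

# Made-up sample text; the pattern matches digit runs joined by dashes.
text = "tel: 0912-345-678, fax: 02-1234-5678"
number = re.compile(r"\d+(?:-\d+)+")   # raw string, compiled once and reused

first = number.search(text)            # first match anywhere in the string
at_start = number.match(text)          # None: the string starts with "tel"
whole = number.fullmatch("02-1234-5678")  # the entire string must match

all_numbers = number.findall(text)     # list of matched strings
spans = [m.span() for m in number.finditer(text)]  # match objects, one per hit

parts = re.split(r",\s*", text)            # separators are dropped
parts_with_sep = re.split(r"(,)\s*", text)  # captured separator is kept

# repl as a function: replace every number with "#" of the same length.
masked = number.sub(lambda m: "#" * len(m.group(0)), text)

# re.escape makes a literal string safe to use as a pattern.
literal = re.escape("1+1=2?")
literal_match = re.fullmatch(literal, "1+1=2?")
```

Note that `findall` returns whole matches here only because the group is non-capturing; with a capturing group it would return the group contents instead.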
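
Several rows of the special characters table (greedy vs. non-greedy quantifiers, lookaround assertions, named backreferences, and the re.M / re.S flags) are easier to remember with a concrete run. The sketch below uses made-up sample strings throughout.

```python
import re

# All sample strings below are made up for illustration.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)

money = "10 USD, 20 EUR, 30 USD"
# Lookahead: digits only when followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", money)
# Negative lookahead; \b stops the engine from backtracking into a partial number.
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", money)

order = "cost $15, qty 3"
# Lookbehind: digits only when preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", order)
# Negative lookbehind: whole numbers not preceded by "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", order)

# Named backreference: the same word twice in a row.
repeated = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it is is fine")

lines = "one\ntwo"
first_words_default = re.findall(r"^\w+", lines)          # ^ only at string start
first_words_multiline = re.findall(r"^\w+", lines, re.M)  # ^ at every line start
dot_default = re.search(r"one.two", lines)                # . does not match "\n"
dot_dotall = re.search(r"(?s)one.two", lines)             # inline flag, same as re.S
```

Without the `\b` anchors, the negative lookahead pattern would still match the "1" of "10", since the engine is free to shorten `\d+` until the assertion passes.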