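# [Python Notes] Regular Expressions

[TOC]

Thanks for clicking into this post. This is the first article in my (LukeTseng's) Python notes series. It mainly records my own learning path, and everything here is for personal study, so take it with a grain of salt. If you spot a mistake anywhere, please point it out. Thank you.

## Introduction

This thing goes by many names. In Chinese alone it gets translated several different ways (正規表達式, 正規表示法, 規則運算式, 常規表示法, and so on), and I personally tend to call it 正規運算式. Whatever you call it, the name is Regular Expression (RE or regex for short), and from here on I will just say regex.

Regex is a syntax (or notation) for describing and matching patterns in strings. It is usually embedded in a host programming language, and our beloved Python is one of them. Regex is very powerful for processing and parsing strings, which is why it is worth learning.

You might think: Notepad already has search and replace, so why bother with regex? Because regex does more than find or replace fixed text. It matches by pattern, so it can pick out particular characters or strings by their shape, for example anything that looks like an email address or a phone number.

## Using regex in Python

```python
import re
```

The `re` module is part of Python's standard library, so importing it is all you need.

## Compiling a regex

A pattern has to be compiled before it can be matched. You can do this explicitly with `re.compile()`, which returns a reusable Pattern object; the module-level functions such as `re.search()` compile the pattern for you behind the scenes. The most basic form looks like this:

```python=
import re

p = re.compile('ab*')
```

:::info
It is a good idea to prefix regex strings with `r` to make them raw strings, so that Python does not treat the backslash `\` as an escape character.
:::

The `'ab*'` above means "**exactly one letter a**" followed by "**zero or more letter b's**".

## Character Classes

A character class matches any single character from a given set and is written inside square brackets `[]`. Common ones:

- `[abc]`: matches a, b, or c.
- `[a-z]`: matches any lowercase ASCII letter.
- `[A-Z]`: matches any uppercase ASCII letter.
- `[0-9]`: matches any ASCII digit.
- `[^abc]`: matches any character except a, b, or c (negation).

## Special sequences

* `\d`: matches any digit. For `str` patterns this means any Unicode decimal digit; it is the same as `[0-9]` only when the `re.ASCII` flag is set.
* `\D`: matches any non-digit character.
* `\w`: matches any word character (letter, digit, or underscore). For `str` patterns this includes Unicode letters such as Chinese characters; it is the same as `[a-zA-Z0-9_]` only when the `re.ASCII` flag is set.
* `\W`: matches any character that is not a word character.
* `\s`: matches any whitespace character (space, tab, newline, and so on).
* `\S`: matches any non-whitespace character.

## Quantifiers

Quantifiers specify how many times the preceding element (a character, class, or group) should occur:

* `*`: 0 or more times.
* `+`: 1 or more times.
* `?`: 0 or 1 time.
* `{n}`: exactly n times.
* `{n,}`: at least n times.
* `{n,m}`: between n and m times.

## Anchors

Anchors specify where in the string a match must occur:

* `^`: matches the start of the string (with `re.MULTILINE`, also the start of each line).
* `$`: matches the end of the string, or just before a newline at the very end of the string.
* `\b`: matches a word boundary.
* `\B`: matches a position that is not a word boundary.

## Common methods

| Method | Syntax | Description | Return value |
| :-- | :-- | :-- | :-- |
| `re.compile()` | `re.compile(pattern, flags=0)` | Compiles a regex into a pattern object that can be reused, which helps performance | Pattern object |
| `re.match()` | `re.match(pattern, string, flags=0)` | Matches at the **start** of the string only | Match object or None |
| `re.search()` | `re.search(pattern, string, flags=0)` | Scans the whole string for the **first** position where the pattern matches | Match object or None |
| `re.findall()` | `re.findall(pattern, string, flags=0)` | Finds **all** non-overlapping matches | List of matched strings (group contents if the pattern has capture groups) |
| `re.finditer()` | `re.finditer(pattern, string, flags=0)` | Finds all matches and yields them lazily | Iterator of Match objects |
| `re.sub()` | `re.sub(pattern, repl, string, count=0, flags=0)` | Replaces every match with the given string; `count` limits the number of replacements | New string after replacement |
| `re.subn()` | `re.subn(pattern, repl, string, count=0, flags=0)` | Like `re.sub()`, but also reports the number of replacements | Tuple of (new string, number of replacements) |
| `re.split()` | `re.split(pattern, string, maxsplit=0, flags=0)` | Splits the string wherever the pattern matches; `maxsplit` limits the number of splits | List of substrings |
| `re.fullmatch()` | `re.fullmatch(pattern, string, flags=0)` | Checks whether the **entire string** matches the pattern | Match object or None |

### 1. `re.match()` example

`re.match()` tries to match the pattern at the beginning of the string. If the beginning matches, it returns a Match object; otherwise it returns None.

In the code below, the `^` in the pattern is an anchor for the start of the string, so `text` must begin with `Hello` or there is no match. Because `re.match()` already matches only at the start of the string, the `^` here is redundant, but it makes the intent explicit.

:::info
When `re.match()` or `re.search()` succeeds, it returns a Match object. Three of its most commonly used methods are:

1. `match.group()`: returns the matched text as a string.
2. `match.start()`: returns the index where the match starts.
3. `match.end()`: returns the index where the match ends (index of the last matched character + 1).
:::

```python=
import re

text = "Hello, world!"
pattern = r'^Hello'  # check whether the string starts with "Hello"

match = re.match(pattern, text)
if match:
    print("The string starts with 'Hello'")
    print("Matched:", match.group())  # Hello
else:
    print("No match")
```

Output:

```
The string starts with 'Hello'
Matched: Hello
```

### 2. `re.search()` example

```python=
import re

text = 'Hello World!, this is my first program for Python!'
m = re.search(r'Python', text)
if m:
    print('Found:', m.group())  # Found: Python
    print('Start:', m.start())  # Start: 43
    print('End:', m.end())      # End: 49
else:
    print('Not found')
```

Output:

```
Found: Python
Start: 43
End: 49
```

### 3. `re.findall()` example

`\d` stands for any single digit character (for ASCII text, the same as `[0-9]`). Adding `+` after `\d` means the preceding pattern must occur at least once.

Put together, `\d+` reads as "one or more digits in a row".

`re.findall()` returns a list of the matched strings.

```python=
import re

string = "My phone number is 0987654321, and my friend's phone number is 0912345678"
regex = r'\d+'

matches = re.findall(regex, string)
print(matches)  # ['0987654321', '0912345678']
```

Output:

```
['0987654321', '0912345678']
```

### 4. `re.sub()` example

`sub` is short for substitute, meaning it replaces the matched text. The example below replaces every `Python` with `C++`.

```python=
import re

text = "I like Python because Python is so powered."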
result = re.sub(r'Python', 'C++', text)
print(result)  # I like C++ because C++ is so powered.
```

Output:

```
I like C++ because C++ is so powered.
```

`re.sub()` also has a sibling called `re.subn()`. The two are called in exactly the same way; the only difference is that `re.subn()` returns a tuple whose first item is the resulting string and whose second item is the number of replacements. Reusing the example above:

```python=
import re

text = "I like Python because Python is so powered."
result = re.subn(r'Python', 'C++', text)
print(result)  # ('I like C++ because C++ is so powered.', 2)
```

Output:

```
('I like C++ because C++ is so powered.', 2)
```

### 5. `re.split()` example

`re.split()` splits a string wherever the pattern matches.

The example below splits on any of `,`, `;`, or `:`, doing the same job as the built-in `str.split()`. The difference is that `str.split()` can only split on one fixed separator string, not on a pattern.

```python=
import re

text = "apple,banana;orange:grape"
result = re.split(r'[,;:]', text)
print(result)  # ['apple', 'banana', 'orange', 'grape']
```

Output:

```
['apple', 'banana', 'orange', 'grape']
```

## Some examples of character classes, special sequences, quantifiers, and anchors

Now that the methods have been covered, let's play with a few specific examples.

`[^0-9]` matches any character other than 0 to 9:

```python=
import re

print(re.findall(r'[^0-9]', "abc123"))  # ['a', 'b', 'c']
```

Output:

```
['a', 'b', 'c']
```

`\w+` matches one or more word characters (letters, digits, or underscore). In Python 3, `\w` is Unicode-aware for `str` patterns, so it also matches the Chinese characters in this sample string:

```python=
import re

print(re.findall(r'\w+', "他說了 *** 在某種語言中"))  # ['他說了', '在某種語言中']
```

Output:

```
['他說了', '在某種語言中']
```

`*`, `?`, and `{n,m}` examples:

```python=
import re

# * matches 0 or more 'b'
print(re.findall(r'ab*', "a ab abb abbb"))  # ['a', 'ab', 'abb', 'abbb']

# ? matches the preceding character 0 or 1 time
print(re.findall(r'colou?r', "color and colour"))  # ['color', 'colour']

# {n,m} matches 2 to 4 digits
print(re.findall(r'\d{2,4}', "1 22 333 4444 55555"))  # ['22', '333', '4444', '5555']
```

Output:

```
['a', 'ab', 'abb', 'abbb']
['color', 'colour']
['22', '333', '4444', '5555']
```

`$` and `\b` examples:

```python=
import re

# $ matches at the end of the string "Hello World"
print(re.search(r'World$', "Hello World"))  # <re.Match object; span=(6, 11), match='World'>
print(re.search(r'Hello$', "Hello World"))  # None

# \b matches a word boundary
text = "cat cats caterpillar"
print(re.findall(r'\bcat\b', text))  # ['cat']
print(re.findall(r'cat', text))      # ['cat', 'cat', 'cat']
```

## Some small applications

### 1. Email validation

```python=
import re

def validate_email(email):
    # Simplified pattern: local part, "@", domain, and a TLD of 2+ letters.
    # It does not cover every address that the email standards allow.
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

print(validate_email("user@example.com"))   # True
print(validate_email("invalid.email"))      # False
print(validate_email("user@domain.co.uk"))  # True
```

### 2. Phone number extraction

```python=
import re

text = """
Contact Information:
Office phone number: 02-1234-5678
phone number: 0912-345-678
home phone number: (02)8765-4321
"""

# Taiwanese phone number formats
patterns = [
    r'\d{2}-\d{4}-\d{4}',     # 02-1234-5678
    r'\d{4}-\d{3}-\d{3}',     # 0912-345-678
    r'\(\d{2}\)\d{4}-\d{4}'   # (02)8765-4321; literal parentheses must be escaped
]

for pattern in patterns:
    matches = re.findall(pattern, text)
    print(f"Found: {matches}")
```

Output:

```
Found: ['02-1234-5678']
Found: ['0912-345-678']
Found: ['(02)8765-4321']
```

### 3. Data extraction

```python=
import re

# Extract all links from page content
html_content = """
<a href="https://example.com">Example</a>
<a href="https://test.com">Test</a>
<a href="/local/path">Local link</a>
"""

# The capture group (...) makes findall return only the URL part
urls = re.findall(r'href=["\']([^"\']+)["\']', html_content)
print(urls)  # ['https://example.com', 'https://test.com', '/local/path']

# Extract prices
text = "Item A: $1,299; Item B: $599; Item C: $2,999"
prices = re.findall(r'\$[\d,]+', text)
print(prices)  # ['$1,299', '$599', '$2,999']
```

Output:

```
['https://example.com', 'https://test.com', '/local/path']
['$1,299', '$599', '$2,999']
```

### 4. Password strength check

```python=
import re

def check_password_strength(password):
    # At least 8 characters
    if len(password) < 8:
        return False, "Password must be at least 8 characters long"

    # At least one uppercase letter
    if not re.search(r'[A-Z]', password):
        return False, "At least one uppercase letter is required"

    # At least one lowercase letter
    if not re.search(r'[a-z]', password):
        return False, "At least one lowercase letter is required"

    # At least one digit
    if not re.search(r'\d', password):
        return False, "At least one digit is required"

    # At least one special character
    if not re.search(r'[!@#$%^&*(),.?":{}|<>]', password):
        return False, "At least one special character is required"

    return True, "Password strength is good"

print(check_password_strength("weak"))        # fails the length check
print(check_password_strength("Strong123!"))  # passes every check
```

Output:

```
(False, 'Password must be at least 8 characters long')
(True, 'Password strength is good')
```

## Online regex debugging tools

These are easy to find online. Here are two I recommend:

- regex101.com
- regexr.com

Type in your regex along with some test text, and the site shows which parts of the text are matched. They are handy for trying things out when you are designing a complicated regex.

## Summary

Basic syntax:

1. Character classes: `[abc]`, `[a-z]`, `[0-9]`, and so on match one character from a given set.
2. Special sequences: `\d` (digit), `\w` (letter, digit, or underscore), `\s` (whitespace).
3. Quantifiers: `*` (0 or more), `+` (1 or more), `?` (0 or 1), and `{n,m}` control how many times something occurs.
4. Anchors: `^` (start of string), `$` (end of string), `\b` (word boundary), and other position markers.

Common methods:

| Method | Syntax | Description | Return value |
| :-- | :-- | :-- | :-- |
| `re.compile()` | `re.compile(pattern, flags=0)` | Compiles a regex into a pattern object that can be reused, which helps performance | Pattern object |
| `re.match()` | `re.match(pattern, string, flags=0)` | Matches at the **start** of the string only | Match object or None |
| `re.search()` | `re.search(pattern, string, flags=0)` | Scans the whole string for the **first** position where the pattern matches | Match object or None |
| `re.findall()` | `re.findall(pattern, string, flags=0)` | Finds **all** non-overlapping matches | List of matched strings (group contents if the pattern has capture groups) |
| `re.finditer()` | `re.finditer(pattern, string, flags=0)` | Finds all matches and yields them lazily | Iterator of Match objects |
| `re.sub()` | `re.sub(pattern, repl, string, count=0, flags=0)` | Replaces every match with the given string; `count` limits the number of replacements | New string after replacement |
| `re.subn()` | `re.subn(pattern, repl, string, count=0, flags=0)` | Like `re.sub()`, but also reports the number of replacements | Tuple of (new string, number of replacements) |
| `re.split()` | `re.split(pattern, string, maxsplit=0, flags=0)` | Splits the string wherever the pattern matches; `maxsplit` limits the number of splits | List of substrings |
| `re.fullmatch()` | `re.fullmatch(pattern, string, flags=0)` | Checks whether the **entire string** matches the pattern | Match object or None |
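
## Extra: `re.finditer()` and `re.fullmatch()` with a compiled pattern

The method table lists `re.compile()`, `re.finditer()`, and `re.fullmatch()`, but the numbered examples do not demonstrate them. Here is a minimal sketch of how they might be used together. The date pattern, the sample text, and the names `DATE`, `find_dates`, and `is_date` are made up for illustration and are not from the original notes.

```python
import re

# Hypothetical pattern for illustration: text shaped like YYYY-MM-DD.
# Compiled once so the same Pattern object can be reused below.
DATE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

def find_dates(text):
    """Return (matched text, start index) for every date-like substring."""
    # finditer yields Match objects, so position info is available,
    # unlike findall, which returns only strings or group contents.
    return [(m.group(), m.start()) for m in DATE.finditer(text)]

def is_date(s):
    """Return True only if the entire string has the YYYY-MM-DD shape."""
    return DATE.fullmatch(s) is not None

log = "Released 2024-01-15, patched 2024-03-02."
print(find_dates(log))           # [('2024-01-15', 9), ('2024-03-02', 29)]
print(is_date("2024-01-15"))     # True
print(is_date("on 2024-01-15"))  # False
```

Note that this only checks the shape of the text. Something like `2024-13-99` would still pass `is_date()`, so a real date check would need extra validation beyond the regex.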