12. Regular Expressions

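# Regular Expressions

[![hackmd-github-sync-badge](https://hackmd.io/EWQ1ZAAhStOSWzieyx_TLQ/badge)](https://hackmd.io/EWQ1ZAAhStOSWzieyx_TLQ)

![](https://hackmd.io/_uploads/BkxSfb_u3.png)

## 1. Introduction to Regular Expressions

A regular expression (often abbreviated regex, regexp, or RE) is a syntax for describing string patterns. It uses a set of agreed-upon symbols as a template for matching the string format you want, for example to find data that looks like an email address or a mobile phone number. Many programming languages and text editors support regular expressions, which makes them a powerful, general-purpose tool for text pattern matching.

:::info
Online RE tester: https://regex101.com/
This site lets you test a pattern interactively and see how each match is produced.
:::

## 2. Basic Regex Syntax

![image](https://hackmd.io/_uploads/HyBDHlNlC.png)

### 2.1 Character Classes & Metacharacters

- `.` matches any single character except a newline.
- `\d` matches a single digit, equivalent to `[0-9]` for ASCII text.
- `\w` matches a single word character (a-z, A-Z, 0-9, and the underscore `_`).
- `\s` matches a single whitespace character (space, Tab, and so on).
- `\D`, `\W`, `\S` are the negations of `\d`, `\w`, `\s`. For example, `\D` matches any non-digit character.

In Python 3, `\d` and `\w` on `str` patterns also match Unicode digits and letters unless the `re.ASCII` flag is set.

**Examples:**

- `.at` matches "cat", "hat", "rat", and so on.
- `\d\d\d` matches three digits.
- `\w\w\w` matches three word characters.

#### Character Sets []

- `[]` character set: matches any one character listed inside the brackets. For example, `[abc]ar` matches "aar", "bar", "car".
- `[0-9]`, `[a-z]`: a hyphen inside the brackets denotes a range, so these match 0-9 or a-z (case-sensitive).
- `[^A-Z]`: a `^` at the start of the brackets negates the set, so this matches anything except A-Z.

**Examples:**

- `[cab]ar` matches "car", "bar", "aar".
- `[^0-9]` matches any non-digit character.
- `^c` matches a string that starts with 'c'. Outside brackets, `^` is an anchor, not a negation.

### 2.2 Quantifiers

A quantifier sets how many times the pattern immediately before it may occur:

- `?`: the preceding item is optional (0 or 1 times).
- `*`: the preceding item may occur any number of times (including 0).
- `+`: the preceding item must occur at least once.
- `{n}`: the preceding item repeats exactly n times.
- `{n,}`: the preceding item repeats at least n times.
- `{m,n}`: the preceding item repeats at least m and at most n times.

**Examples:**

- `colou?r`: matches "color" and "colour".
- `\d{3,}`: matches 3 or more digits.
- `\d*`: matches any number of digits (including none).
- `a+b`: matches "ab", "aaab", "aaaaaab", and so on.
- `\w{3}`: matches 3 consecutive word characters.
- `\d{2,4}`: matches 2 to 4 digits.

#### Greedy vs. Non-Greedy Matching

By default, quantifiers such as `*` and `+` are "greedy": they match the **longest** possible string. Appending a `?` (for example `*?` or `+?`) makes them "non-greedy", so they match the **shortest** possible string instead.

**Examples:**

- `This.*test` (greedy): in "This is a test. This is only a test.", the match runs from the first "This" to the last "test" (everything except the final period), because `.*` consumes as many characters as possible.
    - Result: ['This is a test. This is only a test']
- `This.*?test` (non-greedy): in the same string, it matches "This is a test" and "This is only a test" separately, because `.*?` stops as soon as it reaches the first "test", and the search then continues for the next match.
    - Result: ['This is a test', 'This is only a test']

![image](https://hackmd.io/_uploads/SkS6_JjQ1l.png)

### 2.3 Anchors & Boundaries

- `^` at the start of a pattern means the match must begin at the **first position** of the string.
- `$` at the end of a pattern means the match must end at the **last position** of the string (in Python, `$` also matches just before a single trailing newline).
- `\b` word boundary: `\b` matches the position between a word character (`\w`) and a non-word character (`\W`). It ensures that what you match is a standalone word.

**Examples:**

- `^Hello`: matches a string that starts with "Hello".
- `world$`: matches a string that ends with "world".
- `^Hello world$`: matches exactly the string "Hello world", with no extra spaces or other characters before or after it.
- `\bcat\b`: matches only the standalone word "cat", not "caterpillar" or "tomcat".

### 2.4 Grouping & Alternation

* `|`: OR. Matches either of two patterns.
* `()`: capturing group. Treats the pattern inside the parentheses as one unit and "captures" the matched text so that it can be extracted later.
* `(?:...)`: non-capturing group. Groups part of a pattern just like `()`, but does not store the matched text as an extractable group.

**Examples:**

* `c(at|ow)` matches "cat" or "cow".
* `(\d{4})-(\d{6})` applied to the phone number 0936-279195 matches the whole "0936-279195" and lets you extract "0936" (group 1) and "279195" (group 2).
* `\b(?:\d{1,3}\.){3}\d{1,3}\b` (matches an IP address). Here `(?:...)` is used only to repeat the "digits plus dot" unit 3 times, without separately capturing parts such as `192.` or `168.`.

:::success
**Difference between () and []**
- `[]` character set: matches any single character in the set. For example, `[abc]` matches just one character: 'a', 'b', or 'c'.
- `()` group: treats a pattern as one unit. For example, `(abc)` matches the consecutive string "abc" and captures it as a group.
:::

### 2.5 Lookaround Assertions (Advanced)

A lookaround is a "zero-width assertion": it checks whether the text to the left or right of the current position fits a pattern, but it does not consume any characters. It only tests a condition.

* `(?=...)`: positive lookahead. Checks that the text to the right of the current position matches the pattern.
* `(?!...)`: negative lookahead. Checks that the text to the right of the current position does not match the pattern.
* `(?<=...)`: positive lookbehind. Checks that the text to the left of the current position matches the pattern. (Not covered here; treated as an advanced topic.)
* `(?<!...)`: negative lookbehind. Checks that the text to the left of the current position does not match the pattern. (Not covered here; treated as an advanced topic.)

**Examples:**

- Positive lookahead: validating password strength with `^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$`. It combines several conditions:
    1. `(?=.*[a-z])`: looking ahead from the start of the string, there must be a lowercase letter.
    1. `(?=.*[A-Z])`: looking ahead from the start of the string, there must be an uppercase letter.
    1. `(?=.*\d)`: looking ahead from the start of the string, there must be a digit.
    1. `[a-zA-Z\d]{8,}$`: if all three conditions hold, actually match and consume at least 8 letters or digits, up to the end of the string.
- Negative lookahead: finding file names that do not end in .txt with `\b\w+\.(?!txt$)\w+\b`. This pattern matches a word followed by a dot `.`, where the dot must not be followed by "txt" at the very end of the string. For example, in "file.doc, report.pdf, notes.txt" it matches "file.doc" and "report.pdf" but not "notes.txt". Because of the `$`, a `.txt` name in the middle of the string would still match; use `(?!txt\b)` to reject it everywhere.

## 3. Regex in Python

Python implements regular expressions in the built-in `re` module. In Python it is recommended to write RE patterns as raw strings, `r'...'`, so that the backslash `\` is not interpreted as a string escape character.

### 3.1 Finding a Single Match

The following functions look for one match in a string:

- `search()`: scans the whole string and returns the first successful match.
- `match()`: matches only at the beginning of the string, and fails if the beginning does not fit.
- `fullmatch()`: checks whether the entire string matches the pattern.

On success they return a Match object (discussed in the next section); on failure they return `None`. Because a failed match returns `None`, check the result before using it: `if match` or `if match is not None`.

**Example: the difference between search, match, and fullmatch**

```python=
import re

# Sample string
text = "哈囉，世界！這是一個範例文字。"

# re.match() checks whether the string starts with "哈囉"
match_obj = re.match(r'哈囉', text)
if match_obj:
    print("re.match() matched:", match_obj.group())
    print(f"Start: {match_obj.start()}, end: {match_obj.end()}")
else:
    print("re.match() found no match")

# re.search() looks for "世界" anywhere in the string
search_obj = re.search(r'世界', text)
if search_obj:
    print("re.search() matched:", search_obj.group())
    print(f"Start: {search_obj.start()}, end: {search_obj.end()}")
else:
    print("re.search() found no match")

# re.fullmatch() must match the entire string
fullmatch_obj = re.fullmatch(r'哈囉，世界！這是一個範例文字。', text)
if fullmatch_obj:
    print("re.fullmatch() matched:", fullmatch_obj.group())
    print(f"Start: {fullmatch_obj.start()}, end: {fullmatch_obj.end()}")
else:
    print("re.fullmatch() found no match")
```

Output:

```
re.match() matched: 哈囉
Start: 0, end: 2
re.search() matched: 世界
Start: 3, end: 5
re.fullmatch() matched: 哈囉，世界！這是一個範例文字。
Start: 0, end: 15
```

:::info
**Difference between re.match() and re.search()**
`re.match()` matches only at the beginning of the string: if the beginning does not fit the regular expression, the match fails and the function returns `None`. `re.search()` scans the whole string and succeeds as soon as it finds one match; it returns `None` only when nothing in the string matches.
:::

### 3.2 Match Object & Group Extraction

When one of the three functions in the previous section succeeds, it returns a Match object. Its methods give access to more information:

- `match.group(0)` or `match.group()`: the full matched text.
- `match.group(1)`, `match.group(2)`, ...: the text of the first, second, ... `()` group.
- `match.groups()`: the text of all groups, as a tuple.
- `match.start()` / `match.end()`: the start and end index of the matched substring.

**Example: extracting the user name and domain of an email address**

```python=
import re

text = "聯絡信箱是 username@example.com，請盡快回覆。"

# Two () groups capture the parts before and after the @
pattern = r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'

result = re.search(pattern, text)

if result:
    print(f"Full match (group): {result.group()}")
    print(f"Full match (group(0)): {result.group(0)}")
    print("-" * 20)
    print(f"First group (group(1)): {result.group(1)}")   # user name
    print(f"Second group (group(2)): {result.group(2)}")  # domain
    print("-" * 20)
    print(f"All groups (groups): {result.groups()}")
    print(f"Match span: from {result.start()} to {result.end()}")
```

Output:

```
Full match (group): username@example.com
Full match (group(0)): username@example.com
--------------------
First group (group(1)): username
Second group (group(2)): example.com
--------------------
All groups (groups): ('username', 'example.com')
Match span: from 6 to 26
```

:::success
**Advanced tip: named groups**
When a pattern has many groups, referring to them by number (`group(1)`, `group(2)`) can become confusing. The `(?P<name>...)` syntax gives a group a name, which you can then use to extract it, making the code easier to read.

```python=
import re

text = "我的電話是 0912-345678"
pattern = r'(?P<area_code>\d{4})-(?P<number>\d{6})'

match = re.search(pattern, text)
if match:
    # Extract groups by name
    print(f"Area code: {match.group('area_code')}")
    print(f"Number: {match.group('number')}")
```

Output:

```
Area code: 0912
Number: 345678
```
:::

### 3.3 Finding All Matches

`re.findall()` finds **all** non-overlapping matches in a string and returns them as a list. If the pattern contains capturing groups, the list holds the group contents instead of the full matches.

**Example 1: finding all email addresses**

```python=
import re

text = "My emails are test@test.com and hello@world.org"
match = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', text)
print(match)

for email in match:
    print(email)
```

Output:

```
['test@test.com', 'hello@world.org']
test@test.com
hello@world.org
```

**Example 2: greedy vs. non-greedy in practice**

RE quantifiers are greedy by default (they find the longest match). To get each shortest match instead, use the non-greedy form (append `?`).

```python=
import re

# Example 1: a text passage
text1 = "This is a test. This is only a test."
print(re.findall(r'This.*test', text1))   # greedy: ['This is a test. This is only a test']
print(re.findall(r'This.*?test', text1))  # non-greedy: ['This is a test', 'This is only a test']

# Example 2: HTML tags
html = "<p>First paragraph.</p><div>Some content.</div><p>Second paragraph.</p>"

# Greedy: .* runs from the first <p> all the way to the last </p>
greedy_matches = re.findall(r'<p>.*</p>', html)
print(f"Greedy: {greedy_matches}")
# Greedy: ['<p>First paragraph.</p><div>Some content.</div><p>Second paragraph.</p>']

# Non-greedy: .*? stops at the nearest </p>
non_greedy_matches = re.findall(r'<p>.*?</p>', html)
print(f"Non-greedy: {non_greedy_matches}")
# Non-greedy: ['<p>First paragraph.</p>', '<p>Second paragraph.</p>']
```

:::success
**Using re.finditer() for large files**
`re.finditer(pattern, string)` is similar to `findall`, but returns an iterator whose elements are Match objects. When processing large inputs, this uses less memory because the matches are produced one at a time.
:::

### 3.4 Substitution & Splitting

#### Replacing matched text

`re.sub(pattern, repl, string)` replaces every match of `pattern` with the new string `repl`.

**Example:**

```python=
import re

text = "My phone is 0936-279195, your phone is 0988-123456"
new_text = re.sub(r'\d{4}-\d{6}', '******', text)
print(new_text)
```

Output:

```
My phone is ******, your phone is ******
```

**Example: substitution with groups**

```python=
import re

text = "My phone is 0936-279195, your phone is 0988-123456"
pattern = r"(\d{4})-(\d{6})"
replacement = r"\1-####"  # \1 refers to the first captured group

result = re.sub(pattern, replacement, text)
print(result)  # My phone is 0936-####, your phone is 0988-####
```

**Example: `repl` can also be a callback function that processes each match:**

```python=
import re

text = "Their phone numbers are 0936-279195 and 0988-123456"

def replace_phone(match):
    # Keep the first four digits and mask the rest
    phone = match.group()
    return phone[:4] + "-******"

new_text = re.sub(r'\d{4}-\d{6}', replace_phone, text)
print(new_text)
```

Output:

```
Their phone numbers are 0936-****** and 0988-******
```

#### Splitting strings

`re.split(pattern, string)` splits the string wherever `pattern` matches.

**Example: splitting on whitespace**

```python
import re

text = "apple banana\tcherry"
print(re.split(r'\s+', text))  # ['apple', 'banana', 'cherry']
```

### 3.5 Changing Matching Behavior: Flags

Sometimes the matching rules need adjusting, for example to ignore case or to let `.` match newlines as well. This is what the `flags` parameter is for.

`flags` can be used in two places:

1. In `re.compile()`, to build a pattern object that carries specific matching rules.
2. Passed directly to functions such as `re.search()`, `re.findall()`, `re.sub()`, where it applies to that call only.

![flags](https://hackmd.io/_uploads/SJGecB6alx.png)

#### Common Flags

The most frequently used flags are:

**1. `re.IGNORECASE` (or `re.I`): ignore case**

Makes the pattern match letters regardless of case.

```python=
import re

text = "Hello, world. HELLO, PYTHON."

# Without the flag, lowercase "hello" matches nothing
print(re.findall(r'hello', text))  # []

# With re.IGNORECASE, "hello" is found in any case
print(re.findall(r'hello', text, flags=re.IGNORECASE))  # ['Hello', 'HELLO']
```

**2. `re.MULTILINE` (or `re.M`): multi-line mode**

By default, `^` and `$` match only at the start and end of the whole string. In multi-line mode, `^` and `$` also match at the start and end of every line.

```python=
import re

text = """first line
second line
third line"""

# Without the flag, ^ matches only at the start of the whole string
print(re.findall(r'^\w+', text))  # ['first']

# With re.MULTILINE, ^ matches at the start of every line
print(re.findall(r'^\w+', text, flags=re.MULTILINE))  # ['first', 'second', 'third']
```

**3. `re.DOTALL` (or `re.S`): dot (.) matches every character**

By default, `.` matches any character except the newline `\n`. With this flag, `.` matches **every** character, including newlines.

```python=
import re

text = "hello\nworld"

# Without the flag, . cannot match the newline \n
match1 = re.search(r'hello.world', text)
print(match1)  # None

# With re.DOTALL, . matches the newline as well
match2 = re.search(r'hello.world', text, flags=re.DOTALL)
print(match2.group())  # prints "hello" and "world" on two lines
```

#### Combining Multiple Flags

To use several flags at once, join them with `|` (the bitwise OR operator).

```python=
import re

text = "Start of line 1\nThis is the END"

# Ignore case (I), enable multi-line mode (M), and let . match newlines (S)
pattern = re.compile(r'^start.*end$', flags=re.I | re.M | re.S)

match = pattern.search(text)
if match:
    print("Matched across lines, ignoring case:")
    print(match.group())
```

Output:

```
Matched across lines, ignoring case:
Start of line 1
This is the END
```

| Flag            | Short  | Description                                                                      |
| --------------- | ------ | -------------------------------------------------------------------------------- |
| `re.IGNORECASE` | `re.I` | Case-insensitive matching.                                                       |
| `re.MULTILINE`  | `re.M` | Makes `^` and `$` apply to the start and end of every line.                      |
| `re.DOTALL`     | `re.S` | Makes `.` match any character, including the newline `\n`.                       |
| `re.VERBOSE`    | `re.X` | Allows whitespace and comments in the pattern, so complex REs are easier to read (advanced). |

### 3.6 Compiling Patterns

If a regular expression is used repeatedly, compile it once with `re.compile()` into a pattern object. This avoids looking the pattern up again on every call and gives the pattern a reusable name; the `re` module also caches recently used patterns, so the speed gain is usually modest. The compiled object offers `search()`, `findall()`, and the other matching methods.

```python=
import re

pattern = re.compile(r'\d{4}-\d{6}')

# Use the methods of the pattern object
match = pattern.search("Call me at 0936-279195")
print(match.group())

matches = pattern.findall("test 0936-123456 and 0977-456789")
print(matches)
```

Output:

```
0936-279195
['0936-123456', '0977-456789']
```

The second argument of `re.compile()` accepts `flags` to change the matching behavior; see **3.5 Changing Matching Behavior: Flags** above. Example: `pattern = re.compile(r'hello', flags=re.IGNORECASE)`

### 3.7 When to Use Which RE Function

* Avoid overly complex regular expressions.
* To find a single match, use `re.search()`.
* To find multiple matches, use `re.findall()`.
* To check only whether a string starts with a pattern, use `re.match()`.
* To replace the parts of a string that match a pattern, use `re.sub()`.
* To match one or more strings many times, first compile the pattern with `re.compile()` into a pattern object, then call its methods.

:::success
An overly complex pattern can make RE matching very slow. One example is the ReDoS (regular expression denial of service) risk: avoid patterns that malicious input can force into excessive backtracking.
:::

## 4. Practical Applications

### 4.1 Pattern Validation

**Example: validating an email address**

```python
import re

email = "test@example.com"
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
print(re.match(pattern, email))  # <re.Match object; span=(0, 16), match='test@example.com'>

# Test an invalid email
invalid_email = "test@.com"
print(re.match(pattern, invalid_email))  # None
```

**Example: validating a URL**

```python
import re

# The path part uses a single [...]* instead of a nested ([...]*)*,
# which accepts the same strings without the extra backtracking.
url_pattern = re.compile(r'^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})[/\w .-]*/?$')
print(url_pattern.match("https://www.example.com/path"))  # <re.Match object; ...>
print(url_pattern.match("ftp://example.com"))  # None
```

**Example: validating password strength**

```python
import re

# Uses ^, $, and (?=...) positive lookahead
pw_pattern = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$')

print(pw_pattern.match("Passw0rd123"))  # <re.Match object; ...>
print(pw_pattern.match("password123"))  # None (no uppercase letter)
print(pw_pattern.match("Password"))     # None (no digit)
print(pw_pattern.match("Pass123"))      # None (too short)
```

### 4.2 Text Extraction

**Example: extracting IP addresses**

```python=
import re

text = "Server IP: 192.168.1.1, Client IP: 10.0.0.255, Invalid: 999.999.999.999"

# Uses the \b word boundary and a (?:...) non-capturing group
ip_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
ips = ip_pattern.findall(text)
print(ips)  # ['192.168.1.1', '10.0.0.255', '999.999.999.999']
# The pattern checks only the format, so 999.999.999.999 is matched too;
# checking that each number is in 0-255 needs extra logic.
```

**Parsing HTML tags (simple cases)**

```python=
import re

html_text = """
<div class="member_name">
    <a href="/profile/john">John Doe</a>
</div>
<div class="member_info_title">職稱</div>
<div class="member_info_content">工程師</div>
<div class="member_contact">
    <a href="mailto:john@example.com">聯絡</a>
</div>
"""

# Extract the name
# [^"]+ is a negated character set: it matches any characters except a
# double quote, so it stops at the end of the href attribute value
name_pattern = re.compile(r'<div class="member_name">\s*<a href="[^"]+">([^<]+)</a>')
names = name_pattern.findall(html_text)

# Extract the job title (職稱)
title_pattern = re.compile(
    r'<div class="member_info_title">.*?職稱</div>\s*<div class="member_info_content">([^<]+)</div>'
)
titles = title_pattern.findall(html_text)

# Extract the email
email_pattern = re.compile(r'mailto:([^"]+)')
emails = email_pattern.findall(html_text)

print("Names:", names)
print("Titles:", titles)
print("Emails:", emails)
```

Output:

```
Names: ['John Doe']
Titles: ['工程師']
Emails: ['john@example.com']
```

### 4.3 Data Cleaning

**Example: removing extra whitespace**

Replace runs of whitespace with a single space and strip leading and trailing whitespace. This is a common data-cleaning step.

```python=
import re

messy_text = "  This   is   a    test .  "
clean_text = re.sub(r'\s+', ' ', messy_text).strip()
print(clean_text)  # 'This is a test .'
```

### 4.4 Date & URL Parsing

**Example: parsing dates (YYYY-MM-DD format)**

Extract dates in a standard format from text, useful for reports or data processing.

```python=
import re

date_text = "Meeting on 2023-10-14 and 2024-01-01, not 2023-13-01"
date_pattern = r'\d{4}-\d{2}-\d{2}'
dates = re.findall(date_pattern, date_text)
print(dates)  # ['2023-10-14', '2024-01-01', '2023-13-01']
# The invalid date 2023-13-01 is matched too: the RE checks only the format, not whether the date is valid
```

:::success
**A reminder about parsing HTML with RE**
Regular expressions can parse simple HTML, but the approach is very fragile. Small changes in the HTML structure, such as an extra space, a line break, reordered attributes, or nested tags, can make the RE fail. Parsing complex HTML with RE is therefore strongly discouraged in production projects. Use a dedicated HTML parsing package instead, for example:
- BeautifulSoup: friendly syntax and easy to learn; suitable for beginners and rapid development.
- lxml: implemented in C and very fast; supports XPath and CSS selectors; suitable for large documents.
:::

## 5. After-Class Exercises

Practice the following on https://regex101.com/ :

1. Use an RE to validate the Taiwanese mobile number format (for example: 0912-345678) and extract the last 4 digits.
2. Use an RE to extract the URLs in a string (for example: https://www.google.com.tw, http://tw.yahoo.com).
3. Use an RE to extract IP:Port pairs (for example: 140.128.73.11:8080, 120.108.9.37:4123).

![image](https://hackmd.io/_uploads/B1e3lroplg.png)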
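
One possible way to approach the three exercises above is sketched below. It is not the only answer, and it rests on assumptions: the mobile format is taken to be exactly `09xx-xxxxxx`, the URL pattern is deliberately loose (a scheme followed by anything up to whitespace or a comma), and the IP part checks only the format, not the 0-255 range. The names `MOBILE_RE`, `last_four`, `URL_RE`, and `IP_PORT_RE` are made up for this illustration.

```python
import re

# Exercise 1 (assumed format 09xx-xxxxxx): the group captures the last 4 digits.
MOBILE_RE = re.compile(r'09\d{2}-\d{2}(\d{4})')

def last_four(number):
    """Return the last 4 digits if `number` looks like 09xx-xxxxxx, else None."""
    m = MOBILE_RE.fullmatch(number)
    return m.group(1) if m else None

# Exercise 2: a loose URL pattern, http or https followed by non-space, non-comma characters.
URL_RE = re.compile(r'https?://[^\s,]+')

# Exercise 3: IP and port as two capturing groups (format check only).
IP_PORT_RE = re.compile(r'\b((?:\d{1,3}\.){3}\d{1,3}):(\d{1,5})\b')

print(last_four("0912-345678"))   # 5678
print(last_four("0212-345678"))   # None
print(URL_RE.findall("https://www.google.com.tw, http://tw.yahoo.com"))
# ['https://www.google.com.tw', 'http://tw.yahoo.com']
print(IP_PORT_RE.findall("140.128.73.11:8080, 120.108.9.37:4123"))
# [('140.128.73.11', '8080'), ('120.108.9.37', '4123')]
```

Because `IP_PORT_RE` contains two capturing groups, `findall()` returns a list of `(ip, port)` tuples rather than the full matched strings.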
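
Sections 3.3 and 3.5 mention `re.finditer()` and `re.VERBOSE` without showing them in code. The following is a minimal sketch that combines the two, reusing the phone-number sample text from section 3.4. The pattern name `PHONE_RE` and the group names `prefix` and `number` are chosen here purely for illustration.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment,
# so each part of the pattern can be documented on its own line.
PHONE_RE = re.compile(r"""
    (?P<prefix>\d{4})   # first four digits
    -                   # literal hyphen
    (?P<number>\d{6})   # remaining six digits
""", flags=re.VERBOSE)

text = "My phone is 0936-279195, your phone is 0988-123456"

# finditer() yields Match objects one at a time, so each match keeps
# its groups and its position in the text.
found = [(m.group("prefix"), m.group("number"), m.span())
         for m in PHONE_RE.finditer(text)]

for prefix, number, span in found:
    print(prefix, number, span)
```

Compared with `findall()`, each item here is built from a full Match object, which is why the span is available alongside the groups.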
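
Section 2.5 lists the lookbehind forms `(?<=...)` and `(?<!...)` but leaves them without an example. As an optional supplement, here is a small sketch under the assumption that prices in the text are written with an `NT$` prefix; the sample sentence and variable names are invented for this illustration. Note that Python's `re` module requires a lookbehind pattern to have a fixed width, which `NT\$` and `\$` both satisfy.

```python
import re

text = "3 apples cost NT$30 and 2 pears cost NT$45"

# Positive lookbehind: digits that are immediately preceded by "NT$".
prices = re.findall(r'(?<=NT\$)\d+', text)

# Negative lookbehind: whole numbers that are NOT immediately preceded by "$".
counts = re.findall(r'(?<!\$)\b\d+\b', text)

print(prices)  # ['30', '45']
print(counts)  # ['3', '2']
```

In both patterns the lookbehind only tests what comes before the digits; the `NT$` prefix itself is never part of the returned match.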