# C# Regular Expressions: A Powerful Text-Matching Tool

Text processing is a routine task in C# development. Searching text, validating input, extracting information, and transforming text all call for an effective tool, and this is where C# regular expressions shine.

### What is a regular expression?

A regular expression (regex for short) is a powerful text-processing tool that matches and manipulates strings according to a pattern. It is not specific to C#; regular expressions are a general-purpose tool used across many programming languages and applications.

### Namespace and references

To use regular expressions, first import the `System.Text.RegularExpressions` namespace in your C# code:

```csharp!
using System.Text.RegularExpressions;
```

### Common methods and functions

#### 1\. `Regex.IsMatch` (commonly used for validation)

`Regex.IsMatch` checks whether a string contains a match for a regular expression pattern. Basic usage:

```csharp!
string input = "your_input_string";
string pattern = @"your_regex_pattern";
bool isMatch = Regex.IsMatch(input, pattern);
```

The `isMatch` variable holds a Boolean value indicating whether a match was found.

Example:

```csharp!
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "The quick brown fox jumps over the lazy dog.";
        string pattern = @"\bfox\b"; // regex pattern that matches the standalone word "fox"

        bool isMatch = Regex.IsMatch(input, pattern);

        if (isMatch)
        {
            Console.WriteLine("The input contains the word 'fox'.");
        }
        else
        {
            Console.WriteLine("The input does not contain the word 'fox'.");
        }
    }
}
```

#### 2\. `Regex.Match` (for retrieving the matched value, index, and length)

`Regex.Match` looks for the first match in a string and returns a `Match` object that contains information about it. For example:

```csharp!
string input = "your_input_string";
string pattern = @"your_regex_pattern";
Match match = Regex.Match(input, pattern);
```

The `Match` object holds information about a single match, such as the matched value and its position. You can access that information like this:

```csharp!
Match match = Regex.Match(input, pattern);
if (match.Success)
{
    string value = match.Value;   // the matched text
    int startIndex = match.Index; // zero-based start index of the match
    int length = match.Length;    // length of the match
}
```

Example:

```csharp!
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "The price of the product is $99.99, and the discount is 20%.";
        string pattern = @"\$\d+\.\d{2}"; // regex pattern that matches a price

        // Find the first match in the input string
        Match match = Regex.Match(input, pattern);

        // Check whether a match was found
        if (match.Success)
        {
            string value = match.Value;   // the matched text, "$99.99"
            int startIndex = match.Index; // zero-based start index, 28
            int length = match.Length;    // length of the match, 6

            Console.WriteLine("Original String: " + input);
            Console.WriteLine("Matched Value: " + value);
            Console.WriteLine("Start Index: " + startIndex);
            Console.WriteLine("Length: " + length);
        }
        else
        {
            Console.WriteLine("No match found.");
        }
    }
}
```

In this example, the string `input` contains price information, and the regular expression `pattern` matches the first price. `Regex.Match` returns a `Match` object, `match`, that describes the match. We check `match.Success` to make sure a match was found. If so, `match.Value` returns the matched text (`$99.99`), `match.Index` returns the zero-based start index (28), and `match.Length` returns the length of the match (6).

#### 3\. `Regex.Matches`

`Regex.Matches` finds every match in a string and returns a `MatchCollection` containing one `Match` object per match. For example:

```csharp!
string input = "your_input_string_with_multiple_matches";
string pattern = @"your_regex_pattern";
MatchCollection matches = Regex.Matches(input, pattern);
```

The `matches` object holds all of the matches; iterate over it to get the details of each one.

#### 4\. `Regex.Replace`

`Regex.Replace` replaces the matches in a string. Basic usage:

```csharp!
string input = "your_input_string_with_matches_to_replace";
string pattern = @"your_regex_pattern";
string replacement = "replacement_text";
string result = Regex.Replace(input, pattern, replacement);
```

The `result` variable holds the new string after replacement.

Example:

```csharp!
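using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "Hello, my email is user123@example.com, and my friend's email is friend456@example.com";
        string pattern = @"\b\w+@\w+\.\w+\b"; // regex pattern that matches an email address

        // Replace every email address with "[email hidden]"
        string replacement = "[email hidden]";
        string result = Regex.Replace(input, pattern, replacement);

        Console.WriteLine("Original String: " + input);
        Console.WriteLine("Modified String: " + result);
    }
}
```

In this example, the string `input` contains email addresses. The regular expression `pattern` matches each of them, and `Regex.Replace` substitutes `[email hidden]` for every match and stores the outcome in `result`.

These are the methods used most often when working with regular expressions in C#. They cover a wide range of text-processing tasks, including validation, extraction, and replacement.

### A simple regular expression example

Let's look at a simple regular expression. Suppose we want to check whether a string looks like a valid email address. Here is a simple pattern for that purpose:

```csharp
string pattern = @"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$";
```

**Explanation**:

- `^` and `$`: anchors that mark the start and end of the string. Here they ensure that the whole string matches the pattern, not just part of it.
- `[]`: square brackets define a character set. For example, `[a-zA-Z]` matches any single lowercase or uppercase letter.
- `+` and `*`: `+` matches the preceding element one or more times, and `*` matches it zero or more times.
- `.`: matches any character except a newline. Inside a character set such as `[a-zA-Z0-9.-]`, it is a literal period.
- `\`: escapes a special character. For example, `\.` matches a literal period rather than the wildcard.
- `{}`: specifies how many times to match. For example, `{2,4}` means 2 to 4 times, so this pattern rejects top-level domains longer than four letters.

This regular expression therefore checks that the input has the general shape of an email address, such as `example@email.com`. It is a simple format check, not a complete validation of email addresses.

### Examples of using regular expressions

Now let's walk through how each of the following examples works:

#### 1\. **Validating a phone number**:

```csharp!
string input = "123-456-7890"; // sample input to validate
string phoneNumberPattern = @"^\d{3}-\d{3}-\d{4}$";
bool isValid = Regex.IsMatch(input, phoneNumberPattern);
```

**Explanation**:

- `^\d{3}-\d{3}-\d{4}$` validates a standard US phone number such as `123-456-7890`.
- `^\d{3}`: the string must start with three digits.
- `-`: followed by a hyphen.
- `\d{3}`: then three more digits.
- `-`: another hyphen.
- `\d{4}`: finally four digits.
- `$`: the string must end here.

`isValid` is `true` only when the input matches the `123-456-7890` format in full.

#### 2\. **Extracting the domain from a URL**: (use parentheses to create capture groups and extract specific content)

```csharp!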
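string url = "https://www.example.com";
string domainPattern = @"^https?://(www\.)?([a-zA-Z0-9.-]+)";
Match match = Regex.Match(url, domainPattern);
string domain = match.Groups[2].Value; // empty string if the URL did not match
```

**Explanation**:

- `^https?://(www\.)?([a-zA-Z0-9.-]+)` extracts the domain from a URL.
- `^https?://`: matches the start of the URL, either `http://` or `https://`.
- `(www\.)?`: matches the optional `www.` subdomain. This is capture group 1.
- `([a-zA-Z0-9.-]+)`: matches the domain, which may contain letters, digits, periods, and hyphens. This is capture group 2.
- `Match match = Regex.Match(url, domainPattern);`: searches the `url` string with the `domainPattern` regular expression and stores the result in `match`.
- `string domain = match.Groups[2].Value;`: reads the second capture group (the second part enclosed in parentheses `()`), which represents the domain, and stores it in `domain`.

So if `url` is `https://www.example.com`, `domain` contains `example.com`, the extracted domain.

#### 3\. **Removing HTML tags from text**:

```csharp!
string html = "<p>Hello, <strong>world</strong>!</p>";
string noHtml = Regex.Replace(html, "<.*?>", "");
```

**Explanation**:

- `<.*?>` matches HTML tags, both opening and closing tags.
- `<` and `>`: match the left and right angle brackets that delimit an HTML tag.
- `.*`: matches any character (`.`) zero or more times (`*`), which covers whatever appears inside the tag between `<` and `>`.
- `?`: makes the preceding quantifier non-greedy (lazy), so it matches the shortest possible string and stops at the first `>` it encounters.

With `Regex.Replace`, every HTML tag in the `html` string is replaced with an empty string, leaving the plain text `"Hello, world!"`.

These examples show the basic use of regular expressions in C#. Depending on your needs, you can apply different patterns to match, validate, extract, and transform text.

## Common elements

| Element | Description |
| --- | --- |
| `.` | Matches any single character except a newline (such as the one produced by the Enter key). |
| `*` | Matches the preceding element zero or more times. For example, `a*` matches zero or more "a" characters. |
| `+` | Matches the preceding element one or more times. For example, `b+` matches one or more "b" characters. |
| `?` | Matches the preceding element zero or one time, making it optional. For example, `colou?r` matches both "color" and "colour". |
| `[]` | Defines a set of characters and matches any one of them. For example, `[aeiou]` matches any single lowercase vowel. |
| `[^]` | A caret `^` at the start of `[]` matches any character that is not in the set. For example, `[^0-9]` matches any character other than the digits 0 to 9. |
| `()` | Creates a group and captures what the group matched so it can be used later. |
| `\|` | Alternation: matches either the left side or the right side. For example, `cat\|dog` matches "cat" or "dog". |
| `\d` | Matches any digit. Equivalent to `[0-9]` for ASCII text; in .NET it also matches other Unicode decimal digits unless `RegexOptions.ECMAScript` is specified. |
| `\D` | Matches any non-digit character (the complement of `\d`). |
| `\w` | Matches any letter, digit, or underscore. Equivalent to `[a-zA-Z0-9_]` for ASCII text; in .NET it also matches Unicode letters and digits unless `RegexOptions.ECMAScript` is specified. |
| `\W` | Matches any character that is not a letter, digit, or underscore (the complement of `\w`). |
| `\s` | Matches any whitespace character, including spaces, tabs, and newlines. |
| `\S` | Matches any non-whitespace character. |
| `^` | Matches the start of the input string. |
| `$` | Matches the end of the input string (in .NET, also the position just before a final trailing newline). |
| `\b` | Matches a word boundary; typically used for whole-word matching. |
| `\B` | Matches a position that is not a word boundary. |

Reference:

菜鳥教程 (Runoob):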
https://www.runoob.com/regexp/regexp-metachar.html

Microsoft official documentation: https://learn.microsoft.com/zh-tw/dotnet/api/system.text.regularexpressions.regex?view=net-7.0
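
The `Regex.Matches` section above says that the returned `MatchCollection` can be iterated, but it stops at the placeholder snippet. As a supplement, here is a minimal sketch of that iteration. The `PriceFinder` class, the `FindPrices` method, and the sample sentence are hypothetical names and data chosen for illustration; the pattern is the price pattern from the `Regex.Match` example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
public static class PriceFinder
{
    // Returns every price-like token (for example "$99.99") found in the text,
    // in the order in which the matches occur.
    public static List<string> FindPrices(string text)
    {
        var prices = new List<string>();

        // Regex.Matches returns a MatchCollection; each item is a Match.
        foreach (Match m in Regex.Matches(text, @"\$\d+\.\d{2}"))
        {
            prices.Add(m.Value);
        }

        return prices;
    }

    public static void Main()
    {
        // Made-up sample sentence containing three prices.
        string input = "Coffee costs $3.50, tea costs $2.75, and cake costs $12.00.";

        foreach (string price in FindPrices(input))
        {
            Console.WriteLine(price);
        }
    }
}
```

Running this sketch prints the three prices from the sample sentence, one per line. Each `Match` in the collection also exposes `Index` and `Length`, just like the single `Match` returned by `Regex.Match`.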