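# Regular Expressions (Python)

## Basic symbols

Symbol | Meaning | Usage
------ |----------------------------------------------- |------------
`.` | Any character except a newline. | `.`
`?` | Matches 0 or 1 times. Placed after another quantifier, it makes that quantifier stop at the shortest possible match. | `a?`
`*` | Matches 0 or more times, as many as possible. | `a*`
`+` | Matches 1 or more times. | `a+`
`[]` | A custom character set. A `^` right after the opening bracket excludes the characters listed in `[]`. | `[abc]` `[^abc]`
`{}` | Specifies the number of repetitions. | `a{1,5}` `a{1,}` `a{,5}`
`()` | Groups part of a pattern. | `(abc)`
`^` | Start of the string. | `^abc`
`$` | End of the string. | `abc$`
`\d` | A digit from 0 to 9. Uppercase `\D` matches anything except `\d`. | `\d` `\D`
`\w` | Any letter, digit, or the underscore `_`. Uppercase `\W` matches anything except `\w`. | `\w` `\W`
`\s` | A whitespace character, including space, tab, and newline. Uppercase `\S` matches anything except `\s`. | `\s` `\S`

## Using regular expressions

In Python, regular expressions require importing the `re` module:

```
import re
```

Use `compile` to create a regular expression object from a pattern:

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
```

The `r` prefix makes the literal a raw string, so backslashes reach the regex engine unchanged instead of being interpreted by Python as escape sequences.

A Regex object offers several methods for different kinds of search:

Method | Purpose
---------|------
`match` | Matches only at the beginning of the string (like putting `^` at the start of the pattern)
`search` | Finds the first match
`findall` | Finds all matches
`finditer` | Finds all matches

`findall` returns the matches as a list, or an empty list when nothing is found.

`match` and `search` return a match object, or `None` when nothing is found. `finditer` returns an iterator of match objects, which yields nothing when there is no match. Read values from a match object with the following methods:

Method | Purpose
---------|------
`group` | The matched value
`span` | The start and end positions of the match (returned as a tuple)
`start` | The start position of the match
`end` | The end position of the match

### Using search

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.search(str)
print("group():", result.group())
print("span():", result.span())
print("start():", result.start())
print("end():", result.end())
```

Output:

![](https://i.imgur.com/SZXnSY1.png)

### Using match

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.match(str)
print(result)
```

Output:

![](https://i.imgur.com/x2VGx8J.png)

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
str = "02-9888-9898 is his office number."
result = Regex.match(str)
print(result)
```

Output:

![](https://i.imgur.com/RPn4fKr.png)

### Using findall

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."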
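result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/3gaWZJ7.png)

### Using finditer

```
Regex = re.compile(r'\d{2}-\d{4}-\d{4}')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.finditer(str)
for match in result:
    print("group():", match.group())
    print("span():", match.span())
    print("start():", match.start())
    print("end():", match.end())
    print()
```

Output:

![](https://i.imgur.com/xv5x7wD.png)

## Practice

1. Check whether a string is a Taiwanese phone number (0911111111, 886911111111)
2. Check whether a string is a Taiwanese national ID number (an uppercase English letter + 1 or 2 + eight digits)
3. Check whether a string fits this format: 1 to 3 groups of digits, two digits per group, separated by commas

## Advanced

### More symbols

Symbol | Meaning
:------ |:------
`\|` | Matches any one of several alternatives; the earliest alternative that matches becomes the result
`\A` | Start of the string. Like `^`, but unaffected by the MULTILINE flag
`\Z` | End of the string. Like `$`, but unaffected by the MULTILINE flag, and it does not match before a trailing newline
`\b` | A word boundary, that is, the start or end of a word

### Flags

Flag | Purpose
------------- | --------
ASCII, A | Use the ASCII character set
UNICODE, U | Use the Unicode character set (already the default for `str` patterns in Python 3)
DOTALL, S | `.` matches any character, including a newline
IGNORECASE, I | Ignore case
LOCALE, L | Makes `\w`, `\W`, `\b`, `\B` depend on the current locale. Not recommended
MULTILINE, M | `^` and `$` also match at the start and end of each line
VERBOSE, X | Ignores whitespace and comments starting with `#`, so a long pattern can be split across lines for readability

#### Flag examples

To use a flag, pass `re.<flag name>` as the second argument to `compile`, after the pattern. Either the full name or the one-letter abbreviation works.

```
Regex = re.compile(r'[ABC]', re.IGNORECASE)
# or, equivalently:
Regex = re.compile(r'[ABC]', re.I)
```

##### MULTILINE

```
Regex = re.compile(r'^\w+', re.M)
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/1oHmtLw.png)

##### VERBOSE

The following e-mail pattern shows the difference. Without the VERBOSE flag:

```
Regex = re.compile(r"^([\w!#$%&'*+\-/=?^_`{|}~]+)(\.[\w!#$%&'*+\-/=?^_`{|}~]+)*@[\w-]+(\.[\w-]+)+$")
```

With the VERBOSE flag:

```
Regex = re.compile(r"""
    ^
    ([\w!#$%&'*+\-/=?^_`{|}~]+)       # the string before the first dot
    (\.[\w!#$%&'*+\-/=?^_`{|}~]+)*    # any number of "dot plus string" groups
    @
    [\w-]+                            # one string
    (\.[\w-]+)+                       # at least one "dot plus string" group
    $
    """, re.X)
```

### Groups

Whatever is enclosed in `()` is a group. When you need finer details from a match, grouping lets you extract them.

In the example below, three pairs of parentheses have been added to the original pattern so that each of the three runs of digits becomes its own group. Pass the group number to the match object's methods to read that group.

An argument of `0` is the same as no argument: it refers to the string matched by the whole pattern.

You can also pass several arguments separated by commas, which returns a tuple.

```
Regex = re.compile(r'(\d{2})-(\d{4})-(\d{4})')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."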
result = Regex.search(str)
print(result.group(0), result.span(0))
print(result.group(1), result.span(1))
print(result.group(2), result.span(2))
print(result.group(3), result.span(3))
print(result.group(0, 1, 2, 3))
```

Output:

![](https://i.imgur.com/LauzZvT.png)

When the pattern contains groups, `findall` returns only the contents of the groups. To get the full matched string as well, wrap the whole pattern in one more pair of parentheses.

```
Regex = re.compile(r'(\d{2})-(\d{4})-(\d{4})')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/Ch9eb5d.png)

If you do not want `findall` to return the contents of a particular pair of parentheses, put `?:` right after its opening parenthesis:

```
Regex = re.compile(r'((\d{2})-(\d{4})-(?:\d{4}))')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/KoaajIh.png)

There are many more constructs of the `(?...)` form, for example:

1. `a(?=b)`: `a` must be followed by `b`

```
Regex = re.compile(r'\d{4}(?=-1688)')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/rMfX3ks.png)

Note that the condition inside `(?= )` does not appear in the result.

2. `a(?!b)`: `a` must not be followed by `b`

```
Regex = re.compile(r'\d{4}(?!-1688)')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.findall(str)
print(result)
```

Output:

![](https://i.imgur.com/lTfvnFo.png)

3. `(?P<first>...)`: names the group `first`

Besides serving as an argument to `group` and similar methods, the name is also used as a key in the dictionary returned by `groupdict`.

```
Regex = re.compile(r'(?P<first>\d{2})-(?P<second>\d{4})-(?P<third>\d{4})')
str = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
result = Regex.search(str)
print("group()\t\t\t:", result.group())
print("group(\"first\")\t:", result.group("first"))
print("group(\"second\")\t:", result.group("second"))
print("group(\"third\")\t:", result.group("third"))
print("groupdict()\t\t:", result.groupdict())
```

Output:

![](https://i.imgur.com/b2AVW2j.png)
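
Named groups also combine with `finditer`, which the examples above use separately. The following is a minimal sketch, not part of the original notes: the names `PHONE`, `parse_phones`, and `sample` are hypothetical and chosen only for illustration. It collects one `groupdict()` dictionary per phone number found in the sample text.

```python
import re

# Hypothetical names for this sketch (PHONE, parse_phones, sample);
# the pattern itself is the named-group pattern used in this section.
PHONE = re.compile(r'(?P<first>\d{2})-(?P<second>\d{4})-(?P<third>\d{4})')


def parse_phones(text):
    """Return one dict of named groups for every match found in text."""
    # finditer yields a match object per match; groupdict() maps
    # each group name to the substring that group captured.
    return [m.groupdict() for m in PHONE.finditer(text)]


sample = "Please call David at 02-8888-1688 by today.\r\n02-9888-9898 is his office number."
phones = parse_phones(sample)
for phone in phones:
    print(phone)
# {'first': '02', 'second': '8888', 'third': '1688'}
# {'first': '02', 'second': '9888', 'third': '9898'}
```

Because `finditer` simply yields nothing when there is no match, `parse_phones` returns an empty list in that case and needs no `None` check.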