Regular Expressions

tags: `python`

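Introduction

A regular expression (commonly abbreviated as regex, regexp, or RE) is known in Chinese under several names: 正規表達式, 正規表示法, 規則運算式, and 常規表示法.

Regular expressions operate on strings: a rule (pattern) describes the text to look for, and the regex engine searches a string for the parts that match it.

That is why they are often used to parse plain-text documents such as txt, html, xml, and json files, either to extract the needed text or to process the file's contents.

In Python, the module for regular expressions is re. First define the pattern, then supply the string to process, and finally call the appropriate function in the re module.

Tip:
Patterns are usually written as Python raw strings (prefixed with r). Regular expression syntax relies heavily on the backslash (\), and the same character is the escape character in ordinary Python string literals, so the two can conflict. Writing the pattern as a raw string avoids the conflict.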
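To make the tip about raw strings concrete, here is a minimal sketch. The sample text and variable names are made up for illustration; the point is only that `\b` means "word boundary" to the regex engine but "backspace" to an ordinary Python string literal.

```python
import re

text = 'cat concat cat'

# Raw string: the regex engine receives the two characters \ and b,
# which it reads as a word boundary.
raw_matches = re.findall(r'\bcat\b', text)

# Ordinary string: Python turns '\b' into a single backspace character
# before the regex engine ever sees it, so nothing in the text matches.
plain_matches = re.findall('\bcat\b', text)

print(raw_matches)
print(plain_matches)
print(len(r'\b'), len('\b'))
```

The raw-string version finds the two standalone words, while the ordinary-string version finds nothing, which is the kind of silent failure the r prefix is meant to prevent.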

Online Resources

Regular expression exercises: https://regexone.com/
Regular expression tester: https://regex101.com/

Uses

Finding data (findall)
Validating data (search, match)
Extracting data (split, sub)

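Commonly used re module functions:

Function	Description
`findall(pattern, string)`	Returns all non-overlapping matches of pattern in string, as a list.
`finditer(pattern, string)`	Returns all matches of pattern in string, as an iterator of Match objects.
`search(pattern, string)`	Scans string and returns a Match object for the first place where pattern matches, or None if nothing matches.
`match(pattern, string)`	Matches only at the start of string. Returns a Match object on success and None on failure. To require the whole string to match, end the pattern with $.
`fullmatch(pattern, string)`	Returns a Match object if the entire string matches pattern, otherwise None.
`compile(pattern)`	Compiles the pattern string into a pattern object that can be reused, either through its own methods or by passing it to other functions that accept a regular expression.
`split(pattern, string, maxsplit=0)`	Splits string at each match of pattern and returns the pieces as a list.
`sub(pattern, repl, string, count=0)`	Replaces the matches of pattern in string with repl and returns the new string.
`subn(pattern, repl, string, count=0)`	Same as sub, but returns a tuple of (new string, number of replacements).
`escape(pattern)`	Returns a new string in which the special characters of pattern are escaped with backslashes.
`purge()`	Clears the regular expression cache.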
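The functions in the table that have no worked example elsewhere in these notes (compile, finditer, subn, escape) can be sketched as follows. The sample sentence and variable names are assumptions chosen for illustration only.

```python
import re

text = 'Order 12 ships in 3 days; order 7 ships in 10 days.'

# compile() builds the pattern once so it can be reused.
number = re.compile(r'\d+')

# finditer() yields Match objects, which carry positions as well as text.
found = [(m.group(), m.start()) for m in number.finditer(text)]
print(found)

# subn() returns the new string together with the replacement count.
masked, count = number.subn('#', text)
print(masked, count)

# escape() backslash-escapes special characters so they match literally.
literal = re.escape('1+1=2')
print(literal)
print(re.findall(literal, 'We know 1+1=2.'))
```

Without escape(), the + in '1+1=2' would be read as a quantifier and the pattern would not match the literal text.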

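Note:

Difference between re.match() and re.search()

re.match() matches only at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search() scans the whole string and succeeds as soon as it finds one match; it returns None only when nothing in the string matches.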
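A short sketch of that difference, reusing the sample sentence from the examples in these notes. fullmatch() is included for contrast because it is the strictest of the three.

```python
import re

txt = 'The rain in Spain'

# match() only tries position 0, so 'rain' is not found there.
print(re.match(r'rain', txt))

# search() scans forward and finds 'rain' inside the string.
m = re.search(r'rain', txt)
print(m.group(), m.span())

# match() succeeds when the pattern fits the start of the string.
print(re.match(r'The', txt).group())

# fullmatch() requires the pattern to cover the entire string.
print(re.fullmatch(r'The', txt))
print(re.fullmatch(r'The.*Spain', txt).group())
```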

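Metacharacters

Description:

Metacharacter	Description	Example	Example meaning
[]	A set of characters.	[a-m]	Any lowercase letter from a to m
\	Signals a special sequence (also used to escape special characters).	\d	Any digit
.	Any character except a newline.	he..o	he, followed by any two characters, followed by o
^	Matches at the start of the string.	^hello	The string starts with hello
$	Matches at the end of the string.	world$	The string ends with world
*	The preceding character or group occurs any number of times (including zero).	aix*	ai, aix, aixx, and so on with any number of x all match.
?	The preceding character or group occurs 0 or 1 time.	aix?	Only ai and aix match.
+	The preceding character or group occurs at least once.	aix+	aix, aixx, and so on match; ai does not.
{m,n}	The preceding character or group occurs between m and n times; {m} means exactly m times.	al{2} al{3,6}	a followed by exactly 2 l; a followed by 3 to 6 l
|	Alternation between characters or groups; for example, 'a|b' means 'a' or 'b'.	falls|stays	The string contains falls or stays
()	Groups the characters inside the parentheses.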
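The table gives no example for (), so here is a minimal sketch of grouping. The sample strings are invented for illustration.

```python
import re

txt = 'Meeting moved from 09:30 to 14:45'

# Each pair of parentheses captures part of the match as a numbered group.
m = re.search(r'(\d{2}):(\d{2})', txt)
print(m.group(0))   # the whole match
print(m.group(1))   # first group
print(m.group(2))   # second group

# Parentheses also let a quantifier apply to a whole sequence.
repeated = re.search(r'(ab){2,3}', 'xxabababxx').group()
print(repeated)

# When the pattern contains groups, findall() returns the groups
# rather than the whole matches.
pairs = re.findall(r'(\d{2}):(\d{2})', txt)
print(pairs)
```

The last call is worth remembering: adding parentheses to a pattern changes what findall() returns, from a list of strings to a list of tuples.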

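Special Sequences

Description:

Special sequence	Description
\A	Matches only at the start of the string.
\b	Matches at a word boundary (the start or end of a word).
\B	Matches at a position that is not a word boundary.
\d	A digit, 0 to 9.
\D	A non-digit.
\s	Any whitespace character, including the newline \n.
\S	Any non-whitespace character.
\w	Any word character: letters, digits, and the underscore.
\W	Any non-word character, including whitespace.
\Z	Matches only at the end of the string.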

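Supplement:

\A and \Z work much like ^ and $. The difference is that \A and \Z always refer to the start and end of the entire string, whereas ^ and $ can also match at the start and end of each line when the re.MULTILINE flag is set.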
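The distinction between \A, \Z and ^, $ only shows up on multi-line input. A minimal sketch, assuming a two-line sample string and the re.MULTILINE flag:

```python
import re

txt = 'first line\nsecond row'

# With re.MULTILINE, ^ matches at the start of every line...
line_starts = re.findall(r'^\w+', txt, re.MULTILINE)
# ...while \A still matches only at the start of the whole string.
string_start = re.findall(r'\A\w+', txt, re.MULTILINE)

# Likewise, $ matches at the end of every line, \Z only at the very end.
line_ends = re.findall(r'\w+$', txt, re.MULTILINE)
string_end = re.findall(r'\w+\Z', txt, re.MULTILINE)

print(line_starts, string_start)
print(line_ends, string_end)
```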

`findall()`

Example 1: find the lowercase letters from a to m

import re

txt = 'The rain in Spain'
x = re.findall(r'[a-m]', txt)
print(x)

Output:

['h', 'e', 'a', 'i', 'i', 'a', 'i']

Example 2: find the digits

import re

txt = 'That will be 59 dollars'
x = re.findall(r'\d', txt)
print(x)

Output:

['5', '9']

Example 3: find he followed by any two characters and then o

import re

txt = 'hello world'
x = re.findall(r'he..o', txt)
print(x)

Output:

['hello']

Example 4: the string must start with hello

import re

txt = 'hello world'
x = re.findall(r'^hello', txt)

if x:
  print("Yes, the string starts with 'hello'")
else:
  print('No match')

Output:

Yes, the string starts with 'hello'

Example 5: the string ends with world

import re

txt = 'hello world'
x = re.findall(r'world$', txt)

if x:
  print("Yes, the string ends with 'world'")
else:
  print('No match')

Output:

Yes, the string ends with 'world'

Example 6: find ai followed by zero or more x characters

import re

txt = 'The rain in Spain falls mainly in the plain!'
x = re.findall(r'aix*', txt)
print(x)

if x:
  print('Yes, there is at least one match!')
else:
  print('No match')

Output:

['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!

Example 7: find ai followed by one or more x characters

import re

txt = 'The rain in Spain falls mainly in the plain!'
x = re.findall(r'aix+', txt)
print(x)

if x:
  print('Yes, there is at least one match!')
else:
  print('No match')

Output:

[]
No match

Example 8: find a followed by exactly two l characters

import re

txt = 'The rain in Spain falls mainly in the plain!'
x = re.findall(r'al{2}', txt)
print(x)

if x:
  print('Yes, there is at least one match!')
else:
  print('No match')

Output:

['all']
Yes, there is at least one match!

Example 9: the string contains falls or stays

import re

txt = 'The rain in Spain falls mainly in the plain!'

# Check if the string contains either 'falls' or 'stays':
x = re.findall(r'falls|stays', txt)
print(x)

if x:
  print('Yes, there is at least one match!')
else:
  print('No match')

Output:

['falls']
Yes, there is at least one match!

`search()`

Find the position of the first whitespace character

import re

txt = 'The rain in Spain'
x = re.search(r'\s', txt)

print('The first white-space character is located in position:', x.start())

Output:

The first white-space character is located in position: 3

Check whether Portugal appears in the string

import re

txt = 'The rain in Spain'
x = re.search(r'Portugal', txt)
print(x)

Output:

None

`split()`

Split the string at whitespace characters.

import re

txt = 'The rain in Spain'
x = re.split(r'\s', txt)
print(x)

Output:

['The', 'rain', 'in', 'Spain']

Split the string at whitespace characters, limiting the maximum number of splits.

import re

# Split the string at the first white-space character:
txt = 'The rain in Spain'
x = re.split(r'\s', txt, maxsplit=1)
print(x)

Output:

['The', 'rain in Spain']

`sub()`

Replace every whitespace character with 9:

import re

# Replace all white-space characters with the digit '9':
txt = 'The rain in Spain'
x = re.sub(r'\s', '9', txt)
print(x)

Output:

The9rain9in9Spain

Replace whitespace characters with 9, limiting the maximum number of replacements:

import re

# Replace the first two occurrences of a white-space character with the digit 9:
txt = 'The rain in Spain'
x = re.sub(r'\s', '9', txt, count=2)
print(x)

Output:

The9rain9in Spain

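Set Examples

Set	Description
[arn]	Matches any one of the characters a, r, or n.
[a-n]	Matches any lowercase letter from a to n.
[^arn]	Matches any character except a, r, and n.
[0123]	Matches any one of the digits 0, 1, 2, or 3.
[0-9]	Matches any digit from 0 to 9.
[0-5][0-9]	Matches any two-digit number from 00 to 59.
[a-zA-Z]	Matches any letter from a to z, lowercase or uppercase.
[+]	Matches a literal + character; inside a set, characters such as `+`, `*`, and `.` have no special meaning.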
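A few of the sets above can be tried out with findall(). This is a minimal sketch; the sample strings are assumptions picked so that each set has something to match.

```python
import re

txt = 'The rain in Spain'

# [arn]: any one of a, r, or n.
in_set = re.findall(r'[arn]', txt)
print(in_set)

# [^arn]: any character except a, r, and n (spaces included).
not_in_set = re.findall(r'[^arn]', 'rain man')
print(not_in_set)

# [0-5][0-9]: a two-digit number from 00 to 59.
minutes = re.findall(r'[0-5][0-9]', '8 times before 11:45 AM')
print(minutes)

# [+]: inside a set, + is an ordinary character.
plus_signs = re.findall(r'[+]', '1 + 2 + 3')
print(plus_signs)
```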

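Comparing `match`, `search`, `findall`, and `finditer`

	match	search	findall	finditer
Description	Starts at the beginning of the string and succeeds if the pattern matches there	Succeeds if the pattern matches anywhere in the string	Returns every substring that matches the pattern	Returns an iterator over every match of the pattern
Returns on success	Match object	Match object	List	Iterator of Match objects
Returns on failure	None	None	Empty list	Empty iterator
Other	To require the whole string to match, the pattern must end with $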
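The return values in the table can be checked side by side. A minimal sketch, with one string that contains the pattern and one that does not (both chosen for illustration):

```python
import re

pattern = r'ai'
hit = 'The rain in Spain'
miss = 'Hello world'

# match() and search() return a Match object or None.
print(re.match(pattern, hit))      # None: the pattern is not at position 0
print(re.search(pattern, hit))
print(re.search(pattern, miss))

# findall() always returns a list, empty when nothing matches.
print(re.findall(pattern, hit))
print(re.findall(pattern, miss))

# finditer() always returns an iterator; consume it to see the matches.
spans = [m.span() for m in re.finditer(pattern, hit)]
print(spans)
print(list(re.finditer(pattern, miss)))
```

Because match() and search() can return None, their result should be checked before calling methods such as group() or start() on it.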

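Application: searching HTML attributes with regular expressions

Use a regular expression to find multiple tags that meet a condition:

from bs4 import BeautifulSoup
import re

html_doc = '''
<html>
  <head>
    <title>This is the HTML document title</title>
  </head>
  <body>
    <div id='item-1'>Beef jerky</div>
    <div id='item-2'>Pork jerky</div>
    <div id='item-3'>Lamb jerky</div>
    <div id='item-4'>Bird jerky</div>
    <div id='item-5'>Chicken jerky</div>
  </body>
</html>
'''

# Parse the HTML document into a BeautifulSoup object (the 'lxml' parser must be installed)
soup = BeautifulSoup(html_doc, 'lxml')
# Find every tag whose id attribute starts with "item"
items = soup.find_all(id=re.compile(r'^item'))
for i in items:
  print(i)

Output:

<div id="item-1">Beef jerky</div>
<div id="item-2">Pork jerky</div>
<div id="item-3">Lamb jerky</div>
<div id="item-4">Bird jerky</div>
<div id="item-5">Chicken jerky</div>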

Validating a mobile phone number

10 digits in total
Must start with 09
Every character must be a digit

^09\d{8}$

Example:

import re

phones = ['0912345678', '023456789', '096312345']

for p in phones:
  result = re.findall(r'^09\d{8}$', p)

  if result:
    print(p + ' is a mobile number')
  else:
    print(p + ' is not a mobile number')
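As an alternative sketch of the same check, the three rules can be enforced with a compiled pattern and fullmatch(), which needs no ^ or $ anchors. The names MOBILE and is_mobile are hypothetical helpers introduced here, not part of the re module.

```python
import re

# Hypothetical names; the pattern encodes the three rules listed above.
MOBILE = re.compile(r'09\d{8}')

def is_mobile(number):
    """Return True when number is exactly 10 digits starting with 09."""
    return MOBILE.fullmatch(number) is not None

for p in ['0912345678', '023456789', '096312345', '0912345678\n']:
    print(repr(p), is_mobile(p))

# For comparison: $ also matches just before a trailing newline,
# so the anchored pattern accepts input that ends with '\n'.
with_newline = re.findall(r'^09\d{8}$', '0912345678\n')
print(with_newline)
```

The trailing-newline case matters when numbers are read line by line from a file without stripping; fullmatch() rejects such input while the ^...$ form lets it through.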

Exercises

Exercise 1: from the email list below, extract only the account part and drop the rest:

aaronho@gmail.com
andyliu@yahoo.com
apple@google.com
abner@microsoft.com
amberok@facebook.com

Keep only:

aaronho
andyliu
apple
abner
amberok

Exercise 2: which of the following strings match this RE: /\w\w\w.\d\d\d/?

1. 000000
2. 9999999
3. aaaaaaa
4. 0a0a000
5. 0a0a0a0
6. cc3c777
7. cccc777

Exercise 3: find the words containing tion or sion in the following passage:

A regular expression (shortened as regex or regexp;also referred to as rational 
expression) is a sequence of characters that define a search pattern. Usually 
such patterns are used by string-searching algorithms for "find" or "find and 
replace" operations on strings, or for input validation. It is a technique 
developed in theoretical computer science and formal language theory.

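Answers

Exercise 1: match everything from the @ to the end of each address, then replace it with an empty string.

@.+$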
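One way to apply this pattern in Python is sketched below, assuming the addresses are held in a list (the variable names are illustrative). re.sub() replaces the matched domain part with an empty string.

```python
import re

emails = [
    'aaronho@gmail.com',
    'andyliu@yahoo.com',
    'apple@google.com',
    'abner@microsoft.com',
    'amberok@facebook.com',
]

# Remove everything from the @ to the end of each address.
accounts = [re.sub(r'@.+$', '', e) for e in emails]

for a in accounts:
    print(a)
```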

Exercise 2:

9999999
0a0a000
cc3c777
cccc777
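The answer can be checked with a short script. This sketch assumes "match" means the pattern is found somewhere in the string, so it uses search(); the list simply repeats the seven candidates from the exercise.

```python
import re

candidates = ['000000', '9999999', 'aaaaaaa', '0a0a000',
              '0a0a0a0', 'cc3c777', 'cccc777']

# Three word characters, any one character, then three digits.
pattern = re.compile(r'\w\w\w.\d\d\d')

matched = [s for s in candidates if pattern.search(s)]
print(matched)
```

The pattern needs at least seven characters ending in three digits, which is why 000000 (six characters), aaaaaaa, and 0a0a0a0 fail.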

Exercise 3:

import re

txt = '''
A regular expression (shortened as regex or regexp;also referred to as rational 
expression) is a sequence of characters that define a search pattern. Usually 
such patterns are used by string-searching algorithms for "find" or "find and 
replace" operations on strings, or for input validation. It is a technique 
developed in theoretical computer science and formal language theory.
'''

result = re.findall(r'\w*tion\w*|\w*sion\w*', txt)
print(result)
