Python / re
===
###### tags: `Python`, `PyPi`, `re`, `regular expression`, `regexp`, `reg-exp`, `regex`, `re-gex`, `zero-width positive lookahead`, `zero-width positive lookbehind`, `\b`, `word boundary`, `raw string`
<br>
[TOC]
<br>
## Documentation
- [Pattern (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) (comprehensive) :+1:
<br>
## raw string
> The string notation commonly used for regex patterns
> r"...": raw string
### Python gotcha: trailing backslash in a raw string
```python
s1 = '\\A\\B'
print(f'len(s1) = {len(s1)} -> {s1}') # len(s1) = 4 -> \A\B
s2 = r'\\A\\B'
print(f'len(s2) = {len(s2)} -> {s2}') # len(s2) = 6 -> \\A\\B
s3 = s2.replace('\\', '*')
print(f'len(s3) = {len(s3)} -> {s3}') # len(s3) = 6 -> **A**B
s4 = s2.replace(r'\\', '*')
print(f'len(s4) = {len(s4)} -> {s4}') # len(s4) = 4 -> *A*B
#s5 = s2.replace(r'\', '*') # SyntaxError: EOL while scanning string literal (documented behavior, not a bug: a raw string literal cannot end with a single backslash)
#print(f'len(s5) = {len(s5)} -> {s5}')
```
- Output:
```
len(s1) = 4 -> \A\B
len(s2) = 6 -> \\A\\B
len(s3) = 6 -> **A**B
len(s4) = 4 -> *A*B
```
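Why raw strings matter for regex can be seen with `\b`. The following is a minimal sketch; the sample text `'a cat'` and the variable names are only illustrative.

```python
import re

# '\b' in a normal string is ONE character (backspace, 0x08);
# r'\b' is TWO characters (backslash + 'b'), which re reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'
print(len(plain), len(raw))             # 5 7

print(re.search(plain, 'a cat'))        # None
print(re.search(raw, 'a cat').group())  # cat

# A raw string cannot end with a single backslash,
# so a lone backslash is written as the normal string '\\'.
single_backslash = '\\'
print(len(single_backslash))                     # 1
print(r'\\A\\B'.replace(single_backslash, '*'))  # **A**B
```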
### References
- [What does preceding a string literal with "r" mean? [duplicate]](https://stackoverflow.com/questions/4780088/)
<br>
## lookahead & lookbehind assertions
```python=
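import re
x = 'A1A-A2B-B3A-B4B'
print(re.search(r'\d', x))             # match='1'
# lookbehind: (?<=...) is positive, (?<!...) is negative
print(re.search(r'(?<=B)\d', x))       # match='3'
print(re.search(r'(?<!B)\d', x))       # match='1'
# lookahead: (?=...) is positive, (?!...) is negative
print(re.search(r'\d(?=B)', x))        # match='2'
print(re.search(r'\d(?!B)', x))        # match='1'
print('-' * 50)
# lookbehind and lookahead combined
print(re.search(r'(?<=A)\d(?=A)', x))  # match='1'
print(re.search(r'(?<=A)\d(?=B)', x))  # match='2'
print(re.search(r'(?<=B)\d(?=A)', x))  # match='3'
print(re.search(r'(?<=B)\d(?=B)', x))  # match='4'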
```
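- lookahead: attaches an if-condition to what follows the match (the match itself is `\d`)
- lookbehind: attaches an if-condition to what precedes the match (the match itself is `\d`)
- positive: the condition must hold
- negative: the condition must not hold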
<br>
### Tutorial example
- [RegExp in practice: lookahead, lookbehind](https://darkk6.blogspot.com/2017/03/regexp-lookahead-lookbehind.html)
- Data: `12345 XD Hi12345678ab666666cd987654321`
- Goal: find `12345678` and `666666`, but not the `98765432` or `87654321` inside `987654321`
- Pattern: `(?<!\d)\d{6,8}(?!\d)`
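The pattern above can be checked directly in Python. This is a small sketch using the data string from the example; the second call (without the lookarounds) is added only for contrast.

```python
import re

data = '12345 XD Hi12345678ab666666cd987654321'

# A run of 6 to 8 digits that is neither preceded nor followed by another digit.
bounded = re.findall(r'(?<!\d)\d{6,8}(?!\d)', data)
print(bounded)    # ['12345678', '666666']

# Without the lookarounds, a slice of the 9-digit run is picked up as well.
unbounded = re.findall(r'\d{6,8}', data)
print(unbounded)  # ['12345678', '666666', '98765432']
```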
<br>
### Exercise 1
- ### Data:
```
>>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2615,"total_tokens":2615,"completion_tokens":0}}'
>>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"content":" Yes"}}],"usage":{"prompt_tokens":2615,"total_tokens":2616,"completion_tokens":1}}'
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":" "}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
>>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"content":" ("}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-b5c1792fdf7f4e0d86800f041ea3bb7d",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-3100b13ffba346eb847272368cae61dd",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"content":"\xe7\x84\xa1"}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-54fb4102c0704d57a6820d0e343f4c56",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-4d31f93a866c436aa646b5ab78f6c8bc",""choices":[{"index":0,"delta":{"content":" rain"}}],"usage":{"prompt_tokens":2567,"total_tokens":2568,"completion_tokens":1}}'
>>> b'data: {"id":"chat-5e0b28d1f8974d57a03587a1df97fd6f",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2615,"total_tokens":2616,"completion_tokens":1}}'
>>> b'data: {"id":"chat-b5c1792fdf7f4e0d86800f041ea3bb7d",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-3100b13ffba346eb847272368cae61dd",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
```
- ### Goal: keep only the lines with `"id":"chat-23218ab41db347c986c001cb5de2d776"`
- ### Solution:
- Search: `^.*chat-(?!23218ab41db347c986c001cb5de2d776).*\n?`
- Replace with: (empty string)
- ### Data after processing
```
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":" "}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
>>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
```
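The same search-and-replace can be sketched in Python with `re.sub`. The log lines below are abbreviated stand-ins for the data above (payloads shortened), and `KEEP_ID`, `log`, and `filtered` are illustrative names. `re.MULTILINE` is needed so that `^` matches at the start of every line.

```python
import re

KEEP_ID = '23218ab41db347c986c001cb5de2d776'

# Abbreviated stand-ins for the log lines above (payloads shortened).
log = '\n'.join([
    'data: {"id":"chat-c535fdee63204413b5003c10c9810440","content":" Yes"}',
    'data: {"id":"chat-23218ab41db347c986c001cb5de2d776","content":" "}',
    'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4","content":" ("}',
    'data: {"id":"chat-23218ab41db347c986c001cb5de2d776","content":""}',
]) + '\n'

# Delete every line whose `chat-` is NOT followed by the id we want to keep.
pattern = re.compile(r'^.*chat-(?!' + re.escape(KEEP_ID) + r').*\n?', re.MULTILINE)
filtered = pattern.sub('', log)
print(filtered, end='')
```

Only the two lines carrying `KEEP_ID` remain, in their original order.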
<br>
## Hints
- ### Text representation
```python
import re
# ^(.*?)   ([._+-]+    [Rr]?     [12])    ([._+-]+.*)?       $
#  stem     separator  optional  1 or 2   optional separator
#           . _ + -    R or r             + rest of the name
# `-` must come last inside `[...]`: `[._-+]` raises "bad character range".
# `[1|2]` would also accept a literal `|`, hence `[12]`.
BASENAME_FORMAT = r'^(.*?)([._+-]+[Rr]?[12])([._+-]+.*)?$'
BASENAME_PATTERN = re.compile(BASENAME_FORMAT)
```
- ### Characters listed inside `[...]` do not need escaping
```python
import re
m = re.compile('.*?([._+-]+)[12](.fq)').match('abc*-1*fq')
m.groups()
```
Output:
```
('-', '*fq')
```
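- The `*` in `*-1` is not captured: inside `[...]`, `.` is a literal dot
- The `*` in `*fq` is captured: outside `[...]`, the unescaped `.` in `(.fq)` matches any character
- `[._+-]+` does not need to be written as `[\._+-]+`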
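- ### error: bad character range
```python
import re
# OK: `-` is the last character in the class, so it is a literal hyphen.
mm = re.compile('.*?([._+-]+)[12](.fq)').match('abc*-1*fq')
# Uncommenting the next line raises re.error: bad character range _-+
#m = re.compile('.*?([._-+]+)[12](.fq)').match('abc*-1*fq')
mm.groups()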
```
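- Inside `[...]`, the character `-` has a special meaning: it denotes a range
- `A-Z`: `A` through `Z`
- `A-z`: `A` through `Z`, `a` through `z`, and everything in between
- The characters between `Z` and `a` are included: `[`, `\`, `]`, `^`, `_` and the backtick
- Comparison
- In `[._-+]+`, `_-+` is read as the range from `_` (0x5F) to `+` (0x2B), which runs backwards and is rejected
- `[._+-]+` has no such problem, because a `-` placed last is a literal hyphen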
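The ordering rule can be verified with a short sketch. The sample string `'a_b-c+d.e'` and the file name `sample_R1.fastq.gz` are hypothetical, chosen only to exercise the character class and a hyphen-last variant of the basename pattern from the hints above.

```python
import re

# `-` between two characters is a range; `_-+` runs backwards, so compiling fails.
try:
    re.compile(r'[._-+]+')
    compiled = True
except re.error as exc:
    compiled = False
    print('re.error:', exc)

# Three ways to get a literal hyphen: put it last, put it first, or escape it.
results = {}
for cls in (r'[._+-]+', r'[-._+]+', r'[._\-+]+'):
    results[cls] = re.findall(cls, 'a_b-c+d.e')
    print(cls, results[cls])

# Basename pattern with the hyphen moved last and `[12]` instead of `[1|2]`.
BASENAME_PATTERN = re.compile(r'^(.*?)([._+-]+[Rr]?[12])([._+-]+.*)?$')
groups = BASENAME_PATTERN.match('sample_R1.fastq.gz').groups()
print(groups)  # ('sample', '_R1', '.fastq.gz')
```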
### Non-strict JSON --> strict JSON
```python=
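import ast
import json
import re
data = 'data: {content:"",stop:False}'
data = '{' + data + '}'  # wrap the line so that `data:` becomes a key as well
# Quote every bare key that directly follows `{` or `,`.
data = re.sub(
    r'(?<=[{,])\s*(\w+):',
    r'"\1":',
    data)
print(data)  # {"data": {"content":"","stop":False}}
# `False` is a Python literal, so parse with ast.literal_eval (safer than eval).
print(json.dumps(ast.literal_eval(data)))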
```
- Test data
- `'data: {"content":"","stop":False}'`
- `'data: {content:"",stop:False}'`
- References
- [Negative look ahead python regex](https://stackoverflow.com/questions/9952169/)
- Lookaheads are "zero-width", meaning they do not consume any characters. For example, these two expressions will never match:
- `(?=foo)bar`
- `(?!foo)foo`
- [[Java Regular Expressions] Pattern (Java Platform SE 8 )](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)
- zero-width positive lookahead
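The "zero-width" property quoted above can be confirmed with a minimal sketch; the test strings `'foobar'` and `'foo'` are illustrative.

```python
import re

# The lookahead checks the text at the current position but consumes nothing,
# so `bar` would have to match at the very spot where `foo` was just seen.
never_1 = re.search(r'(?=foo)bar', 'foobar')
never_2 = re.search(r'(?!foo)foo', 'foo')
print(never_1, never_2)    # None None

# The match ends before the looked-ahead text: `bar` is checked, not consumed.
m = re.search(r'foo(?=bar)', 'foobar')
print(m.group(), m.end())  # foo 3
```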
<br>
## case studies
### single
- ### case1:
- Input: `'the correct answer is:\nC:'`
```python
import re
text = 'the correct answer is:\nC:'
matcher = re.search(
    r'(?s)answer.{1,30}?([ABCD1234])[^\w]', text)
```
- ### References
- [matching any character including newlines in a Python regex subexpression, not globally](https://stackoverflow.com/questions/33312175/)
- `(?s)` - the inline modifier group with s flag on sets a scope where all . patterns match any char including line break chars
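The effect of `(?s)` can be compared against the default behaviour in a short sketch that uses the case1 input. The scoped form `(?s:...)` is the "not globally" variant from the linked question and needs Python 3.6 or later; `\W` is used here as shorthand for `[^\w]`.

```python
import re

text = 'the correct answer is:\nC:'

# Without DOTALL, `.` stops at the line break, so the answer letter is never reached.
no_flag = re.search(r'answer.{1,30}?([ABCD1234])\W', text)
print(no_flag)                  # None

# (?s) at the start of the pattern turns DOTALL on for the whole pattern.
global_flag = re.search(r'(?s)answer.{1,30}?([ABCD1234])\W', text)
print(global_flag.group(1))     # C

# (?s:...) limits DOTALL to the group; any `.` outside keeps its default meaning.
scoped_flag = re.search(r'answer(?s:.{1,30}?)([ABCD1234])\W', text)
print(scoped_flag.group(1))     # C
```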
<br>
### non-greedy, lazy mode
- ### case1

- ### case2
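- Input: `"正確答案是 C:必須以 DNA 為鑄模"` ("The correct answer is C: DNA must be used as the template"). The match should be `答案...C`, not `答案...A` (the `A` in `DNA`)
```python
import re
text = '正確答案是 C:必須以 DNA 為鑄模'
matcher = re.search(
    r'(?s)答案.{1,30}?([ABCD1234])[^\w]', text)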
```
- ### References
- [Regular expression in regards to question mark "lazy" mode](https://stackoverflow.com/questions/25728576/) :+1:
- Lazy is the opposite of greedy:
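The difference between the lazy and the greedy quantifier shows up on the case2 sentence. This is a sketch that assumes the pattern is anchored on `答案` (the Chinese word for "answer") and uses `\W` as shorthand for `[^\w]`.

```python
import re

text = '正確答案是 C:必須以 DNA 為鑄模'

# Lazy: `.{1,30}?` grows one character at a time and stops at the first candidate.
lazy = re.search(r'答案.{1,30}?([ABCD1234])\W', text)
# Greedy: `.{1,30}` grabs as much as possible, then backtracks to the last candidate.
greedy = re.search(r'答案.{1,30}([ABCD1234])\W', text)

print(lazy.group(1))    # C
print(greedy.group(1))  # A  (the final letter of "DNA")
```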
<br>
### \b (A word boundary)
- ### case1
```python
import re
re.search(r'[Cc]\b', 'The answer is C.')
re.search(r'[Cc]\b', 'The answer is C-python.')
```
Output (the same for both calls):
```
<re.Match object; span=(14, 15), match='C'>
```
- ### case2
```python
import re
re.search(r'[Cc]\b', 'The answer is Cat.')
re.search(r'[Cc]\b', 'The answer is CPU.')
```
Output (the same for both calls):
```
None
```
- ### case3
```python
import re
re.search(r'[Cc]\b', 'The answer is ABC.')
re.search(r'\b[Cc]\b', 'The answer is ABC.')
```
Output:
```
<re.Match object; span=(16, 17), match='C'>
None
```
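The three cases can be summarized side by side. A small sketch, reusing the sentences from the cases above; `loose` means a boundary only after the letter, `strict` means a boundary on both sides.

```python
import re

samples = [
    'The answer is C.',
    'The answer is C-python.',
    'The answer is Cat.',
    'The answer is ABC.',
]

results = {}
for s in samples:
    loose = re.search(r'[Cc]\b', s) is not None     # boundary after the letter
    strict = re.search(r'\b[Cc]\b', s) is not None  # boundary on both sides
    results[s] = (loose, strict)
    print(f'{s!r}: loose={loose}, strict={strict}')
```

Only `ABC.` separates the two patterns: its `C` ends a word but does not start one.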
<br>
## String operations
### sub
```python=
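import re
line = "text with<[1> tags</[1> inside"  # hypothetical sample input
line = re.sub(r"</?\[\d+>", "", line)    # strip markers such as <[1> and </[1>
print(line)                              # text with tags inside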
```
- [How to input a regex in string.replace?](https://stackoverflow.com/questions/5658369/)
<br>
### References
- ### [Case insensitive regular expression without re.compile?](https://stackoverflow.com/questions/500864/)
```python
re.search('test', 'TeSt', re.IGNORECASE)
re.match('test', 'TeSt', re.IGNORECASE)
re.sub('test', 'xxxx', 'Testing', flags=re.IGNORECASE)
```
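Two other ways to get case-insensitive matching are sketched below, along with the reason `flags=` is passed by keyword in `re.sub`: its fourth positional parameter is `count`, not `flags`. The sample strings are illustrative.

```python
import re

# Inline flag: the pattern itself carries the case-insensitivity.
inline = re.search(r'(?i)test', 'TeSt')
print(inline.group())  # TeSt

# Compiled pattern: set the flag once and reuse the object.
pat = re.compile(r'test', re.IGNORECASE)
compiled_result = pat.sub('xxxx', 'Testing')
print(compiled_result)  # xxxxing

# count and flags together, both by keyword: replace only the first hit.
first_only = re.sub(r'test', 'xxxx', 'Testing TEST', count=1, flags=re.IGNORECASE)
print(first_only)  # xxxxing TEST
```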
<br>
## Combined exercise
### Find all Q/A pairs
- ### Question
```python
import re
s = """
Output format example:
<conversation>
Q: (statement from student1)
A: (statement from teacher1)
Q: (statement from student2)
A: (statement from teacher2)
Q: (statement from student3)
A: (statement from teacher3)
</conversation>
END
"""
re.compile(r'.*?(<conversation>\n(Q:[^\n]+\nA:[^\n]+\n){3,})',re.DOTALL).match(s).group(2)
```
How can I extract Q/A1, Q/A2, Q/A3 in order, given that a Q or an A may span several lines?
- ### gpt-4o-2024-08-06
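If both Q and A may span several lines, the regular expression can be adjusted to handle multi-line questions and answers: let each Q and A part run over multiple lines until the next Q or A begins. A solution: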
```python
import re
s = """
Output format example:
<conversation>
Q: (statement from student1 line 1)
(statement from student1 line 2)
A: (statement from teacher1 line 1)
(statement from teacher1 line 2)
Q: (statement from student2 line 1)
(statement from student2 line 2)
A: (statement from teacher2 line 1)
(statement from teacher2 line 2)
(statement from teacher2 line 3)
Q: (statement from student3)
A: (statement from teacher3)
</conversation>
END
"""
# Regex that matches each Q/A pair; both Q and A may span several lines
qa_pairs = re.findall(r'(Q:(?:[^\n]*\n?)+?)(A:(?:[^\n]*\n?)+?)(?=Q:|</conversation>)', s, re.DOTALL)
# Print each Q/A pair in order
for i, (q, a) in enumerate(qa_pairs, 1):
    print(f"Q/A{i}:")
    print(q.strip())
    print(a.strip())
    print()
```
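### Explanation:
- `Q:(?:[^\n]*\n?)+?`: matches the part that starts with `Q:` and allows the question to span lines. `(?:[^\n]*\n?)+?` consumes one line at a time, lazily, and stops as soon as the following `A:` can match.
- `A:(?:[^\n]*\n?)+?`: likewise matches the part that starts with `A:` and allows a multi-line answer.
- `(?=Q:|</conversation>)`: a lookahead that ends the current A part when the next `Q:` or `</conversation>` is reached.
### Output: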
```
Q/A1:
Q: (statement from student1 line 1)
(statement from student1 line 2)
A: (statement from teacher1 line 1)
(statement from teacher1 line 2)
Q/A2:
Q: (statement from student2 line 1)
(statement from student2 line 2)
A: (statement from teacher2 line 1)
(statement from teacher2 line 2)
(statement from teacher2 line 3)
Q/A3:
Q: (statement from student3)
A: (statement from teacher3)
```
This regular expression handles Q and A parts that span several lines and extracts them in order.
- ### Final implementation
```python=
import re


def get_answer_for_qa_pairs(response: str):
    r"""
    param
    -----
    response: str
        Output format example:

            <conversation>
            Q: (statement from student1)
            A: (statement from teacher1)
            Q: (statement from student2)
            ...1
            ...2
            A: (statement from teacher2)
            ...3
            ...4
            Q: (statement from student3)
            A: (statement from teacher3)
            </conversation>
            END

    return
    ------
    [('(statement from student1)', '(statement from teacher1)'),
     ('(statement from student2)\n...1\n...2',
      '(statement from teacher2)\n...3\n...4'),
     ('(statement from student3)', '(statement from teacher3)')]
    """
    # find `<conversation>...</conversation>`
    block_pattern = re.compile(
        r'.*?(<conversation>.*</conversation>).*',
        re.DOTALL
    )
    matcher = block_pattern.match(response)
    if matcher is None:
        raise ValueError('failed to find `<conversation>...</conversation>`')
    content = matcher.group(1)

    # find all qa pairs; each part runs line by line until the next marker
    qa_pairs = re.findall(
        r'(Q:(?:[^\n]*\n)+?)(A:(?:[^\n]*\n)+?)(?=Q:|</conversation>)',
        content
    )

    # drop the `Q:` / `A:` prefix and the surrounding whitespace
    prefix_pattern = re.compile(r'[QA]:(.*)', re.DOTALL)
    for idx, (q, a) in enumerate(qa_pairs):
        q = prefix_pattern.match(q).group(1).strip()
        a = prefix_pattern.match(a).group(1).strip()
        qa_pairs[idx] = (q, a)
    return qa_pairs


s = """
Output format example:
<conversation>
Q: (statement from student1)
A: (statement from teacher1)
Q: (statement from student2)
...1
...2
A: (statement from teacher2)
...3
...4
Q: (statement from student3)
A: (statement from teacher3)
</conversation>
END
"""
get_answer_for_qa_pairs(s)
```
Output:
```
[('(statement from student1)', '(statement from teacher1)'),
('(statement from student2)\n...1\n...2',
'(statement from teacher2)\n...3\n...4'),
('(statement from student3)', '(statement from teacher3)')]
```