Python / re
===

###### tags: `Python`, `PyPi`, `re`, `regular expression`, `regexp`, `reg-exp`, `regex`, `re-gex`, `zero-width positive lookahead`, `zero-width positive lookbehind`, `\b`, `word boundary`, `raw string`

[TOC]

## Documentation
- [Pattern (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) (comprehensive) :+1:

## raw string
> The string notation commonly used for regex patterns
> `r"..."`: raw string

### Python pitfall: a raw string cannot end with a single backslash
A raw string literal cannot end with an odd number of backslashes, because the backslash still protects the closing quote when the literal is tokenized. This is documented behavior rather than an interpreter bug.

```python
s1 = '\\A\\B'
print(f'len(s1) = {len(s1)} -> {s1}')   # len(s1) = 4 -> \A\B

s2 = r'\\A\\B'
print(f'len(s2) = {len(s2)} -> {s2}')   # len(s2) = 6 -> \\A\\B

s3 = s2.replace('\\', '*')              # replace every single backslash
print(f'len(s3) = {len(s3)} -> {s3}')   # len(s3) = 6 -> **A**B

s4 = s2.replace(r'\\', '*')             # replace every pair of backslashes
print(f'len(s4) = {len(s4)} -> {s4}')   # len(s4) = 4 -> *A*B

#s5 = s2.replace(r'\', '*')             # SyntaxError: the raw string ends with a single backslash
#print(f'len(s5) = {len(s5)} -> {s5}')
```

- Output:
  ```
  len(s1) = 4 -> \A\B
  len(s2) = 6 -> \\A\\B
  len(s3) = 6 -> **A**B
  len(s4) = 4 -> *A*B
  ```

### References
- [What does preceding a string literal with "r" mean? [duplicate]](https://stackoverflow.com/questions/4780088/)

## lookahead & lookbehind assertions
```python=
import re
x = 'A1A-A2B-B3A-B4B'
print(re.search(r'\d', x))              # matches '1'
# lookbehind: (?<=...) positive, (?<!...) negative
print(re.search(r'(?<=B)\d', x))        # matches '3'
print(re.search(r'(?<!B)\d', x))        # matches '1'
# lookahead: (?=...) positive, (?!...) negative
print(re.search(r'\d(?=B)', x))         # matches '2'
print(re.search(r'\d(?!B)', x))         # matches '1'
print('-' * 50)
# lookbehind and lookahead combined
print(re.search(r'(?<=A)\d(?=A)', x))   # matches '1'
print(re.search(r'(?<=A)\d(?=B)', x))   # matches '2'
print(re.search(r'(?<=B)\d(?=A)', x))   # matches '3'
print(re.search(r'(?<=B)\d(?=B)', x))   # matches '4'
```
- lookahead: attaches a condition to the text that follows the match (here, what comes right after `\d`)
- lookbehind: attaches a condition to the text that precedes the match (here, what comes right before `\d`)
- positive: the condition must hold
- negative: the condition must not hold

### Tutorial example
- [RegExp 應用： lookahead , lookbehind](https://darkk6.blogspot.com/2017/03/regexp-lookahead-lookbehind.html)
  - Data: `12345 XD Hi12345678ab666666cd987654321`
  - Goal: match `12345678` and `666666`, but do not match `98765432` or `87654321` inside `987654321`
  - Pattern: `(?<!\d)\d{6,8}(?!\d)`

### Exercise 1
- ### Data:
  ```
  >>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2615,"total_tokens":2615,"completion_tokens":0}}'
  >>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"content":" Yes"}}],"usage":{"prompt_tokens":2615,"total_tokens":2616,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":" "}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
  >>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"content":" ("}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-b5c1792fdf7f4e0d86800f041ea3bb7d",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-3100b13ffba346eb847272368cae61dd",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4",""choices":[{"index":0,"delta":{"content":"\xe7\x84\xa1"}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-54fb4102c0704d57a6820d0e343f4c56",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-4d31f93a866c436aa646b5ab78f6c8bc",""choices":[{"index":0,"delta":{"content":" rain"}}],"usage":{"prompt_tokens":2567,"total_tokens":2568,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-5e0b28d1f8974d57a03587a1df97fd6f",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-c535fdee63204413b5003c10c9810440",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2615,"total_tokens":2616,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-b5c1792fdf7f4e0d86800f041ea3bb7d",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-3100b13ffba346eb847272368cae61dd",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  ```
- ### Goal: keep only the lines with `"id":"chat-23218ab41db347c986c001cb5de2d776"`
- ### Solution:
  - Search: `^.*chat-(?!23218ab41db347c986c001cb5de2d776).*\n?`
  - Replace with: (empty string)
- ### Resulting data
  ```
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2566,"completion_tokens":0}}'
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":" "}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  >>> b'data: {"id":"chat-23218ab41db347c986c001cb5de2d776",""choices":[{"index":0,"delta":{"content":""}}],"usage":{"prompt_tokens":2566,"total_tokens":2567,"completion_tokens":1}}'
  ```

## Hints
- ### Describing a pattern as a text diagram
  ```python
  import re

  # ^(.*?)--(--[.]-+-------------[1]--)-------------------$
  #          |-[_]-|  |-[R]-|  |-[2]-|   |-(--[.]-+.*)-|
  #          |-[-]-|  |-[r]-|               |-[_]-|
  #          |-[+]-|                        |-[-]-|
  #                                         |-[+]-|
  # `-` goes last inside the class so it is literal; `[12]` means "1 or 2" (no `|` needed).
  BASENAME_FORMAT = r'^(.*?)([._+-]+[Rr]?[12])([._+-]+.*)?$'
  BASENAME_PATTERN = re.compile(BASENAME_FORMAT)
  ```
- ### Characters listed inside `[...]` do not need escaping
  ```python
  import re
  m = re.compile('.*?([._+-]+)[12](.fq)').match('abc*-1*fq')
  m.groups()
  ```
  Output
  ```
  ('-', '*fq')
  ```
  - The `*` in `*-1` is not captured, because `*` is not listed in the character class
  - The `*` in `*fq` is captured, because the unescaped `.` outside the class matches any character
  - `[._+-]+` does not need to be written as `[\._+-]+`
- ### error: bad character range
  ```python
  import re
  mm = re.compile('.*?([._+-]+)[12](.fq)').match('abc*-1*fq')
  # The next line raises re.error: bad character range _-+
  #m = re.compile('.*?([._-+]+)[12](.fq)').match('abc*-1*fq')
  mm.groups()
  ```
  - Inside `[...]`, the character `-` has a special meaning: it denotes a range
    - `A-Z`: ABC..XYZ
    - `A-z`: ABC..XYZabc..xyz
      - It also includes the characters that sit between `Z` and `a`: `[\]^_` and the backtick
  - Comparison
    - In `[._-+]+`, `_-+` is read as a range from `_` to `+`; since `_` sorts after `+`, the range is invalid
    - `[._+-]+` has no such problem, because a `-` at the end of the class is literal

### Non-strict JSON --> strict JSON
```python=
import ast
import json
import re

data = 'data: {content:"",stop:False}'
data = '{' + data + '}'
# Quote every bare key that directly follows `{` or `,`.
data = re.sub(r'(?<=[{,])\s*(\w+):', r'"\1":', data)
print(data)
# `False` is a Python literal, so parse with ast.literal_eval (safer than eval) before dumping.
json.dumps(ast.literal_eval(data))
```
- Test data
  - `'data: {"content":"","stop":False}'`
  - `'data: {content:"",stop:False}'`
- References
  - [Negative look ahead python regex](https://stackoverflow.com/questions/9952169/)
    - Lookaheads are "zero-width", meaning they do not consume any characters.
      For example, these two expressions will never match:
      - `(?=foo)bar`
      - `(?!foo)foo`
  - [[Java Regular Expressions] Pattern (Java Platform SE 8 )](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)
    - zero-width positive lookahead

## case studies
### single
- ### case1:
  - 'the correct answer is:\nC:'
  ```python
  text = 'the correct answer is:\nC:'
  matcher = re.search(r'(?s)answer.{1,30}?([ABCD1234])[^\w]', text)
  ```
- ### References
  - [matching any character including newlines in a Python regex subexpression, not globally](https://stackoverflow.com/questions/33312175/)
    - `(?s)` - the inline modifier group with s flag on sets a scope where all . patterns match any char including line break chars

### non-greedy, lazy mode
- ### case1
  ![image.png](https://hackmd.io/_uploads/SknuHPPQa.png)
- ### case2
  - "The correct answer is C: DNA must be used as the template" -> should match "answer...C", not "answer...A" (the `A` in `DNA`)
  ```python
  text = 'The correct answer is C: DNA must be used as the template'
  matcher = re.search(r'(?s)answer.{1,30}?([ABCD1234])[^\w]', text)
  ```
- ### References
  - [Regular expression in regards to question mark "lazy" mode](https://stackoverflow.com/questions/25728576/) :+1:
    - Lazy is the opposite of greedy: a lazy quantifier matches as few characters as possible, a greedy one as many as possible.

### \b (A word boundary)
- ### case1
  ```python
  import re
  re.search(r'[Cc]\b', 'The answer is C.')
  re.search(r'[Cc]\b', 'The answer is C-python.')
  ```
  Output (the same for both calls):
  ```
  <re.Match object; span=(14, 15), match='C'>
  ```
- ### case2
  ```python
  import re
  re.search(r'[Cc]\b', 'The answer is Cat.')
  re.search(r'[Cc]\b', 'The answer is CPU.')
  ```
  Output (the same for both calls):
  ```
  None
  ```
- ### case3
  ```python
  import re
  re.search(r'[Cc]\b', 'The answer is ABC.')
  re.search(r'\b[Cc]\b', 'The answer is ABC.')
  ```
  Output:
  ```
  <re.Match object; span=(16, 17), match='C'>
  None
  ```

## String operations
### sub
```python=
import re
line = re.sub(r"</?\[\d+>", "", line)
```
- [How to input a regex in string.replace?](https://stackoverflow.com/questions/5658369/)

### References
- ### [Case insensitive regular expression without re.compile?](https://stackoverflow.com/questions/500864/)
  ```python
  re.search('test', 'TeSt', re.IGNORECASE)
  re.match('test', 'TeSt', re.IGNORECASE)
  re.sub('test', 'xxxx', 'Testing', flags=re.IGNORECASE)
  ```

## Comprehensive exercise
### Find all Q/A pairs
- ### Question
  ```python
  import re
  s = """
  Output format example:
  <conversation>
  Q: (statement from student1)
  A: (statement from teacher1)
  Q: (statement from student2)
  A: (statement from teacher2)
  Q: (statement from student3)
  A: (statement from teacher3)
  </conversation>
  END
  """
  re.compile(r'.*?(<conversation>\n(Q:[^\n]+\nA:[^\n]+\n){3,})',re.DOTALL).match(s).group(2)
  ```
  How can I extract Q/A1, Q/A2, and Q/A3 in order? A Q or an A may also span multiple lines.
- ### gpt-4o-2024-08-06
  If both Q and A may span multiple lines, the regular expression can be adjusted so that it handles multi-line questions and answers correctly. Concretely, each Q part and each A part is allowed to contain multiple lines, up to the next Q or A. Here is the solution:
  ```python
  import re
  s = """
  Output format example:
  <conversation>
  Q: (statement from student1 line 1)
  (statement from student1 line 2)
  A: (statement from teacher1 line 1)
  (statement from teacher1 line 2)
  Q: (statement from student2 line 1)
  (statement from student2 line 2)
  A: (statement from teacher2 line 1)
  (statement from teacher2 line 2)
  (statement from teacher2 line 3)
  Q: (statement from student3)
  A: (statement from teacher3)
  </conversation>
  END
  """
  # Match each Q/A pair, allowing both Q and A to span multiple lines
  qa_pairs = re.findall(r'(Q:(?:[^\n]*\n?)+?)(A:(?:[^\n]*\n?)+?)(?=Q:|</conversation>)', s, re.DOTALL)
  # Print each Q/A pair in order
  for i, (q, a) in enumerate(qa_pairs, 1):
      print(f"Q/A{i}:")
      print(q.strip())
      print(a.strip())
      print()
  ```
  ### Explanation:
  - `Q:(?:[^\n]*\n?)+?`: matches the part that starts with `Q:` and allows multi-line questions. `(?:[^\n]*\n?)+?` is a non-capturing group that lazily consumes one line at a time until the following `A:` can match.
  - `A:(?:[^\n]*\n?)+?`: likewise matches the part that starts with `A:` and allows multi-line answers.
  - `(?=Q:|</conversation>)`: a lookahead that ends the current A part as soon as the next `Q:` or `</conversation>` is reached.

  ### Output:
  ```
  Q/A1:
  Q: (statement from student1 line 1)
  (statement from student1 line 2)
  A: (statement from teacher1 line 1)
  (statement from teacher1 line 2)

  Q/A2:
  Q: (statement from student2 line 1)
  (statement from student2 line 2)
  A: (statement from teacher2 line 1)
  (statement from teacher2 line 2)
  (statement from teacher2 line 3)

  Q/A3:
  Q: (statement from student3)
  A: (statement from teacher3)
  ```
  This regular expression correctly handles the case where both Q and A may span multiple lines, and extracts the pairs in order.
- ### Implementation
  ```python=
  import re


  def get_answer_for_qa_pairs(response: str):
      r"""
      param
      -----
      response: str
          Output format example:
          <conversation>
          Q: (statement from student1)
          A: (statement from teacher1)
          Q: (statement from student2)
          ...1
          ...2
          A: (statement from teacher2)
          ...3
          ...4
          Q: (statement from student3)
          A: (statement from teacher3)
          </conversation>
          END

      return
      ------
      [('(statement from student1)', '(statement from teacher1)'),
       ('(statement from student2)\n...1\n...2', '(statement from teacher2)\n...3\n...4'),
       ('(statement from student3)', '(statement from teacher3)')]
      """
      # find `<conversation>...</conversation>`
      pattern = re.compile(
          r'.*?(<conversation>.*</conversation>).*',
          re.DOTALL
      )
      matcher = pattern.match(response)
      if matcher is None:
          raise ValueError('failed to find `<conversation>...</conversation>`')
      content = matcher.group(1)

      # find all qa pairs
      qa_pairs = re.findall(
          r'(Q:(?:[^\n]*\n)+?)(A:(?:[^\n]*\n)+?)(?=Q:|</conversation>)',
          content,
          re.DOTALL
      )
      # strip the leading `Q:` / `A:` marker and surrounding whitespace
      pattern = re.compile(r'[QA]:(.*)', re.DOTALL)
      for idx, (q, a) in enumerate(qa_pairs):
          q = pattern.match(q).group(1).strip()
          a = pattern.match(a).group(1).strip()
          qa_pairs[idx] = (q, a)
      return qa_pairs


  s = """
  Output format example:
  <conversation>
  Q: (statement from student1)
  A: (statement from teacher1)
  Q: (statement from student2)
  ...1
  ...2
  A: (statement from teacher2)
  ...3
  ...4
  Q: (statement from student3)
  A: (statement from teacher3)
  </conversation>
  END
  """
  get_answer_for_qa_pairs(s)
  ```
  Output:
  ```
  [('(statement from student1)', '(statement from teacher1)'),
   ('(statement from student2)\n...1\n...2', '(statement from teacher2)\n...3\n...4'),
   ('(statement from student3)', '(statement from teacher3)')]
  ```
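
Going back to the digit-run example in the lookahead and lookbehind section, the following is a minimal sketch that checks the pattern `(?<!\d)\d{6,8}(?!\d)` against the sample data and compares it with the same pattern without lookarounds. The variable names (`text`, `guarded`, `naive`) are only illustrative.

```python
import re

text = '12345 XD Hi12345678ab666666cd987654321'

# A run of 6 to 8 digits that is neither preceded nor followed by another digit.
guarded = re.findall(r'(?<!\d)\d{6,8}(?!\d)', text)
print(guarded)   # ['12345678', '666666']

# Without the lookarounds, the first 8 digits of the 9-digit run also match.
naive = re.findall(r'\d{6,8}', text)
print(naive)     # ['12345678', '666666', '98765432']
```

The negative lookbehind rejects a start position in the middle of a longer digit run, and the negative lookahead rejects an end position in the middle of one, so only complete runs of 6 to 8 digits survive.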
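
The search-and-replace from Exercise 1 can probably also be done in Python rather than in an editor. The sketch below assumes the log is available as one multi-line string; the shortened log lines and the names `KEEP_ID`, `log`, and `drop_other_ids` are hypothetical stand-ins, not part of the original data.

```python
import re

KEEP_ID = '23218ab41db347c986c001cb5de2d776'

# Shortened, hypothetical stand-ins for the log lines in Exercise 1.
log = (
    'data: {"id":"chat-c535fdee63204413b5003c10c9810440","n":1}\n'
    'data: {"id":"chat-23218ab41db347c986c001cb5de2d776","n":2}\n'
    'data: {"id":"chat-56820a809a78404db2258cae01f2a5a4","n":3}\n'
    'data: {"id":"chat-23218ab41db347c986c001cb5de2d776","n":4}\n'
)

# re.MULTILINE lets `^` match at the start of every line, so each line whose
# `chat-` is NOT followed by KEEP_ID is removed together with its newline.
drop_other_ids = re.compile(
    rf'^.*chat-(?!{re.escape(KEEP_ID)}).*\n?',
    re.MULTILINE,
)
filtered = drop_other_ids.sub('', log)
print(filtered, end='')
```

Because `.` does not match a newline by default, each match stays within a single line, which is what makes the per-line deletion work.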