Python [Regular Expression](https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F)
===
### Preliminaries (not regex yet)
#### Attribute lookup through class inheritance
```python
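# Python 2: class A(object) is a new-style class, class A is an old-style class.
# Python 3: every class is new-style, so both spellings behave the same.
In [1]: class A:
   ...:     def __init__(self):
   ...:         self.a = 100
   ...:
In [2]: class B(A):
   ...:     pass
   ...:
In [3]: class C(A):
   ...:     pass
   ...:
In [4]: class D(B, C):
   ...:     pass
   ...: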
In [5]: d = D()
In [6]: d.a
Out[6]: 100
```
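- New-style classes (every class in Python 3) search in MRO order (C3 linearization), which looks breadth-first for this diamond
  - D > B > C > A
- Old-style classes (Python 2 only) search depth-first, left to right
  - D > B > A, then C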
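
The search order for new-style classes can be inspected directly. A minimal sketch, assuming Python 3 and reusing the class names from the session above; `mro_names` is a name introduced only for this illustration:

```python
class A:
    def __init__(self):
        self.a = 100


class B(A):
    pass


class C(A):
    pass


class D(B, C):
    pass


# __mro__ is the exact sequence of classes Python searches for an attribute.
mro_names = [cls.__name__ for cls in D.__mro__]
print(mro_names)  # ['D', 'B', 'C', 'A', 'object']
```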
#### Looking up class attributes and instance attributes
```python
In [1]: class Foo(object):
   ...:     # pro is short for property
   ...:     c_pro = 10  # class attribute
   ...:     def __init__(self):
   ...:         self.obj_pro = 100  # instance (object) attribute
   ...:
In [2]: foo = Foo()
# How does foo know which attributes Foo provides?
In [3]: dir(foo)
Out[3]:
['__class__',
'__delattr__',
'__dict__',
'__dir__',
...
'__weakref__',
'c_pro',
'obj_pro']
# Instance attributes are stored in the object's own __dict__
In [4]: foo.__dict__
Out[4]: {'obj_pro': 100}
In [5]: dir(Foo)
Out[5]:
['__class__',
'__delattr__',
'__dict__',
'__dir__',
...
'__weakref__',
'c_pro']
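# Once a class is created it has its own __dict__ (a "magic" attribute), shown as a read-only mappingproxy
# Everything defined in the class body (class attributes, methods) is stored in this dictionary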
In [6]: Foo.__dict__
Out[6]:
mappingproxy({'__module__': '__main__',
'c_pro': 10,
'__init__': <function __main__.Foo.__init__(self)>,
'__dict__': <attribute '__dict__' of 'Foo' objects>,
'__weakref__': <attribute '__weakref__' of 'Foo' objects>,
'__doc__': None})
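# foo searches its own __dict__ first, and obj_pro is there
In [7]: foo.obj_pro
Out[7]: 100
# foo searches its own __dict__ first: c_pro is not there
# then it searches the class __dict__, where c_pro exists
In [8]: foo.c_pro
Out[8]: 10
# If neither has the attribute, AttributeError is raised
In [9]: foo.haha
AttributeError: 'Foo' object has no attribute 'haha'
# What if we want to handle access to an attribute that was never defined?
# __getattr__ > get attribute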
```
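The lookup order traced in the session above (instance `__dict__`, then the class, then `__getattr__` as a last resort) can be sketched as follows. The default string returned by `__getattr__` is an assumption made for illustration, not part of the original notes:

```python
class Foo:
    c_pro = 10  # class attribute

    def __init__(self):
        self.obj_pro = 100  # instance attribute

    def __getattr__(self, name):
        # Called only after the normal lookup has already failed.
        return f"<default for {name}>"


foo = Foo()
results = (foo.obj_pro, foo.c_pro, foo.haha)
print(results)  # (100, 10, '<default for haha>')
```

Because `__getattr__` is only a fallback, `obj_pro` and `c_pro` never reach it.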
#### Printing think different itcast
```python
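# coding:utf-8
class foo(object):
    def __init__(self):
        pass

    def __getattr__(self, item):
        # item is not returned, so print it here ourselves
        print(item, "", end="")
        # return item  # think about why this would fail
        return self

    # print(foo()) prints the object through __str__, so define it as an empty str
    def __str__(self):
        return ""


print(foo().think.different.itcast)
# Question: what should __getattr__ return?
# If it returned item:
#   foo().think would evaluate to the str "think",
#   and the next access, "think".different, would fail with
#   AttributeError: 'str' object has no attribute 'different'
# So it must return self, which turns the next step into foo().different
#
# Summary: when normal lookup (instance __dict__, then the class) fails, Python calls __getattr__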
```
#### \_\_getattribute\_\_ vs \_\_getattr\_\_
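- Q. Who decides that `__dict__` is searched first and `__getattr__` afterwards?
  - `__getattribute__`
  - Every `xx.xx` access calls `__getattribute__`; Python implements the default lookup order there, and falls back to `__getattr__` when it raises AttributeError
  - When overriding `__getattribute__`, never write `self.xx` inside it: that access calls `__getattribute__` again and recurses without end. Delegate to `super().__getattribute__(name)` instead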
Regex: an expression that describes a pattern (a set of rules) for text
---
### match
```python
In [1]: import re
In [2]: a = "abcde"
In [3]: b = "abcdf"
In [4]: c = re.match(a, b)
In [5]: print(c)
None # it returns None rather than "", because an empty str can itself be a valid match
In [6]: b = "abcde"
In [7]: d = re.match(a, b)
In [8]: print(d)
<re.Match object; span=(0, 5), match='abcde'>
In [11]: d.group() # returns the matched text
Out[11]: 'abcde'
# match is anchored at the start of the string; the string may continue past the match
In [12]: e = re.match("abcde", "abcdefgh")
In [13]: e.group()
Out[13]: 'abcde'
```
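Since `re.match` returns `None` on failure, calling `.group()` on the result without a check raises `AttributeError`. A small sketch of the usual guard; `matched_prefix` is a hypothetical helper name:

```python
import re


def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    m = re.match(pattern, text)
    return m.group() if m else None


print(matched_prefix("abcde", "abcdefgh"))  # abcde
print(matched_prefix("abcde", "abcdf"))     # None
```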
### Characters
```python
# . > any character (except \n)
In [18]: re.match(".", "abc")
Out[18]: <re.Match object; span=(0, 1), match='a'>
In [20]: print(re.match(".", "\n"))
None
In [22]: print(re.match("..", "a"))
None # the pattern needs two characters but there is only one
In [23]: print(re.match("..", "ab"))
<re.Match object; span=(0, 2), match='ab'>
In [24]: print(re.match("..", "abc"))
<re.Match object; span=(0, 2), match='ab'>
# \d > a digit
In [25]: print(re.match("\d", "1"))
<re.Match object; span=(0, 1), match='1'>
In [26]: print(re.match("\d", "2"))
<re.Match object; span=(0, 1), match='2'>
In [27]: print(re.match("\d", "a"))
None
# \D > a non-digit
In [29]: print(re.match("\D", "a"))
<re.Match object; span=(0, 1), match='a'>
# \s > whitespace, including \r, \n, \t, etc.
In [35]: print(re.match("\s", " a"))
<re.Match object; span=(0, 1), match=' '>
In [36]: print(re.match("\s", "\ra"))
<re.Match object; span=(0, 1), match='\r'>
In [37]: print(re.match("\s", "\na"))
<re.Match object; span=(0, 1), match='\n'>
In [38]: print(re.match("\s", "\ta"))
<re.Match object; span=(0, 1), match='\t'>
# \S > non-whitespace
In [39]: print(re.match("\S", "\ta"))
None
# \w > word character: A-Z, a-z, 0-9, _ (str patterns also accept Unicode letters and digits unless re.ASCII is set)
In [40]: print(re.match("\w", "\ta"))
None
In [41]: print(re.match("\w", "-a"))
None
In [42]: print(re.match("\w", "@a"))
None
In [43]: print(re.match("\w", "_a"))
<re.Match object; span=(0, 1), match='_'>
In [44]: print(re.match("\w", "Aa"))
<re.Match object; span=(0, 1), match='A'>
# \W > non-word character
In [46]: print(re.match("\W", "Aa"))
None
# Matching rule: if any part of the pattern fails, the result is None
In [47]: print(re.match("\w\W", "Aa"))
None
In [48]: print(re.match("\w\w", "Aa"))
<re.Match object; span=(0, 2), match='Aa'>
# Application: mobile phone numbers
In [49]: print(re.match("09\d\d\d\d\d\d\d\d", "0987654321"))
<re.Match object; span=(0, 10), match='0987654321'>
In [50]: print(re.match("09\d\d\d\d\d\d\d\d", "1987654321"))
None
In [51]: print(re.match("09\d\d\d\d\d\d\d\d", "098765432a"))
None
# [] > character set: any one of the listed characters
# Note: no commas between the characters
In [53]: print(re.match("1[3456]", "14"))
<re.Match object; span=(0, 2), match='14'>
In [52]: print(re.match("1[3456]", "19"))
None
In [57]: print(re.match("1[a-z0-9]", "1a"))
<re.Match object; span=(0, 2), match='1a'>
In [58]: print(re.match("1[a-z0-9]", "1z"))
<re.Match object; span=(0, 2), match='1z'>
In [59]: print(re.match("1[a-z0-9]", "13"))
<re.Match object; span=(0, 2), match='13'>
In [60]: print(re.match("1[a-z0-9]", "1A"))
None
# [^] > negated character set
In [54]: print(re.match("1[^3456]", "19"))
<re.Match object; span=(0, 2), match='19'>
In [55]: print(re.match("1[^3456]", "1a"))
<re.Match object; span=(0, 2), match='1a'>
In [56]: print(re.match("1[^3456]", "1@"))
<re.Match object; span=(0, 2), match='1@'>
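# . == [^\n]
# \d == [0-9]         (exactly so with re.ASCII; str patterns also match other Unicode digits)
# \D == [^0-9]        (with re.ASCII)
# \w == [0-9A-Za-z_]  (with re.ASCII) ...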
```
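The equivalences above hold exactly only in ASCII mode. A short sketch, assuming Python 3 `str` patterns, that contrasts the default Unicode behaviour with `re.ASCII`; the sample characters are chosen only for illustration:

```python
import re

e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

unicode_word = re.match(r"\w", e_acute)            # matches: Unicode-aware by default
ascii_word = re.match(r"\w", e_acute, re.ASCII)    # None: \w limited to [a-zA-Z0-9_]
unicode_digit = re.match(r"\d", arabic_three)      # matches: any Unicode decimal digit
ascii_digit = re.match(r"\d", arabic_three, re.ASCII)  # None: \d limited to [0-9]
```

When only `0-9` should be accepted, writing `[0-9]` or passing `re.ASCII` makes that explicit.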
#### Quantifiers
```python
# * > zero or more times
In [61]: print(re.match("\d*", "")) # 0 times
<re.Match object; span=(0, 0), match=''>
In [62]: print(re.match("\d*", "123")) # 3 times
<re.Match object; span=(0, 3), match='123'>
In [63]: print(re.match("\d*", "abc")) # 0 times
<re.Match object; span=(0, 0), match=''>
# "abc" can be read as "" + "abc", and the leading "" satisfies \d*
# + > one or more times
In [64]: print(re.match("\d+", "123"))
<re.Match object; span=(0, 3), match='123'>
In [65]: print(re.match("\d+", "abc"))
None
# \d+ needs at least one digit at the very first position, so a non-digit first character gives None
In [66]: print(re.match("\d+", "a1bc"))
None
In [67]: print(re.match("\d+", "1abc"))
<re.Match object; span=(0, 1), match='1'>
# ? > zero or one time
# "\d?[a-z]" > an optional digit followed by a-z
In [68]: print(re.match("\d?[a-z]", "1abc")) # a digit, then a
<re.Match object; span=(0, 2), match='1a'>
In [69]: print(re.match("\d?[a-z]", "abc")) # no digit, then a
<re.Match object; span=(0, 1), match='a'>
In [70]: print(re.match("\d?[a-z]", "12abc")) # a digit, then 2, which is not a-z > None
None
In [75]: print(re.match("\d*[a-z]", "12abc")) # two digits, then a
<re.Match object; span=(0, 3), match='12a'>
In [76]: print(re.match("\d+[a-z]", "12abc")) # two digits, then a
<re.Match object; span=(0, 3), match='12a'>
# {m} > exactly m times
In [77]: print(re.match("\d{3}[a-z]", "123abc")) # == \d\d\d[a-z]
<re.Match object; span=(0, 4), match='123a'>
In [78]: print(re.match("\d{4}[a-z]", "123abc"))
None
In [79]: print(re.match("\d{2}[a-z]", "123abc"))
None
# {m,} > at least m times
In [80]: print(re.match("\d{2,}[a-z]", "123abc")) # at least twice
<re.Match object; span=(0, 4), match='123a'>
# * == {0,}
# + == {1,}
# {m,n} > from m to n times
In [81]: print(re.match("\d{2,4}[a-z]", "123abc"))
<re.Match object; span=(0, 4), match='123a'>
In [82]: print(re.match("\d{2,4}[a-z]", "12345abc"))
None
In [83]: print(re.match("\d{2,4}[a-z]", "1abc"))
None
# ? == {0,1}
# Application: mobile phone numbers
In [84]: print(re.match("09\d{8}", "0987654321"))
<re.Match object; span=(0, 10), match='0987654321'>
```
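`re.match` only anchors the start, so `09\d{8}` would still accept a longer string whose first ten characters fit. A minimal validation sketch using `re.fullmatch`, assuming the "09 plus eight digits" format used in these notes; `MOBILE` and `is_mobile` are hypothetical names:

```python
import re

MOBILE = re.compile(r"09\d{8}")  # assumed format: "09" followed by eight digits


def is_mobile(text):
    """True only when the whole string is a ten-digit number starting with 09."""
    return MOBILE.fullmatch(text) is not None


print(is_mobile("0987654321"))   # True
print(is_mobile("09876543212"))  # False: one digit too many
```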
#### Raw strings
```python
In [86]: a = "\nabc"
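In [88]: print(a)

abc
# What if we want the literal text \nabc rather than a line break?
In [89]: a = "\\nabc" # escape the backslash with another backslash
In [90]: print(a)
\nabc
# How do we match it?
In [95]: print(re.match("\\[a-z]{4}", a))
None
# Q. Why did it fail?
# Python turns "\\" into a single \, so the regex engine receives \[a-z]{4},
# where \[ is an escaped literal "[": one more level of \ is missing
In [96]: print(re.match("\\\\[a-z]{4}", a)) # the 1st \ escapes the 2nd, the 3rd escapes the 4th
<re.Match object; span=(0, 5), match='\\nabc'>
# Writing \\ for every \ is tedious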
In [98]: a = "\nabc"
In [100]: print(a)

abc
In [102]: a = r"\nabc" # r = raw: backslashes are kept exactly as written
In [103]: print(a)
\nabc
In [104]: a
Out[104]: '\\nabc'
In [108]: print(re.match(r"\\nabc", a))
<re.Match object; span=(0, 5), match='\\nabc'>
In [109]: r"\\nabc"
Out[109]: '\\\\nabc'
```
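The two escaping layers (Python string literal, then regex) can be checked side by side. A small sketch; `re.escape` is shown as one way to build a pattern that matches arbitrary text literally:

```python
import re

text = r"\nabc"  # five characters: backslash, n, a, b, c

# The doubled-up normal literal and the raw literal are the same pattern string.
same_pattern = "\\\\n[a-z]{3}" == r"\\n[a-z]{3}"

m = re.match(r"\\n[a-z]{3}", text)  # regex \\ matches one literal backslash

escaped = re.escape(text)           # escapes the backslash for us
m2 = re.match(escaped, text)
```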
#### Boundaries
```python
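# ^ > start of string; it has little visible effect with match, which already starts at the far left
# $ > end of string
# ps. Vim
# I > jump to the start of the line and insert
# A > jump to the end of the line and append
# ^ > jump to the first non-blank character of the line
# $ > jump to the end of the line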
In [2]: print(re.match(r"^09\d{8}$", "0987654321"))
<re.Match object; span=(0, 10), match='0987654321'>
In [3]: print(re.match(r"^09\d{8}$", "09876543212"))
None
# \b > word boundary
In [6]: print(re.match(r"\w+et\b", "wetdream"))
None
In [7]: print(re.match(r"\w+et\b", "wet dream"))
<re.Match object; span=(0, 3), match='wet'>
In [8]: print(re.match(r"\w+\bet\b", "w et dream"))
None # the space between "w" and "et" is never consumed by the pattern
In [9]: print(re.match(r"\w+\s\bet\b", "w et dream"))
<re.Match object; span=(0, 4), match='w et'>
In [10]: print(re.match(r".+\bet\b", "w et dream"))
<re.Match object; span=(0, 4), match='w et'>
# \B > not a word boundary
In [11]: print(re.match(r".+\bet\B", "w et dream"))
None
In [12]: print(re.match(r".+\bet\B", "w etdream"))
<re.Match object; span=(0, 4), match='w et'>
In [13]: print(re.match(r".+\Bet\B", "wetdream"))
<re.Match object; span=(0, 3), match='wet'>
```
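`\b` is most useful with `search`-style functions, where it separates whole words from substrings. A short sketch with a made-up sample sentence:

```python
import re

sentence = "cat concat cats cat."

# Start offsets of "cat" as a whole word only.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", sentence)]

# \B on both sides: "cat" strictly inside a longer word.
inside_words = re.findall(r"\Bcat\B", "concatenate cat")

print(whole_word_starts)  # [0, 16]
print(inside_words)       # ['cat']
```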
### Grouping
```python
# Match the integers 0-100
#
# Leaving 0 and 100 aside, the first digit must be 1-9 and an optional second digit is 0-9
In [30]: print(re.match(r"[1-9]\d?$", "1"))
<re.Match object; span=(0, 1), match='1'>
In [31]: print(re.match(r"[1-9]\d?$", "99"))
<re.Match object; span=(0, 2), match='99'>
# | > alternation: match either the left branch or the right branch
In [33]: print(re.match(r"[1-9]\d?$|0$|100$", "0"))
<re.Match object; span=(0, 1), match='0'>
In [34]: print(re.match(r"[1-9]\d?$|0$|100$", "100"))
<re.Match object; span=(0, 3), match='100'>
In [35]: print(re.match(r"[1-9]\d?$|0$|100$", "200"))
None
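# With the tens digit made optional, the ones digit alone can be 0
# (caution: both parts are optional here, so this pattern also matches the empty string)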
In [37]: print(re.match(r"[1-9]?\d?$|100$", "0"))
<re.Match object; span=(0, 1), match='0'>
# () > grouping
In [38]: print(re.match("<h1>.*</h1>", "<h1>Toyz</h1>"))
<re.Match object; span=(0, 13), match='<h1>Toyz</h1>'>
In [41]: result = re.match("<h1>(.*)</h1>", "<h1>Toyz</h1>")
In [43]: result.group()
Out[43]: '<h1>Toyz</h1>'
In [44]: result.group(1)
Out[44]: 'Toyz'
# <h1> is group 1, </h1> is group 2
In [45]: result = re.match("(<h1>).*(</h1>)", "<h1>Toyz</h1>")
In [46]: result.group(1)
Out[46]: '<h1>'
In [47]: result.group(2)
Out[47]: '</h1>'
In [48]: result.group(0) # group 0 (the whole match) is the default for group()
Out[48]: '<h1>Toyz</h1>'
In [49]: result.groups()
Out[49]: ('<h1>', '</h1>')
In [50]: result.groups()[0]
Out[50]: '<h1>'
In [51]: result.groups()[1]
Out[51]: '</h1>'
# >> Scenario: match only when the HTML tags are properly paired
In [52]: s = "<html><h1>Toyz</h1></html>"
In [53]: print(re.match("<.+><.+>.+</.+></.+>", s))
<re.Match object; span=(0, 26), match='<html><h1>Toyz</h1></html>'>
In [54]: s = "<html><h1>Toyz</h1></ht>" # an unpaired closing tag still matches
In [55]: print(re.match("<.+><.+>.+</.+></.+>", s))
<re.Match object; span=(0, 24), match='<html><h1>Toyz</h1></ht>'>
# \num: backreference to the text captured by group num
In [56]: print(re.match(r"<(.+)><(.+)>.+</\2></\1>", s))
None
In [57]: s = "<html><h1>Toyz</h1></html>"
In [58]: print(re.match(r"<(.+)><(.+)>.+</\2></\1>", s))
<re.Match object; span=(0, 26), match='<html><h1>Toyz</h1></html>'>
# Exercise: extract the parts of an e-mail address
In [62]: a = r"(\w+)@(gmail|yahoo|hotmail)\.(com)"
In [64]: r = re.match(a, "toyz5566@gmail.com")
In [65]: r.group(1)
Out[65]: 'toyz5566'
In [66]: r.group(2)
Out[66]: 'gmail'
In [67]: r.group(3)
Out[67]: 'com'
# (?P<name>...) > name a group
# (?P=name) > backreference to the named group
# P marks a Python-specific extension
In [24]: a = r"<(?P<key1>.+)><(?P<key2>.+)>.+</(?P=key2)></(?P=key1)>"
In [25]: b = "<html><h1>Toyz</h1></html>"
In [26]: re.match(a, b)
Out[26]: <re.Match object; span=(0, 26), match='<html><h1>Toyz</h1></html>'>
In [27]: b = "<html><h1>Toyz</h1></ht>"
In [28]: re.match(a, b) # None: the closing tags do not pair up, so IPython echoes nothing
```
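Named groups also make extraction code easier to read than numeric indices. A sketch of the e-mail example rewritten with names; the group names `user`, `provider` and `tld` are chosen here for illustration:

```python
import re

EMAIL = re.compile(r"(?P<user>\w+)@(?P<provider>gmail|yahoo|hotmail)\.(?P<tld>com)")

m = EMAIL.match("toyz5566@gmail.com")
parts = m.groupdict()  # every named group as a dict

print(parts)  # {'user': 'toyz5566', 'provider': 'gmail', 'tld': 'com'}
```

Named groups are still numbered, so `m.group("user")` and `m.group(1)` return the same text.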
#### Search
```python
# search succeeds if the pattern matches anywhere in the string
In [27]: b = "<html><h1>Toyz</h1></ht>"
In [29]: re.search(r"Toyz", b)
Out[29]: <re.Match object; span=(10, 14), match='Toyz'>
# ^ makes a visible difference here
In [30]: re.search(r"^Toyz", b) # None: b does not start with Toyz
In [31]: b = "Toyz</h1></ht>"
In [33]: re.search(r"^Toyz", b)
Out[33]: <re.Match object; span=(0, 4), match='Toyz'>
```
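The three entry points differ only in where the match may start and end. A side-by-side sketch on the same string used above:

```python
import re

b = "<html><h1>Toyz</h1></ht>"

at_start = re.match(r"Toyz", b)      # must begin at index 0
anywhere = re.search(r"Toyz", b)     # first occurrence, any position
whole = re.fullmatch(r"Toyz", b)     # must cover the entire string

print(at_start)         # None
print(anywhere.span())  # (10, 14)
print(whole)            # None
```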
#### findall
```python
In [34]: b = "Toyz</h1></ht>Toyz"
# search stops at the first match
In [35]: re.search(r"Toyz", b)
Out[35]: <re.Match object; span=(0, 4), match='Toyz'>
# findall scans the whole string and returns every match
In [36]: re.findall(r"Toyz", b)
Out[36]: ['Toyz', 'Toyz']
```
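When the pattern contains groups, `findall` returns the groups instead of the whole match, and `finditer` yields match objects when positions are needed. A short sketch with sample text of the same shape as the `sub` examples:

```python
import re

text = "apple=1000, banana=20"

# With two groups, findall returns a list of 2-tuples.
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer gives access to spans and other match details.
spans = [m.span() for m in re.finditer(r"\d+", text)]

print(pairs)  # [('apple', '1000'), ('banana', '20')]
print(spans)  # [(6, 10), (19, 21)]
```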
#### sub: match and replace
```python
# re.sub(pattern, replacement, string_to_search)
In [37]: re.sub(r"wet", "dream", "wet abc apple wet")
Out[37]: 'dream abc apple dream'
In [38]: re.sub(r"\d+", "50", "apple=1000, banana=20")
Out[38]: 'apple=50, banana=50'
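# Real-world need: change each match according to what was found
# The second argument can also be a function!
# Concept
In [40]: def replace(ret):
   ...:     print(ret.group())
   ...:     return "50"
# replace is called twice, once per match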
In [41]: re.sub(r"\d+", replace, "apple=1000, banana=20")
1000
20
Out[41]: 'apple=50, banana=50'
# Finished version
In [42]: def replace(ret):
   ...:     r = int(ret.group()) + 50  # ret.group() returns a str, so convert it
   ...:     return str(r)  # the replacement must be a str again
   ...:
In [43]: re.sub(r"\d+", replace, "apple=1000, banana=20")
Out[43]: 'apple=1050, banana=70'
# Exercise: strip all the tags
In [52]: s = """<div>
...: <p>岗位职责:</p>
...: <p>完成推荐算法、数据统计、接口、后台等服务器端相关工作</p>
...: <p><br></p>
...: <p>必备要求:</p>
...: <p>良好的自我驱动力和职业素养,工作积极主动、结果导向</p>
...: <p> <br></p>
...: <p>技术要求:</p>
...: <p>1、一年以上 Python 开发经验,掌握面向对象分析和设计,了解设计模式</p>
...: <p>2、掌握HTTP协议,熟悉MVC、MVVM等概念以及相关WEB开发框架</p>
...: <p>3、掌握关系数据库开发设计,掌握 SQL,熟练使用 MySQL/PostgreSQL 中的一种<br></
...: p>
...: <p>4、掌握NoSQL、MQ,熟练使用对应技术解决方案</p>
...: <p>5、熟悉 Javascript/CSS/HTML5,JQuery、React、Vue.js</p>
...: <p> <br></p>
...: <p>加分项:</p>
...: <p>大数据,数理统计,机器学习,sklearn,高性能,大并发。</p>
...:
...: </div>"""
# Reasoning:
# the text contains <p>, </p>, <div>, </div> and <br>,
# each made of word characters inside <> > <\w+>
# and some also have a / > </?\w+>
In [55]: re.sub(r"</?\w+>", "", s)
Out[55]: '\n 岗位职责:\n完成...'
```
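The replacement function can use groups to decide what to do per match, and a replacement string can refer to groups with `\1`, `\2`. A sketch under the assumption that each item gets its own increase; the `RAISES` table and `bump` function are hypothetical:

```python
import re

RAISES = {"apple": 50, "banana": 5}  # hypothetical per-item increases


def bump(match):
    """Rebuild 'name=price' with the price raised according to RAISES."""
    name, price = match.group(1), int(match.group(2))
    return f"{name}={price + RAISES.get(name, 0)}"


updated = re.sub(r"(\w+)=(\d+)", bump, "apple=1000, banana=20")
swapped = re.sub(r"(\w+)=(\d+)", r"\2=\1", "apple=1000, banana=20")

print(updated)  # apple=1050, banana=25
print(swapped)  # 1000=apple, 20=banana
```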
#### split: split on matches
```python
In [59]: s = "apple,bana,cat,dog:daGG-haha"
In [60]: re.split(r",|:|-", s)
Out[60]: ['apple', 'bana', 'cat', 'dog', 'daGG', 'haha']
In [61]: s = "applebanana"
In [62]: re.split("a", s)
Out[62]: ['', 'ppleb', 'n', 'n', '']
```
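Single-character alternatives such as `,|:|-` can also be written as a character set, and `re.split` accepts `maxsplit` and keeps delimiters when they are captured. A brief sketch using the same sample string:

```python
import re

s = "apple,bana,cat,dog:daGG-haha"

fields = re.split(r"[,:-]", s)                # same result as r",|:|-"
first_two = re.split(r"[,:-]", s, maxsplit=1) # split at most once
with_delims = re.split(r"([,:-])", "a,b:c")   # a capturing group keeps the separators

print(fields)       # ['apple', 'bana', 'cat', 'dog', 'daGG', 'haha']
print(first_two)    # ['apple', 'bana,cat,dog:daGG-haha']
print(with_delims)  # ['a', ',', 'b', ':', 'c']
```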
#### Greedy > match as much as possible while still letting the rest of the pattern match
```python
# Case 1: the tag-stripping exercise from before
In [63]: s = """<div> ...</div>"""
In [65]: re.sub(r"<.+>", "", s)
Out[65]: '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n '
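# We wanted to remove only what is inside each <>, but on every line the regex
# treats the outermost < and > as one single match,
# e.g. in "<p>必备要求:</p>" the whole "p>必备要求:</p" counts as content and is replaced too
# + is greedy: it keeps consuming as long as it can, so it does not stop at the first >;
# it runs to the end of the line (. does not match \n) and backtracks to the last >
# Case 2: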
In [68]: s = "my phone number: 09-87-654321"
In [74]: r = re.match(r".+(\d+-\d+-\d+)", s)
In [76]: r.group(1)
Out[76]: '9-87-654321' # the leading 0 is missing
In [77]: r = re.match(r"(.+)(\d+-\d+-\d+)", s)
In [79]: r.groups()
Out[79]: ('my phone number: 0', '9-87-654321') # the 0 went into group 1
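# .+ is greedy and \d+ needs only one digit, so group 1 takes everything except the single digit it must leave for group 2
# Non-greedy > match as little as possible while still letting the rest of the pattern match
#
# A trailing ? turns off the greediness of +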
In [80]: r = re.match(r"(.+?)(\d+-\d+-\d+)", s)
In [81]: r.groups()
Out[81]: ('my phone number: ', '09-87-654321')
In [82]: a = "apple5566gg"
In [87]: re.match(r"apple(\d+)", a).group(1)
Out[87]: '5566'
In [88]: re.match(r"apple(\d+?)", a).group(1)
Out[88]: '5' # minimal match
In [89]: re.match(r"apple(\d+)gg", a).group(1)
Out[89]: '5566'
In [90]: re.match(r"apple(\d+?)gg", a).group(1)
Out[90]: '5566' # has to extend until gg can match
# Exercise: extract the image URL
In [99]: s = """<img data-original="https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973
...: _201611131917_small.jpg" src="https://rpic.douyucdn.cn/appCovers/2016/11/13/1213
...: 973_201611131917_small.jpg" style="display: inline;">"""
In [102]: re.search(r"https.+?\.jpg", s).group()
Out[102]: 'https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973_201611131917_small.jpg'
```
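Applied back to case 1, the non-greedy quantifier fixes the tag stripping. A sketch on a shortened, made-up HTML snippet; the negated character set `[^>]+` is shown as a common alternative that avoids the greedy/lazy question altogether:

```python
import re

html = "<div><p>Requirements:</p><p><br></p></div>"

greedy = re.sub(r"<.+>", "", html)      # one match from the first < to the last >
lazy = re.sub(r"<.+?>", "", html)       # each tag is matched separately
negated = re.sub(r"<[^>]+>", "", html)  # same result without a lazy quantifier

print(repr(greedy))  # ''
print(lazy)          # Requirements:
```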
#### Exercises
```python
# Exercise 1:
#
# Turn
# http://www.interoem.com/messageinfo.asp?id=35
# http://3995503.com/class/class09/news_show.asp?id=14
# http://lib.wzmc.edu.cn/news/onews.asp?id=769
# http://www.zy-ls.com/alfx.asp?newsid=377&id=6
# http://www.fincm.com/newslist.asp?id=415
#
# into
# http://www.interoem.com/
# http://3995503.com/
# http://lib.wzmc.edu.cn/
# http://www.zy-ls.com/
# http://www.fincm.com/
#
# Reasoning:
# describing the part to cut is hard, so capture the part to keep,
# match the whole string and replace it with the captured part
In [124]: s
Out[124]: 'http://www.interoem.com/messageinfo.asp?id=35'
In [127]: re.sub(r"(http://.+?/).+", lambda x:x.group(1), s)
Out[127]: 'http://www.interoem.com/'
# Exercise 2:
# hello world ha ha: find all the words
In [134]: s = "hello world ha ha"
In [135]: re.split(r" ", s)
Out[135]: ['hello', 'world', 'ha', 'ha']
In [137]: re.findall(r"\b[a-z]+\b", s)
Out[137]: ['hello', 'world', 'ha', 'ha']
```
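Exercise 1 can also be solved by matching only the part to keep, without `sub`. A minimal sketch over all five URLs from the exercise, assuming every URL starts with `http://` and has a `/` after the host; `ROOT` and `roots` are names introduced here:

```python
import re

urls = [
    "http://www.interoem.com/messageinfo.asp?id=35",
    "http://3995503.com/class/class09/news_show.asp?id=14",
    "http://lib.wzmc.edu.cn/news/onews.asp?id=769",
    "http://www.zy-ls.com/alfx.asp?newsid=377&id=6",
    "http://www.fincm.com/newslist.asp?id=415",
]

# Scheme, then everything up to and including the first "/" after the host.
ROOT = re.compile(r"http://[^/]+/")

roots = [ROOT.match(url).group() for url in urls]
for root in roots:
    print(root)
```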