---
tags: Scrapy Class
---
:::info
### Python Class
- Python's `self` plays the same role as the `this` pointer in C++ or Java, but `self` is always explicit in both the headers and the bodies of Python methods, which makes attribute access more obvious.
- If you want to define a private member in a Python class, prefix its name with a double underscore (`__`).
:::
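The two points above can be sketched as follows. The `Account` class and its attribute names are made up for illustration. Note that the double underscore does not truly hide the member: Python renames it to `_ClassName__member`, so it is only protected from accidental access.

```python
class Account:  # hypothetical example class
    def __init__(self, owner, balance):
        self.owner = owner          # public attribute
        self.__balance = balance    # "private": stored as _Account__balance

    def deposit(self, amount):
        # self must be written explicitly to reach instance attributes
        self.__balance += amount
        return self.__balance


acct = Account("alice", 100)
new_balance = acct.deposit(50)

# Direct access from outside the class fails because of name mangling
try:
    acct.__balance
    hidden = False
except AttributeError:
    hidden = True

# The mangled name is still reachable, so this is a convention, not security
mangled = acct._Account__balance
print(new_balance, hidden, mangled)
```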
# Basic HTML Parsing
- Fetch the HTML string from a URL
```python=
from urllib.request import urlopen
url = "http://ncnu.ipv6.club.tw/~ncnu/page1.html"
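html = urlopen(url)
page = html.read()           # raw bytes of the response body
print(page)                  # prints a bytes literal (b'...')
print(page.decode('utf-8'))  # decodes the bytes into a readable str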
```
## Parse an HTML Page
```python=
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://ncnu.ipv6.club.tw/~ncnu/page1.html"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
print(bsObj.findAll("div"))
# You can conveniently access the h1 tag by
# bsObj.h1
# or by any of the following
# bsObj.html.body.h1
# bsObj.body.h1
# bsObj.html.h1
```
- Exception Handling
```python=
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
url = input("Please input a URL -- ")
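try:
    html = urlopen(url)
except HTTPError as e:
    # the server answered with an error status (e.g. 404 or 500)
    print(e)
except URLError as e:
    # the server could not be reached at all
    print("The server could not be found!")
else:
    # program continues if no exception occurs
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj.h1)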
```
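As a sketch of how this pattern can be reused, the error handling could be wrapped in a helper that returns `None` on failure. The name `fetch_html` is hypothetical, and the example assumes the page is UTF-8 encoded. `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url):
    """Return the page body as text, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read().decode("utf-8")
    except HTTPError as e:
        # must come before URLError because HTTPError is its subclass
        print("HTTP error:", e.code)
    except URLError as e:
        print("Could not reach the server:", e.reason)
    return None


# A file:// URL that does not exist raises URLError without using the network
result = fetch_html("file:///no/such/file.html")
print(result)
```

The caller then only needs to check `if result is None` before handing the text to `BeautifulSoup`.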
## Parsing with tag
```python=
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.pythonscraping.com/pages/warandpeace.html"
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")
# findAll(tags, {attribute: value}) or findAll(tags, attributes, text, limit, keywords)
nameList = bsObj.findAll("span", {"class": "green"})
# ex. bsObj.findAll(["h1", "h2", "h3"])
# ex. bsObj.findAll(text = "the prince")
# ex. bsObj.findAll("h1", limit = 2)
for name in nameList:
    print(name.get_text())
```
## Parents and Siblings
```python
bsObj.find("img", {"src": "../img/img1.jpg"}).parent.previous_sibling
```
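The one-liner above is easier to follow step by step. The sketch below uses a small made-up HTML snippet instead of a real page, and the variable names are only for illustration. The tags are written without whitespace between them on purpose: if there were a newline between the two `<td>` cells, `previous_sibling` would return that whitespace text node instead of the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical table row: a price cell followed by an image cell
html = (
    "<table><tr>"
    "<td>$15.00</td>"
    '<td><img src="../img/img1.jpg"></td>'
    "</tr></table>"
)
bsObj = BeautifulSoup(html, "html.parser")

img = bsObj.find("img", {"src": "../img/img1.jpg"})
cell = img.parent                    # the <td> that wraps the image
price_cell = cell.previous_sibling   # the <td> just before it
price = price_cell.get_text()
print(price)
```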
## Regular Expression
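| Symbol | Meaning | Example | Example Matches |
|:--------:|:--------:|:--------:|:--------:|
| `*` | Preceding character or group repeated zero or more times | `a*b*` | aaa, aabbb, bb |
| `+` | Preceding character or group repeated one or more times | `a+b+` | aaaab, aaabb, abbbb |
| `.` | Any single character | `b.d` | bad, bzd, b=d |
| `[]` | Any one character listed inside the brackets | `[A-Z]` | A, P, L |
| `[^]` | Any one character not listed inside the brackets | `[^A-Z]` | :, +, a |
| `()` | Groups the enclosed expression into one unit | `(a*b)*` | aaabaab, abaaab, ababaaab |
| `\|` | Or: matches either alternative | `b(a\|i\|e)d` | bad, bid, bed |
| `^` | Match must start at the beginning of the string | `^a` | apple, asdf, a |
| `$` | Match must end at the end of the string | `[A-Z]*[a-z]*$` | ABCabc, zzzyx, Bob |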
```python=
import re
pattern = "aa*bb(cc)*(d|e)"
myStr = ["aaaabbbccccdd", 'aabbbbbbcce', "aaaaaaabbccd"]
for oneStr in myStr:
    # re.match checks for a match starting at the beginning of the string
    if re.match(pattern, oneStr):
        print(True)
    else:
        print(False)
```
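One detail worth knowing about the example above: `re.match` only requires the pattern to match at the start of the string, so trailing characters are ignored. A minimal sketch of the difference, using the same pattern and a made-up test string, with `re.fullmatch` for the case where the whole string must match:

```python
import re

pattern = "aa*bb(cc)*(d|e)"
text = "aabbccdXYZ"   # valid prefix followed by extra characters

# re.match succeeds because the string starts with a valid match
prefix_ok = re.match(pattern, text) is not None
# re.fullmatch fails because "XYZ" is left over
whole_ok = re.fullmatch(pattern, text) is not None
# group(0) returns the part of the string that actually matched
matched = re.match(pattern, text).group(0)

print(prefix_ok, whole_ok, matched)
```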
### Regular Expressions in BeautifulSoup
```python
import re
# raw string (r"...") keeps the backslashes intact for the regex engine
bsObj.findAll("img", {"src": re.compile(r"\.\./img/img[1-6]\.jpg")})
```
## Lambda Function
### sorted
```python=
mylist = ["a", "b22", "c333", "d4444"]
print(sorted(mylist, reverse = True))
def foo(s):
    # sort key: the length of each string
    return len(s)
print(sorted(mylist, key = foo))
```
### sorted with lambda
- With lambda, you don't have to name a function.
```python
sorted(mylist, key = lambda s: len(s))
```
- `reverse = True` sorts in descending order; `reverse = False` sorts in ascending order.
```python
sorted(mylist, key = lambda s: len(s), reverse = True)
```
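A lambda key can also return a tuple when ties need a second sort criterion. The sketch below uses a made-up word list: the first call sorts by length and then alphabetically, the second sorts by length only in descending order. Because `sorted` is stable, words of equal length keep their original relative order in the second result.

```python
words = ["pear", "fig", "apple", "kiwi", "plum"]

# Sort by length first, then alphabetically among equal lengths
by_len_then_alpha = sorted(words, key=lambda s: (len(s), s))

# Longest first; equal-length words stay in their original order
longest_first = sorted(words, key=lambda s: len(s), reverse=True)

print(by_len_then_alpha)
print(longest_first)
```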
### Lambda Expression
```python
bsObj.findAll(lambda tag: len(tag.attrs) == 2)
# matches every tag that has exactly two attributes
```
- This will find tags such as
- `<div class="body" id="content"></div>`
- `<span style="color:red" class="title"></span>`
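To check this behaviour without fetching a page, the two tags above can be placed in a small made-up HTML string together with a one-attribute `<p>` tag that should not match. The sketch uses `find_all`, which is the current spelling of `findAll` in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Two tags with two attributes each, plus one tag with a single attribute
html = (
    '<div class="body" id="content"></div>'
    '<span style="color:red" class="title"></span>'
    '<p class="note"></p>'
)
bsObj = BeautifulSoup(html, "html.parser")

# The lambda receives each tag and returns True for the ones to keep
two_attr_tags = bsObj.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]
print(names)
```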