Basic HTML Parsing

---
tags: Scrapy Class
---

:::info
### Python Class
- Python's `self` plays the same role as the `this` pointer in C++ or Java, but it is always written explicitly in both the method signature and the method body, which makes attribute access more obvious.
- To define a private member in a Python class, prefix its name with a double underscore (`__`).
:::

# Basic HTML Parsing
- Retrieve the HTML of a URL as a string

```python=
from urllib.request import urlopen

url = "http://ncnu.ipv6.club.tw/~ncnu/page1.html"
html = urlopen(url)
page = html.read()           # raw response body as bytes
print(page)
print(page.decode('utf-8'))  # the same content decoded into a str
```

## Parse an HTML Document
```python=
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://ncnu.ipv6.club.tw/~ncnu/page1.html"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
print(bsObj.findAll("div"))
# You can conveniently access the h1 tag by
#   bsObj.h1
# or by any of the following
#   bsObj.html.body.h1
#   bsObj.body.h1
#   bsObj.html.h1
```

- Exception Handling

```python=
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

url = input("Please input a URL -- ")
try:
    html = urlopen(url)
except HTTPError as e:
    # the server answered with an error status (e.g. 404 or 500)
    print(e)
except URLError as e:
    print("The server could not be found!")
else:
    # program continues if no exception occurs
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj.h1)
```

## Parsing with Tags
```python=
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/warandpeace.html"
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class": "green"})
# findAll(tag, {attribute: value})
# or findAll(tag, attributes, recursive, text, limit, keywords)
# ex. bsObj.findAll(["h1", "h2", "h3"])
# ex. bsObj.findAll(text = "the prince")
# ex. bsObj.findAll("h1", limit = 2)
for name in nameList:
    print(name.get_text())
```

## Parents and Siblings
```python
# bsObj is a BeautifulSoup object for a page that contains this image
bsObj.find("img", {"src": "../img/img1.jpg"}).parent.previous_sibling
```

## Regular Expression

| Symbol | Meaning | Example | Example Matches |
|:--------:|:--------:|:--------:|:--------:|
| `*` | The preceding character repeated zero or more times | `a*b*` | aaa, aabbb, bb |
| `+` | The preceding character repeated one or more times | `a+b+` | aaaab, aaabb, abbbb |
| `.` | Any single character | `b.d` | bad, bzd, b=d |
| `[]` | Any one character listed inside the brackets | `[A-Z]` | A, P, L |
| `[^]` | Any one character not listed inside the brackets | `[^A-Z]` | :, +, a |
| `()` | Groups the enclosed expression into one unit | `(a*b)*` | aaabaab, abaaab, ababaaab |
| `\|` | Or: matches either alternative | `b(a\|i\|e)d` | bad, bid, bed |
| `^` | Starts with | `^a` | apple, asdf, a |
| `$` | Ends with | `[A-Z]*[a-z]*$` | ABCabc, zzzyx, Bob |

```python=
import re

pattern = "aa*bb(cc)*(d|e)"
myStr = ["aaaabbbccccdd", 'aabbbbbbcce', "aaaaaaabbccd"]
for oneStr in myStr:
    # re.match only tries to match at the beginning of the string
    if re.match(pattern, oneStr):
        print(True)
    else:
        print(False)
# prints False, False, True
```

### Regular Expressions in BeautifulSoup
```python
import re

# a raw string keeps the backslashes intact for the regex engine
bsObj.findAll("img", {"src": re.compile(r"\.\./img/img[1-6]\.jpg")})
```

## Lambda Function
### sorted
```python=
mylist = ["a", "b22", "c333", "d4444"]
print(sorted(mylist, reverse = True))

def foo(s):
    return len(s)

print(sorted(mylist, key = foo))  # sort by string length
```

### sorted with lambda
- With a lambda, you don't have to define a named function.

```python
sorted(mylist, key = lambda s: len(s))
```

- `reverse = True` sorts in descending order; `reverse = False` (the default) sorts in ascending order.

```python
sorted(mylist, key = lambda s: len(s), reverse = True)
```

### Lambda Expression
```python
bsObj.findAll(lambda tag: len(tag.attrs) == 2)  # attribute count is 2
```

- This will find tags such as
    - `<div class="body" id="content"></div>`
    - `<span style="color:red" class="title"></span>`
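
The lambda filter above can be tried without a network connection. The following is a minimal sketch, assuming a made-up HTML snippet (`sample_html`) and illustrative variable names (`two_attr_tags`, `tag_names`) that are not part of the original notes. It uses `find_all`, which is the current spelling of `findAll` in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch only.
sample_html = """
<html><body>
<div class="body" id="content">two attributes</div>
<span style="color:red" class="title">two attributes</span>
<p class="intro">one attribute</p>
<h1>no attributes</h1>
</body></html>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

# The lambda receives each tag in turn and returns True for the tags to keep.
two_attr_tags = bsObj.find_all(lambda tag: len(tag.attrs) == 2)

tag_names = [tag.name for tag in two_attr_tags]
print(tag_names)
```

Only the `div` and `span` tags carry exactly two attributes, so they are the only ones returned; the `p` tag has one attribute, and `h1`, `html`, and `body` have none.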