
###### tags: `YTP Co-learning Notes`

# Web Crawler

## Environment Setup

Run these in a terminal:

```shell
pip install beautifulsoup4
pip install requests
```

## Features

### GET

Fetch a page's content:

```python
import requests

res = requests.get("https://www.books.com.tw/?gclid=Cj0KCQjwjvaYBhDlARIsAO8PkE3xss8aBx9yAj2ywe8dnndF7umj3f9renw2DwDp6TeGnutp-HMEjCIaAhC2EALw_wcB")
print(res.text)  # raw HTML of the response
```

Find specific content in a page:

```python
from bs4 import BeautifulSoup

html_sample = """
<html>
  <body>
    <h1 id="title">Hello world</h1>
    <a href="#" class="link">This is link1</a>
    <a href="# link2" class="link">This is link2</a>
  </body>
</html>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_sample, "html.parser")
print(soup.select("html")[0])  # select() returns a list of matching elements
```

```python
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.books.com.tw/?gclid=Cj0KCQjwjvaYBhDlARIsAO8PkE3xss8aBx9yAj2ywe8dnndF7umj3f9renw2DwDp6TeGnutp-HMEjCIaAhC2EALw_wcB")
soup = BeautifulSoup(res.text, "html.parser")

# Print the text of the first <strong> inside every element with class "item".
for item in soup.select(".item"):
    strong = item.select_one("strong")
    if strong is not None:
        print(strong.text)
```

```python
# books: inspect a product page and download its cover image
import shutil

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.books.com.tw/products/0010610509?sloc=main")
soup = BeautifulSoup(res.text, "html.parser")

print(soup.prettify())  # the whole document, indented

tmp = soup.find("meta")  # first <meta> tag
print(tmp.find_parent("head"))  # nearest enclosing <head>
print([p.name for p in tmp.find_parents()])  # names of all ancestors
print(soup.find_all("meta"))  # every <meta> tag
print(soup.find_all(["meta", "title"]))  # a list matches any of the listed tags
print(soup.find_all("a", class_="trace_box"))  # <a> tags with class "trace_box"

for head in soup.select("head"):
    print(head.text)
print(soup.find("title").text)

# stream=True keeps the body unread so that image.raw can be copied to a file.
image = requests.get("https://www.books.com.tw/img/001/061/05/0010610509.jpg", stream=True)
image.raw.decode_content = True  # undo gzip/deflate transfer encoding if present
with open("image.jpg", "wb") as f:
    shutil.copyfileobj(image.raw, f)

link = soup.select_one("a")  # first <a> in the page
if link is not None:
    print(link.get("href"))
    print(link.get_text())
```

### POST

```python
import requests

# Placeholder form fields; replace them with the parameters the target expects.
payload = {
    'parameter1': 'your parameter1',
    'parameter2': 'your parameter2',
    # ...
    'parameterN': 'your parameterN',
}

# 'the request url' is a placeholder for the real URL.
res = requests.post('the request url', data=payload)
print(res.text)
```
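To see roughly what a `data=payload` dictionary turns into on the wire, the sketch below form-encodes a payload with the standard library and builds an equivalent POST request without sending it. It relies on a dict passed as `data` being sent as `application/x-www-form-urlencoded`; the URL `https://example.com/form` and the two fields are hypothetical stand-ins, not values from these notes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, used only for illustration.
payload = {
    "parameter1": "your parameter1",
    "parameter2": "your parameter2",
}
url = "https://example.com/form"

# Form encoding joins key=value pairs with "&" and writes spaces as "+".
body = urlencode(payload)
print(body)  # parameter1=your+parameter1&parameter2=your+parameter2

# Build (but do not send) the equivalent POST request.
req = Request(
    url,
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)
print(req.get_method(), req.full_url)  # POST https://example.com/form
```

Printing the encoded body this way can help when comparing a script's request against the form submission shown in the browser's developer tools.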