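# Python Web Crawler: HTML

###### tags: `python`

## Basic workflow

1. Connect to a specific URL and fetch the data.
2. Parse the data and extract the part you actually want.

### Key to fetching data: make the program imitate an ordinary user's behavior

### Parsing data (JSON is covered in a separate note)

Format: HTML

### Parsing HTML with the third-party package BeautifulSoup

Install: `pip install beautifulsoup4`

### Example: scraping the PTT movie board with `request` from `urllib`

[PTT movie board](https://www.ptt.cc/bbs/movie/index.html)

#### Fetching the data

```python
import urllib.request as req

# Connect to the page
url = "https://www.ptt.cc/bbs/movie/index.html"
with req.urlopen(url) as response:
    data = response.read().decode("utf-8")
print(data)
```

> PTT rejects this connection. Open the browser's developer tools, switch to the Network tab, click the `index.html` request to see its details, and find the `User-Agent` request header.

```python
import urllib.request as req

# Connect to the page
url = "https://www.ptt.cc/bbs/movie/index.html"
# Build a Request object and attach the request headers to it
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
})
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")
print(data)
```

#### Parsing the data (HTML source) with the third-party package BeautifulSoup

> bs4.BeautifulSoup(source to parse, "parser")
>> The parser can be: html.parser
>> [Parser reference](https://blog.csdn.net/Winterto1990/article/details/47806175)

| Parser | Description |
| ----------- | ---- |
| html.parser | Part of the Python standard library, so nothing extra to install. Parses HTML. |
| lxml | Written in C; fast and widely supported. Parses HTML. |
| lxml-xml | Written in C; fast and widely supported. Parses XML. |
| html5lib | Parses pages the same way a browser does, so compatibility is good; the drawback is that it is slow. |

Installing the parsers:

- `pip install lxml` (this one package provides both `lxml` and `lxml-xml`; there is no separate `lxml-xml` package to install)
- `pip install html5lib`

> find("tag", class_="class name") finds the first match
> find_all("tag", class_="class name") finds every match
> .string returns the text inside a tag

```python
import bs4

# `data` is the HTML text fetched in the previous step
root = bs4.BeautifulSoup(data, "html.parser")  # parse the HTML source into a tree
title_tag = root.title          # the <title> tag
title_text = root.title.string  # the text inside the <title> tag
titles = root.find_all("div", class_="title")  # every <div class="title">
for title in titles:
    # Deleted posts have no <a> tag, so print only the titles that still have one
    if title.a is not None:
        print(title.a.string)
```

### Scraping 104 with requests

```python
import requests
from bs4 import BeautifulSoup

# The "lxml" parser needs the lxml package installed (pip install lxml)
url = (
    "https://www.104.com.tw/jobs/search/?ro=0&keyword=Python&jobcatExpansionType=0"
    "&area=6001001005&order=15&asc=0&page=1&mode=s&jobsource=2018indexpoc"
)
dom = requests.get(url).text
soup = BeautifulSoup(dom, "lxml")
# A class_ value with several classes matches only when the tag's class attribute is exactly this string
jobs = soup.find_all("article", class_="b-block--top-bord job-list-item b-clearfix js-job-item")
# print(jobs[0].find("a", class_="js-job-link").text)
for job in jobs:
    print(job.find("a", class_="js-job-link").text)  # text of the job link
    print(job.get("data-cust-name"))                 # value of the data-cust-name attribute
    print(job.find("p").text)
    print(job.find("span", class_="b-tag--default").text)
    print("----------------------------------------")
```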
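
Going back to the PTT example, the "fetch, get rejected, retry with a User-Agent" step can be wrapped in a small helper so that a rejected request does not crash the script. The following is only a minimal sketch using the standard library: the names `build_request`, `fetch_html`, and `DEFAULT_USER_AGENT`, as well as the 10-second timeout, are hypothetical choices for illustration and are not part of the original note.

```python
import urllib.error
import urllib.request as req

# Assumption: the same browser User-Agent string used in the PTT example.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.83 Safari/537.36"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return req.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text, or None if the request fails."""
    try:
        with req.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. when it rejects the client.
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:
        # No usable response at all: DNS failure, refused connection, and so on.
        print(f"Could not reach {url}: {err.reason}")
    return None
```

Calling `fetch_html("https://www.ptt.cc/bbs/movie/index.html")` would then return the page text that is passed to `bs4.BeautifulSoup`, or `None` when the request fails. `HTTPError` is caught before `URLError` because it is a subclass of `URLError`.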