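# 5/28 Python and Data Analysis #10 - Web Scraping

## Introduction to Web Scraping

Web scraping means retrieving data from the web programmatically. The packages used in this note:

* `import requests`: sends HTTP requests
* `import json`: parses JSON data
* `import xml.etree.ElementTree as ET`: parses XML data
* `from bs4 import BeautifulSoup`: parses HTML data

### Tasks

* Transferring data: use the HTTP protocol so that the program imitates a web browser fetching data.
    * Outbound: the request. There are two methods, GET and POST. POST additionally carries form data and can change content on the server, for example submitting a search or posting a comment.
    * Return: the response.
* Parsing data

### Developer Tools

Open them from the browser menu: More tools > Developer tools.

The Network tab is the one related to data transfer; reload the page to see the details. Each entry in the list is one complete request-response pair.

* Request = Headers (some parameters) + Method (for example GET/POST)
* Response = Headers (some parameters) + Body (the data)

The data we look for sits under two of the filters in the developer tools:

* XHR (XMLHttpRequest): content that disappears from the page once JavaScript is turned off
* Doc (HTML documents): content that stays on the page once JavaScript is turned off

The remaining filters are JS, CSS, Img, Media, Font, WS, Manifest, and Other.

In Chrome, JavaScript can be toggled with Quick JavaScript Switcher: https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje

JavaScript can deliver (render) the page content; otherwise the content is generated by a back-end language or hard-coded in the page.

## Structure of a Request/Response Entry

* Headers
    * General: summary
    * Response Headers
    * Request Headers
    * Query String Parameters (if any)
    * Form Data (if any)
* Preview (response body rendered in browser)
* Response (body)
* Cookies (if any)

Preview and Response show the actual data, so click into them to check whether they contain what you want.

## The Requests Package

Requests is the link between Python and the server.

https://requests.readthedocs.io/en/master/

Install it with `pip install requests`, then `import requests`.

```
import requests

print(requests.__version__)
print(requests.__file__)
```

Output: `2.23.0` and `/Users/kuoyaojen/pyda/lib/python3.6/site-packages/requests/__init__.py`

Choose between the two methods below according to what Headers > General > Request Method says in the developer tools.

### GET method

`requests.get("request_url", params={query_str_params})`

`request_url` and `query_str_params` can both be found under Headers (everything after the `?` in the URL is the query string parameters). `query_str_params` is a dict: put in whatever is listed under Query String Parameters. The call returns a response object whose content matches what Preview shows in the developer tools.

```
request_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results"
query_str_params = {
    'q': 'macbook',
    'page': 1,
    'sort': 'rnk/dc'
}
response = requests.get(request_url, params=query_str_params)
print(response.text)
```

### POST method

`requests.post("request_url", data={form_data})`

`request_url` and `form_data` can both be found under Headers (the form fields are listed under Form Data). The response content matches what Preview shows in the developer tools.

```
request_url = "https://emap.pcsc.com.tw/EMapSDK.aspx"
form_data = {
    'commandid': 'SearchStore',
    'city': '台北市',
    'town': '大安區',
    'roadname': '羅斯福路四段'
}
response = requests.post(request_url, data=form_data)
print(response.text)
```

Note: the response object is an instance of the `Response` class.

### Attributes of the Response Object

Here `response` stands for whatever name you gave the response object.

* `response.text`: the raw text content (as a `str`)
* `response.status_code`: the status code (also shown under Headers)
* `response.json()`: see below

### Parsing Data

First check the format of the data (in Preview or Response); each format has its own characteristics.

* JSON format: built from dicts and lists
    * `response.json()`: if the data starts with `{}` it is read as a dict, if it starts with `[]` it is read as a list
    * `json.loads(response.text)` also works. (Unlike the `json.load(f)` used earlier, `loads` means "load string", while `load` reads from a JSON file.)
* XML format: a markup language whose tags you define yourself
    * The beginning of the text usually contains the word `xml`.
    * `<....>` and `</....>` form a pair. There are many nested levels of (self-defined) opening and closing tags, forming a tree structure.
    * Parse it into a structure with the built-in ElementTree module: `import xml.etree.ElementTree as ET`, then `root = ET.fromstring(response.text)`. The result is of class `Element`.
    * To extract a particular tag, use XPath syntax: `texts = [e.text for e in root.findall(".//tag_name")]`
* HTML format: unlike XML, the tags are predefined
    * `response.text` returns a `str`.
    * Parse it with BeautifulSoup.

```
from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
print(type(soup))
```

Output: `<class 'bs4.BeautifulSoup'>`

Use CSS selectors to pick out the tags that hold the data: https://developer.mozilla.org/en-US/docs/Glossary/CSS_Selector

A CSS selector can be mixed and matched with:

* Tag names, e.g. `a` (the tag name)
* Class attribute in tags, e.g. `.poster`; used when different pieces of data share the same format
* Id attribute in tags, e.g. `#title-overview-widget`; used for a one-off piece of data

A Chrome browser plug-in to help us find the specific CSS selector of element(s): https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb

```
request_url = "https://www.imdb.com/title/tt10048342"
response = requests.get(request_url)
# Parse the IMDb page before selecting from it
soup = BeautifulSoup(response.text, "html.parser")
```

```
# The CSS Selector for title
title = soup.select('h1')[0].text.strip()
print(title)
```

Output: `后翼棄兵` (the localized title of The Queen's Gambit)

```
# The CSS Selector for rating
rating = float(soup.select('.ipc-button__text span')[0].text)
print(rating)
```

Output: `8.6`

###### tags: `python` `data analysis`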
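
As an appendix to the JSON and XML parsing notes above, here is a minimal offline sketch that can be run without sending any request. The two response bodies (`json_text`, `xml_text`) and their field and tag names are made up for illustration and stand in for `response.text`; they are not the real PChome or 7-11 payloads.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response bodies, standing in for response.text
json_text = (
    '{"totalRows": 2, "prods": ['
    '{"name": "MacBook Air", "price": 30900}, '
    '{"name": "MacBook Pro", "price": 42900}]}'
)
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<stores>
  <store><name>Store A</name><road>Roosevelt Rd</road></store>
  <store><name>Store B</name><road>Xinhai Rd</road></store>
</stores>"""

# JSON: the text starts with "{", so json.loads returns a dict
data = json.loads(json_text)
product_names = [p["name"] for p in data["prods"]]

# XML: fromstring returns the root Element; ".//name" is an XPath
# expression matching every <name> tag at any depth below the root
root = ET.fromstring(xml_text)
store_names = [e.text for e in root.findall(".//name")]

print(type(data), product_names)
print(type(root), store_names)
```

With a real response, replacing `json_text` or `xml_text` with `response.text` should be the only change needed, provided the server actually returns that format.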