# Crawling

---

![](https://hackmd.io/_uploads/ByJVTWg8h.png =30%x)

----

![](https://hackmd.io/_uploads/rJ1Eabl8n.png =30%x)

----

![](https://hackmd.io/_uploads/SyyE6beLh.png =30%x)

----

[A magical link](https://m.facebook.com/story.php?story_fbid=pfbid02QgUKoCxpnbYTyzzFBY2iiXzWdqMyZMiEg3tKGYhUqNWBq8oJEodfzKYKZ1bq3cRyl&id=100000046793737&mibextid=qC1gEa)

---

## What is crawling?

- Motivation: there is too much information to collect by hand (or we simply don't want to)
- Principle: web pages are built from HTML
  1. Case 1: the data is already in the front-end HTML, so we can scrape it there
  2. Case 2: the front end sends a request to the back end, so we send that same request ourselves

----

## Why HTML?

1. HTML: content
2. CSS: appearance
3. JS: interaction

----

## Today's agenda

1. Review HTML
2. requests practice
3. bs4 introduction
4. bs4 practice
5. selenium introduction
6. selenium practice

---

# Begin

----

#### Open it! [source](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

```
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```

----

## HTML review

| tag | attribute | text |
| -------- | -------- | -------- |
| what kind of element this is | extra properties of the element | the visible text |

![](https://hackmd.io/_uploads/Bk_IVflU3.png =50%x)

```
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
```

----

## One more borrowed picture

![](https://hackmd.io/_uploads/S1QgHzeUh.png =50%x)

----

## Install

| requests | beautifulsoup4 | html5lib |
| -------- | -------- | -------- |
| fetches HTML | parses HTML | a slow but strict parser for bs4 |

```
pip3 install beautifulsoup4 requests html5lib
```

```
python3 -m pip install beautifulsoup4 --user
```

---

#### requests example 1

```
import requests

r = requests.get('https://www.python.org')
print(r.status_code)
print(r.content)
print(b'Python is a programming language' in r.content)
```

----

# requests practice

----
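One habit worth forming before the practice: instead of hand-writing a JSON payload string, build it as a Python dict and serialize it with `json.dumps`. A minimal sketch; the keys simply mirror the NEOJ challenge-list payload used in this deck:

```python
import json

# Build the payload as a dict; the keys mirror the NEOJ
# challenge-list payload used elsewhere in these slides.
payload = {
    "offset": 0,
    "filter": {"user_uid": None, "problem_uid": None, "result": None},
    "reverse": True,
    "chal_uid_base": None,
}

# json.dumps produces the JSON text the server expects;
# Python's None / True become JSON's null / true automatically.
body = json.dumps(payload)
print(body)
```

With requests you can also pass `json=payload` to `requests.post` and let it do this serialization for you.

----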
[click](https://neoj.sprout.tw/status/)

1. Open DevTools and find the XHR (XML HTTP Request) entries
2. Under an XHR entry, find:
   1. Headers
   2. Payload
   3. Preview
3. It should look something like the screenshots below

----

![](https://hackmd.io/_uploads/S1TpAflIh.png)

----

![](https://hackmd.io/_uploads/Bkv1JmlIn.png)

----

![](https://hackmd.io/_uploads/Sy9Z17gIh.png)

----

#### Pull down everything the back end returns

```
import requests
import json

url = 'https://neoj.sprout.tw/api/challenge/list'
payload = '{"offset":0,"filter":{"user_uid":null,"problem_uid":null,"result":null},"reverse":true,"chal_uid_base":null}'
status_res = requests.post(url, payload)
# print(type(status_res))
status = json.loads(status_res.text)
# print(status)
```

----

## Recap

1. Import the relevant modules
2. requests can send POST, GET, and more (also HEAD, DELETE, PUT, ...)
3. GET is not supposed to change server state, while POST can (GET can smuggle state through the URL, but that is discouraged)
4. Provide the URL you want to fetch
5. Provide the state (payload)
6. Once you have the response, parse it into a structured format such as JSON
7. Pick out the data you actually want

----

### Practice: names only

----

#### Answer

```
# wrap reversed() in list() so the data can be iterated more than once
status_data = list(reversed(status['data']))
names = []
for x in status_data:
    names.append(x['submitter']['name'])
```

----

### Practice: problem titles only

----

#### Answer

```
problems = []
for x in status_data:
    problems.append(x['problem']['name'])
print(", ".join(problems))
```

----

### Practice: when is a submission an AC?
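----

#### Hint

Print each record's `result` code and match it against the verdict the website displays. A self-contained sketch on mock data; the record shape follows the API responses above, but every name and number here is invented:

```python
# Mock of a challenge-list response; the shape follows the NEOJ
# data used in the previous slides, but the values are invented.
status = {
    'data': [
        {'result': 1, 'submitter': {'name': 'alice'}, 'problem': {'name': 'A + B'}},
        {'result': 4, 'submitter': {'name': 'bob'}, 'problem': {'name': 'A + B'}},
    ]
}

# Print each record's result code next to its submitter, then compare
# the numbers with the verdicts shown on the website.
for record in status['data']:
    print(record['submitter']['name'], '->', record['result'])
```

The same loop works unchanged on the real response once you load it with `json.loads`.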
----

- result : 1

----

### Practice: list the problems you have AC'd

----

#### Answer

```
user_id = 3122  # replace with your user id
profile_res = requests.post('https://neoj.sprout.tw/api/user/' + str(user_id) + '/profile', '{}')
stats_res = requests.post('https://neoj.sprout.tw/api/user/' + str(user_id) + '/statistic', '{}')

# read the received data as JSON
profile = profile_res.json()  # == json.loads(profile_res.text)
stats = stats_res.json()
# print(profile)
# print(stats)

categories = {0: 'Universe', 3: 'Python'}
print('Name:', profile['name'])
print('Class:', categories[profile['category']])
print('Rate:', profile['rate'])

print('Attempted problems:')
tried = list(stats['tried_problems'])  # every attempted problem, regardless of result
print(', '.join(tried))

print('Passed problems:')
passed = []
for x, res in stats['tried_problems'].items():
    if res['result'] == 1:
        passed.append(x)
print(', '.join(passed))
```

----

- requests looks pretty handy, so why bother with bs4 or selenium?

![](https://hackmd.io/_uploads/S1QgHzeUh.png =50%x)

----

#### Answer

- Imagine we want to print a page's subheadings
- When this tree gets complicated, we need a tool that parses it for us

---

# bs4

- A great tool for dissecting the HTML tree

----

# Terminology

1. node
2. child (children): the nodes one level down
3. parent: the node one level up
4. leaf: a node with no children

----

#### Example: scrape basic info from PTT

```
import requests
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/Gossiping/index.html"
cookies = {'over18': '1'}
htmlfile = requests.get(url, cookies=cookies)
doc = BeautifulSoup(htmlfile.text, 'html.parser')
articles = doc.find_all('div', class_='r-ent')
number = 0
for article in articles:
    title = article.find('a')
    author = article.find('div', class_='author')
    date = article.find('div', class_='date')
    number += 1
    print("id:", number)
    print("title:", title.text)
    print("author:", author.text)
    print("time:", date.text)
    print("=" * 10)
```

![](https://hackmd.io/_uploads/H1klV8l8n.png)

----

## bs4 basics

1. BeautifulSoup({the text of the requests response}, {an HTML parser; here, html.parser})
2.
sth.find_all(TAG, class_='{class_name}') → find every tag matching TAG (and the given class) under sth

----

## Wait, a cookie?

- Function: tracking and storage
- In short, a small file that records your state
- Goal: customized preferences on your next visit

----

### Try to bring up the screen below

![](https://hackmd.io/_uploads/B1pfgLe82.png)

----

#### parent and child

- What will this print?

```
from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div id="container">
<h1>Example</h1>
<p>This is a paragraph.</p>
<a href="https://example.com">Link</a>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
h1_element = soup.find('h1')
print(h1_element.text)

div_element = h1_element.parent
print(div_element.name)
print(div_element.text)
```

----

## And what if we add this?

```
for child in div_element.children:
    if child.name:
        print(child.name)

p_element = soup.find('p')
print(p_element.text)

div_element = p_element.parent
print(div_element.name)
for child in div_element.children:
    if child.name:
        print(child.name)
```

----

## Ending

- bs4 seems pretty good; what about the remaining one, selenium?

---

## selenium

- Automates a user's behavior in a real browser
- e.g. AJAX pages that only load more content once you scroll to the bottom

----

### Install

```
pip install selenium
```

[chromedriver](https://chromedriver.chromium.org/)

----

# What is automated simulation?
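One way to picture it is as a rough pseudocode outline of a typical session (the concrete selenium calls appear on the following slides):

```
open a real browser window
load a URL
find an element (by name, id, link text, XPath, ...)
interact with it: type text, press keys, click
wait for the page to react
read what you need, then quit the browser
```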
----

#### Opening and closing a browser

```
from selenium import webdriver
import time

PATH = ""  # path to your chromedriver
browser = webdriver.Chrome(PATH)
browser.get("https://www.google.com")
time.sleep(5)
browser.quit()
```

----

#### Search for something

```
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

PATH = ""
browser = webdriver.Chrome(PATH)
browser.get("https://www.google.com")

search_box = browser.find_element(By.NAME, "q")
search_box.send_keys("NTHU")
search_box.send_keys(Keys.RETURN)

time.sleep(10)
browser.quit()
```

----

### Recap 1

- webdriver creates a browser object
- That object usually locates elements through By
- To automate the keyboard, use Keys

----

### Recap 2

- The biggest difference between bs4 and selenium:
- bs4 lets you move between tree nodes through parent and child
- selenium can trigger the page's interactive machinery for you; more precisely, it feeds parameters to the page's JS

----

### Practice: search again

- Rumor has it By works not only with NAME but also with ID

----

#### Answer

```
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

PATH = ""
browser = webdriver.Chrome(PATH)
browser.get("https://www.google.com")

search_box = browser.find_element(By.ID, "APjFqb")
search_box.send_keys("NTHU")
search_box.send_keys(Keys.RETURN)

time.sleep(10)
browser.quit()
```

----

### A link that is easy to click

![](https://hackmd.io/_uploads/rk7R8veLn.png)

----

```
browser = webdriver.Chrome()
browser.get("https://www.google.com")

link = browser.find_element(By.LINK_TEXT, "關於 Google")
link.click()

time.sleep(5)
browser.quit()
```

----

### A link that is harder to click

![](https://hackmd.io/_uploads/HkZODvlLn.png)

----

```
from selenium import webdriver
import time
from selenium.webdriver.common.by import By

PATH = "/Users/nuss/chromedriver_mac_arm64"
browser = webdriver.Chrome(PATH)
browser.get("https://www.youtube.com")
time.sleep(2)

l = browser.find_element(By.XPATH, "//a[@title='Shorts']")
l.click()

time.sleep(5)
browser.quit()
```

----

# XPath

- A way to describe paths in markup documents
- `@` means what follows is an attribute
- So "//a[@title='Shorts']" matches an `<a>` with that attribute, no matter how deep it sits

----

#### time.sleep feels like a clumsy approach

- from selenium.webdriver.support.ui import WebDriverWait
- Wait until the element with a given ID has actually finished loading
```
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.ID, "...")))
```

---

# END