# Crawling
---
![](https://hackmd.io/_uploads/ByJVTWg8h.png =30%x)
----
![](https://hackmd.io/_uploads/rJ1Eabl8n.png =30%x)
----
![](https://hackmd.io/_uploads/SyyE6beLh.png =30%x)
----
[A magical link](https://m.facebook.com/story.php?story_fbid=pfbid02QgUKoCxpnbYTyzzFBY2iiXzWdqMyZMiEg3tKGYhUqNWBq8oJEodfzKYKZ1bq3cRyl&id=100000046793737&mibextid=qC1gEa)
---
## What is crawling?
- Motivation: there is too much information to fetch by hand (or we simply don't want to)
- Principle: web pages are made of HTML
1. case 1: the data is already in the front-end HTML, so we can scrape it directly
2. case 2: the front end sends a request to the back end; we send that same request ourselves
----
## Why HTML?
1. HTML: content
2. CSS: appearance
3. JS: interactivity
----
## Today's agenda
1. review HTML
2. request practice
3. bs4 introduction
4. bs4 practice
5. selenium introduction
6. selenium practice
---
# Begin
----
#### open it!
[source](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)
```
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
```
----
## HTML review
| tag | attribute | text |
| -------- | -------- | -------- |
| what kind of element | the element's properties | the visible text |
![](https://hackmd.io/_uploads/Bk_IVflU3.png =50%x)
```
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
```
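The tag / attribute / text split above can be checked with a small sketch that uses only Python's standard-library `HTMLParser` (bs4 gives you the same information far more conveniently later on):

```
from html.parser import HTMLParser

# Minimal stdlib sketch: pull the tag name, attributes, and text
# out of the <a> element shown above.
class AnchorInspector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}
        self.text = ""
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.tag = tag
            self.attrs = dict(attrs)
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

inspector = AnchorInspector()
inspector.feed('<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>')
print(inspector.tag)          # the tag:       a
print(inspector.attrs["id"])  # an attribute:  link3
print(inspector.text)         # the text:      Tillie
```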
----
## One more borrowed diagram
![](https://hackmd.io/_uploads/S1QgHzeUh.png =50%x)
----
## install
| requests | beautifulsoup4 | html5lib |
| -------- | -------- | -------- |
| fetches the HTML | parses the HTML | a slow but strict parser backend for bs4 |
```
pip3 install beautifulsoup4 requests html5lib
```
```
python3 -m pip install beautifulsoup4 --user
```
---
#### requests example 1
```
import requests
r = requests.get('https://www.python.org')
print(r.status_code)
print(r.content)
print(b'Python is a programming language' in r.content)
```
----
# request practice
----
[click](https://neoj.sprout.tw/status/)
1. Find the XHR (XML HTTP Request) requests
2. Under an XHR entry, find
   1. Headers
   2. Payload
   3. Preview
3. It should look like the screenshots below
----
![](https://hackmd.io/_uploads/S1TpAflIh.png)
----
![](https://hackmd.io/_uploads/Bkv1JmlIn.png)
----
![](https://hackmd.io/_uploads/Sy9Z17gIh.png)
----
#### Pull down all the data the backend returns
```
import requests
import json
url = 'https://neoj.sprout.tw/api/challenge/list'
payload = '{"offset":0,"filter":{"user_uid":null,"problem_uid":null,"result":null},"reverse":true,"chal_uid_base":null}'
status_res = requests.post(url, payload)
# print(type(status_res))
status = json.loads(status_res.text)
# print(status)
```
----
## Quick recap
1. import the relevant modules
2. You can submit POST and GET requests (HEAD, DELETE, PUT, etc. also exist)
3. GET cannot change server state directly, POST can (GET can change state through the URL, but that is discouraged)
4. Provide the URL to fetch from
5. Provide the request state (the payload)
6. After receiving the data, organize it with a format such as JSON
7. Extract the data you actually want
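The payload/parse half of those steps can be sketched without any network call; `fake_response_text` below is a made-up response body in the same shape the NEOJ API returns:

```
import json

# Steps 5-6 above: build the payload as a dict, serialize it for the
# POST body, then parse a (hypothetical) response body back into
# Python objects. No request is actually sent here.
payload = {
    "offset": 0,
    "filter": {"user_uid": None, "problem_uid": None, "result": None},
    "reverse": True,
    "chal_uid_base": None,
}
body = json.dumps(payload)  # what requests.post(url, body) would send

# A fabricated response, only to demonstrate json.loads:
fake_response_text = '{"data": [{"submitter": {"name": "alice"}}]}'
status = json.loads(fake_response_text)
print(status["data"][0]["submitter"]["name"])  # → alice
```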
----
### practice: submitter names only
----
#### Answer
```
status_data = reversed(status['data'])
names = []
for x in status_data:
names.append(x['submitter']['name'])
```
----
### practice: problem names only
----
#### Answer
```
problems = []
for x in status_data:
problems.append(x['problem']['name'])
print(", ".join(problems))
```
----
### practice: when is a submission AC?
----
- A submission is AC when its result field is 1
----
### practice: list the problems you have ACed
----
#### Answer
```
user_id = 3122 # replace with your own user id
profile_res = requests.post('https://neoj.sprout.tw/api/user/'+str(user_id)+'/profile', '{}')
stats_res = requests.post('https://neoj.sprout.tw/api/user/'+str(user_id)+'/statistic', '{}')
# parse the received data as JSON
profile = profile_res.json() # == json.loads(profile_res.text)
stats = stats_res.json()
# print(profile)
# print(stats)
categories = {0: 'Universe', 3: 'Python'}
print('Name:', profile['name'])
print('Class:', categories[profile['category']])
print('Rate:', profile['rate'])
print('Tried problems:')
tried = []
for x in stats['tried_problems']:
    tried.append(x)
print(', '.join(tried))
print('Passed problems:')
passed = []
for x, res in stats['tried_problems'].items():
    if res['result'] == 1:  # result 1 means AC
        passed.append(x)
print(', '.join(passed))
```
----
- requests looks convenient enough, so why use bs4 or selenium at all?
![](https://hackmd.io/_uploads/S1QgHzeUh.png =50%x)
----
#### Answer
- Imagine we want to print every subheading on a page
- When the HTML tree gets complicated, we need a tool that parses it for us
---
# bs4
- A handy tool for parsing the HTML tree
----
# Vocabulary
1. node: a point in the tree
2. child (children): the node(s) directly below
3. parent: the node directly above
4. leaf: a node with no children
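These terms can be pinned down with a toy tree in plain Python (no bs4 needed yet); the node names are invented for illustration:

```
# A toy tree: each node has a name and a list of children.
tree = {"name": "html", "children": [
    {"name": "body", "children": [
        {"name": "p", "children": []},   # a leaf: no children
        {"name": "a", "children": []},   # also a leaf
    ]},
]}

def leaves(node):
    """Collect the names of all leaf nodes (nodes with no children)."""
    if not node["children"]:
        return [node["name"]]
    result = []
    for child in node["children"]:
        result.extend(leaves(child))
    return result

print(leaves(tree))  # → ['p', 'a']
```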
----
#### example: scrape basic article data from PTT
```
import requests
from bs4 import BeautifulSoup
url = "https://www.ptt.cc/bbs/Gossiping/index.html"
cookies = {'over18':'1'}
htmlfile = requests.get(url, cookies = cookies)
doc = BeautifulSoup(htmlfile.text, 'html.parser')
articles = doc.find_all('div', class_ = 'r-ent')
number = 0
for article in articles:
title = article.find('a')
author = article.find('div', class_ = 'author')
date = article.find('div', class_ = 'date')
number += 1
print("id:", number)
print("title:", title.text)
print("author:", author.text)
print("time:", date.text)
print("="*10)
```
![](https://hackmd.io/_uploads/H1klV8l8n.png)
----
## bs4 basics
1. BeautifulSoup({the text of the requests response}, {an HTML parser (here, html.parser)})
2. sth.find_all(TAG, class_ = '{class_name}') -> find every element under sth whose tag is TAG (optionally filtered by class)
----
## Hey, what about that cookie?
- Role: tracking and storing state
- Simply put, a small file that records your state
- Goal: your customized preferences are ready on your next visit.
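A cookie really is just named state echoed back to the server; the standard library's `SimpleCookie` shows the wire format, using the same `over18=1` cookie the PTT example sends:

```
from http.cookies import SimpleCookie

# A cookie is a named piece of state the browser sends back on the
# next request. PTT's age gate expects over18=1, exactly as in the
# requests.get(..., cookies={'over18': '1'}) call above.
cookie = SimpleCookie()
cookie["over18"] = "1"
header = cookie["over18"].OutputString()
print(header)  # → over18=1
```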
----
### Try to bring up the screen below
- ![](https://hackmd.io/_uploads/B1pfgLe82.png)
----
#### parent and child
- What will this output?
```
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div id="container">
<h1>Example</h1>
<p>This is a paragraph.</p>
<a href="https://example.com">Link</a>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
h1_element = soup.find('h1')
print(h1_element.text)
div_element = h1_element.parent
print(div_element.name)
print(div_element.text)
```
----
## What if we add this?
```
for child in div_element.children:
if child.name:
print(child.name)
p_element = soup.find('p')
print(p_element.text)
div_element = p_element.parent
print(div_element.name)
for child in div_element.children:
if child.name:
print(child.name)
```
----
## Ending
- bs4 seems nice, so what about the remaining tool, selenium?
---
## selenium
- Automates a user's behavior in a real browser
- e.g. AJAX pages that only load more content when you scroll to the bottom
----
### install
```
pip install selenium
```
[chromedriver](https://chromedriver.chromium.org/)
----
# What is automated simulation?
----
#### Opening and closing the browser
```
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
PATH = ""  # fill in your chromedriver path
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(PATH))
browser.get("https://www.google.com")
time.sleep(5)
browser.quit()
```
----
#### Search sth
```
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
PATH = ""  # fill in your chromedriver path
browser = webdriver.Chrome(service=Service(PATH))
browser.get("https://www.google.com")
search_box = browser.find_element(By.NAME, "q")
search_box.send_keys("NTHU")
search_box.send_keys(Keys.RETURN)
time.sleep(10)
browser.quit()
```
----
### Quick recap 1
- webdriver creates a browser object
- That object usually locates elements through By
- To automate keyboard input, use Keys
----
### Quick recap 2
- The biggest difference between bs4 and selenium:
- bs4 lets you move between tree nodes through parent and child
- selenium can trigger the page's interactive machinery, i.e. feed input to its JavaScript
----
### practice:search again
- Rumor has it By works with more than NAME; ID works too
----
#### Answer
```
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
PATH = ""  # fill in your chromedriver path
browser = webdriver.Chrome(service=Service(PATH))
browser.get("https://www.google.com")
# "APjFqb" is the id Google currently assigns to its search box; it may change
search_box = browser.find_element(By.ID, "APjFqb")
search_box.send_keys("NTHU")
search_box.send_keys(Keys.RETURN)
time.sleep(10)
browser.quit()
```
----
### A link that is easy to click
![](https://hackmd.io/_uploads/rk7R8veLn.png)
----
```
browser = webdriver.Chrome()
browser.get("https://www.google.com")
link = browser.find_element(By.LINK_TEXT, "關於 Google")
link.click()
time.sleep(5)
browser.quit()
```
----
### A link that is harder to click
![](https://hackmd.io/_uploads/HkZODvlLn.png)
----
```
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.common.by import By
PATH = "/Users/nuss/chromedriver_mac_arm64"
browser = webdriver.Chrome(service=Service(PATH))
browser.get("https://www.youtube.com")
time.sleep(2)
l = browser.find_element(By.XPATH, "//a[@title='Shorts']")
l.click()
time.sleep(5)
browser.quit()
```
----
# XPath
- A language for describing paths in a markup document
- `@` means what follows is an attribute name
- So "//a[@title='Shorts']" matches any `<a>` element, at any depth, whose title attribute is 'Shorts'.
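The standard library's ElementTree supports a small subset of XPath, enough to illustrate the pattern without launching a browser (note ElementTree wants a leading `.` for "search from here", where selenium's `//` already means "anywhere"):

```
import xml.etree.ElementTree as ET

# A miniature page with the same structure the selenium example targets.
doc = ET.fromstring(
    "<html><body><div><a title='Shorts' href='/shorts'>Shorts</a></div></body></html>"
)

# ".//a[@title='Shorts']": an <a> at any depth whose title attribute
# equals 'Shorts' — the @ marks an attribute, as described above.
link = doc.find(".//a[@title='Shorts']")
print(link.get("href"))  # → /shorts
```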
----
#### time.sleep feels like a clumsy approach
- from selenium.webdriver.support.ui import WebDriverWait
- from selenium.webdriver.support import expected_conditions as EC
- Wait until the element with a given ID has actually appeared
```
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.ID, "...")))
```
---
# END