# [五倍學院 Co-Recommended Workshop] Rust Web Scraping from Scratch - 朱章祺 (Bucky Chu)
{%hackmd @HWDC/BJOE4qInR %}
>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/lab-page/3301)
>#### 》[Fill out the session satisfaction survey | Send feedback to the hard-working speaker](https://forms.gle/ABzPoEGkGqiahz478)
Rust official website -> [https://www.rust-lang.org/](https://www.rust-lang.org/)
Rust package registry -> [https://crates.io/](https://crates.io/)
Web scraping practice site -> [https://books.toscrape.com/](https://books.toscrape.com/)
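## Fetching the page content
```rust!
// Requires the `reqwest` and `tokio` crates
// (tokio with the "macros" and "rt-multi-thread" features).
#[tokio::main]
async fn main() {
    let url = "https://books.toscrape.com/";
    // Send a GET request and wait for the response.
    let response = reqwest::get(url).await.unwrap();
    // Read the response body as text (the page's HTML).
    let body = response.text().await.unwrap();
    println!("{}", body);
}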
```
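## Parsing the HTML
```rust!
use scraper::Html;
// ...
// Parse the response body into a document tree.
let document = Html::parse_document(&body);
// `Html` does not implement `Display`, so serialize the root element
// back into an HTML string before printing it.
println!("{}", document.root_element().html());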
```
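## Creating selectors
```rust!
use scraper::{Html, Selector};
// ...
// Each book on the page is an <article class="product_pod">.
let book_selector = Selector::parse("article.product_pod").unwrap();
// The title link inside the book's <h3>.
let title_selector = Selector::parse("h3 a").unwrap();
// The price element inside <div class="product_price">.
let price_selector = Selector::parse("div.product_price .price_color").unwrap();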
```
## Extracting data in a loop
```rust!
for book in document.select(&book_selector) {
    // Take the first matching title element and join its text.
    let title = book
        .select(&title_selector)
        .next()
        .unwrap()
        .text()
        .collect::<String>();
    // Do the same for the price.
    let price = book
        .select(&price_selector)
        .next()
        .unwrap()
        .text()
        .collect::<String>();
    println!("Title: {}", title);
    println!("Price: {}", price);
    println!("---");
}
```
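The `.text().collect::<String>()` chain above works because `text()` yields the element's text fragments as string slices, and `collect::<String>()` concatenates them. Below is a minimal standard-library sketch of that step alone. The `join_text` helper and the sample fragments are hypothetical and only stand in for a real element.

```rust
/// Hypothetical helper: concatenate text fragments the way
/// `.text().collect::<String>()` does for an element's text nodes.
fn join_text<'a>(fragments: impl Iterator<Item = &'a str>) -> String {
    fragments.collect::<String>()
}

fn main() {
    // Sample fragments (illustrative only, not scraped data).
    let fragments = vec!["A Light in the", " ..."];
    let title = join_text(fragments.into_iter());
    println!("Title: {}", title);
}
```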
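## Getting the full title from an attribute
```rust!
let title_element = book.select(&title_selector).next().unwrap();
// The link text is truncated on the page; the `title` attribute holds the full book title.
let title = title_element.value().attr("title").unwrap_or("Unknown Title");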
```
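## Replacing unwrap()
```rust!
// `?` returns the error to the caller instead of panicking, so the enclosing
// function (here `main`) must return a `Result`, e.g. `Result<(), Box<dyn Error>>`.
let response = reqwest::get(url).await?;
let body = response.text().await?;
let document = Html::parse_document(&body);
let book_selector = Selector::parse("article.product_pod")?;
let title_selector = Selector::parse("h3 a")?;
let price_selector = Selector::parse("div.product_price .price_color")?;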
```
### Adding a message to show on failure
```rust!
for book in document.select(&book_selector) {
    // `ok_or` turns a missing element (`None`) into an error with a message.
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    // ...
}
```
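The `ok_or("...")?` pattern above can be tried without any scraping crates. This is a minimal sketch using only the standard library, where the hypothetical `first_item` function stands in for `book.select(..).next()` on a list of matches:

```rust
use std::error::Error;

/// Hypothetical stand-in for `book.select(..).next()`:
/// returns the first item, or an error message if there is none.
fn first_item<'a>(items: &[&'a str]) -> Result<&'a str, Box<dyn Error>> {
    // `ok_or` turns `None` into `Err("...")`; `?` then converts the `&str`
    // into a `Box<dyn Error>` and returns early.
    let item = items.first().copied().ok_or("no element found")?;
    Ok(item)
}

fn main() {
    // Sample data (illustrative only).
    println!("{:?}", first_item(&["A Light in the Attic"]).ok());
    if let Err(e) = first_item(&[]) {
        println!("error: {}", e);
    }
}
```

The message passed to `ok_or` becomes the text of the returned error, which is what the caller sees when the element is missing.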
## Creating a client
```rust!
let client = reqwest::Client::builder().build()?;
```
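## Looping over pages
```rust!
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Reuse one client for all requests.
    let client = reqwest::Client::builder().build()?;
    for page in 1..=2 {
        // The first page is at the site root; later pages are under /catalogue/.
        let url = if page == 1 {
            "https://books.toscrape.com".to_string()
        } else {
            format!("https://books.toscrape.com/catalogue/page-{}.html", page)
        };
        println!("Scraping page: {}", url);
        let response = client.get(&url).send().await?;
        println!("Status: {}", response.status());
        // Skip pages that did not return a success (2xx) status.
        if !response.status().is_success() {
            println!("Request failed, status code: {}", response.status());
            continue;
        }
    }
    Ok(())
}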
```
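The URL rule inside the loop can be pulled out into a small function so it is easy to check on its own. The sketch below assumes a hypothetical helper named `page_url`, which is not part of the workshop code, and uses the same two URL shapes as the loop above.

```rust
/// Hypothetical helper: build the URL for a 1-based page number,
/// following the same rule as the loop above.
fn page_url(page: u32) -> String {
    if page == 1 {
        "https://books.toscrape.com".to_string()
    } else {
        format!("https://books.toscrape.com/catalogue/page-{}.html", page)
    }
}

fn main() {
    for page in 1..=2 {
        println!("{}", page_url(page));
    }
}
```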
## Checking that the number of books scraped per page is correct
```rust!
for page in 1..=2 {
    // omitted
    let body = response.text().await?;
    let document = Html::parse_document(&body);
    let book_selector = Selector::parse("article.product_pod")?;
    // omitted
    // Count the books matched on this page.
    let mut book_count = 0;
    for _book in document.select(&book_selector) {
        // omitted
        book_count += 1;
    }
    println!("Page {}: found {} books", page, book_count);
}
```
## Steps for exporting to JSON
1. Import serde and File
```rust!
use serde::{Deserialize, Serialize};
use std::fs::File;
```
2. Define a struct and derive the serde traits
```rust!
#[derive(Serialize, Deserialize)]
struct Book {
    title: String,
    price: String,
}
```
3. Create a Vec
```rust
let mut books = Vec::new();
```
4. Push each book into the Vec
```rust!
for book in document.select(&book_selector) {
    book_count += 1;
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    books.push(Book {
        title: title.to_string(),
        price,
    });
    println!("Title: {}", title);
    // `price` was moved into `Book` above, so it can no longer be printed here.
    // println!("Price: {}", price);
    println!("---");
}
```
5. Write the Vec to `books.json`
```rust!
// Create (or overwrite) books.json and write the Vec as pretty-printed JSON.
let file = File::create("books.json")?;
serde_json::to_writer_pretty(file, &books)?;
println!("Data saved to books.json");
Ok(())
```
## Saving to Excel
```rust!
use xlsxwriter::Workbook;
```
```rust!
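// Create the workbook and add a worksheet with the default name.
let workbook = Workbook::new("books.xlsx")?;
let mut sheet = workbook.add_worksheet(None)?;
// Header row.
sheet.write_string(0, 0, "Title", None)?;
sheet.write_string(0, 1, "Price", None)?;
// Data starts on the second row (index 1).
let mut row = 1;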
```
```rust!
for book in document.select(&book_selector) {
    book_count += 1;
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    // Write one spreadsheet row per book: title in column 0, price in column 1.
    sheet.write_string(row, 0, title, None)?;
    sheet.write_string(row, 1, &price, None)?;
    books.push(Book {
        title: title.to_string(),
        price,
    });
    row += 1;
}
```
## Automatically fetching data from all pages
```rust!
async fn get_total_pages(client: &reqwest::Client) -> Result<u32, Box<dyn Error>> {
    let url = "https://books.toscrape.com/index.html";
    let response = client.get(url).send().await?;
    let body = response.text().await?;
    let document = Html::parse_document(&body);
    // The pager shows text such as "Page 1 of 50".
    let pager_selector = Selector::parse("ul.pager li.current")?;
    let pager_text = document
        .select(&pager_selector)
        .next()
        .ok_or("Pagination data not found")?
        .text()
        .collect::<String>();
    // The last whitespace-separated token is the total page count.
    let total_pages = pager_text
        .split_whitespace()
        .last()
        .ok_or("Could not get the total page count")?
        .parse::<u32>()?;
    Ok(total_pages)
}
```
```rust!
// Get the total number of pages
let total_pages = get_total_pages(&client).await?;
println!("Total pages: {}", total_pages);
```
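The text-parsing step inside `get_total_pages` can be checked in isolation, without a network request. The sketch below assumes the pager text has the shape "Page X of Y". The `parse_total_pages` name and the sample string are hypothetical and not part of the workshop code.

```rust
use std::error::Error;

/// Hypothetical helper: take the last whitespace-separated token
/// of the pager text and parse it as the total page count.
fn parse_total_pages(pager_text: &str) -> Result<u32, Box<dyn Error>> {
    let total = pager_text
        .split_whitespace()
        .last()
        .ok_or("could not read the total page count")?
        .parse::<u32>()?;
    Ok(total)
}

fn main() -> Result<(), Box<dyn Error>> {
    // Sample pager text (illustrative only); surrounding whitespace is ignored.
    let total = parse_total_pages("\n    Page 1 of 50\n")?;
    println!("Total pages: {}", total);
    Ok(())
}
```

Keeping the parsing separate from the request makes it possible to test the error paths (empty text, a non-numeric last token) without touching the site.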