# [5xCampus (五倍學院) Co-Recommended Workshop] Rust Web Scraping from Scratch - 朱章祺 (Bucky Chu)

{%hackmd @HWDC/BJOE4qInR %}

>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/lab-page/3301)
>#### 》[Fill out the session satisfaction survey | Send feedback to the hard-working speaker](https://forms.gle/ABzPoEGkGqiahz478)

Rust official website -> [https://www.rust-lang.org/](https://www.rust-lang.org/)
Rust package registry -> [https://crates.io/](https://crates.io/)
Scraping practice site -> [https://books.toscrape.com/](https://books.toscrape.com/)

## Fetching the page content

```rust!
use reqwest;

#[tokio::main]
async fn main() {
    let url = "https://books.toscrape.com/";
    // Send a GET request, then read the response body as text
    let response = reqwest::get(url).await.unwrap();
    let body = response.text().await.unwrap();
    println!("{}", body);
}
```

## Parsing the HTML

```rust!
use scraper::Html;
// ...
let document = Html::parse_document(&body);
// `Html` does not implement `Display`, so print the serialized root element
println!("{}", document.root_element().html());
```

## Creating selectors

```rust!
use scraper::{Html, Selector};
// ...
let book_selector = Selector::parse("article.product_pod").unwrap();
let title_selector = Selector::parse("h3 a").unwrap();
let price_selector = Selector::parse("div.product_price .price_color").unwrap();
```

## Extracting data in a loop

```rust!
for book in document.select(&book_selector) {
    let title = book
        .select(&title_selector)
        .next()
        .unwrap()
        .text()
        .collect::<String>();
    let price = book
        .select(&price_selector)
        .next()
        .unwrap()
        .text()
        .collect::<String>();
    println!("Title: {}", title);
    println!("Price: {}", price);
    println!("---");
}
```

## Getting the full title from an attribute

```rust=
let title_element = book.select(&title_selector).next().unwrap();
// Use the `title` attribute to get the full book title
let title = title_element.value().attr("title").unwrap_or("Unknown Title");
```

## Replacing unwrap()

```rust!
// The `?` operator requires the enclosing function to return a `Result`
let response = reqwest::get(url).await?;
let body = response.text().await?;
let document = Html::parse_document(&body);
let book_selector = Selector::parse("article.product_pod")?;
let title_selector = Selector::parse("h3 a")?;
let price_selector = Selector::parse("div.product_price .price_color")?;
```

### You can add a message to show on failure

```rust!
for book in document.select(&book_selector) {
    // `ok_or` turns a missing element (`None`) into an error carrying this message
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    // ...
}
```

## Creating a client

```rust!
let client = reqwest::Client::builder().build()?;
```

## Using a loop

```rust!
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder().build()?;
    for page in 1..=2 {
        // The first page lives at the site root; later pages are under /catalogue/
        let url = if page == 1 {
            "https://books.toscrape.com".to_string()
        } else {
            format!("https://books.toscrape.com/catalogue/page-{}.html", page)
        };
        println!("Scraping page: {}", url);
        let response = client.get(&url).send().await?;
        println!("Status: {}", response.status());
        // Skip this page if the request did not succeed
        if !response.status().is_success() {
            println!("Status code: {}", response.status());
            continue;
        }
    }
    Ok(())
}
```

## Checking that the number of books scraped per page is correct

```rust!
for page in 1..=2 {
    // omitted
    let body = response.text().await?;
    let document = Html::parse_document(&body);
    let book_selector = Selector::parse("article.product_pod")?;
    // omitted
    let mut book_count = 0;
    for _book in document.select(&book_selector) {
        // omitted
        book_count += 1;
    }
    println!("Found {} books on page {}", book_count, page);
}
```

## Steps for exporting to JSON

1. Import serde and File
```rust!
use serde::{Deserialize, Serialize};
use std::fs::File;
```
2. Define a struct and derive the serde traits
```rust!
#[derive(Serialize, Deserialize)]
struct Book {
    title: String,
    price: String,
}
```
3. Create a Vec
```rust
let mut books = Vec::new();
```
4. Push each book into the Vec
```rust!
for book in document.select(&book_selector) {
    book_count += 1;
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    books.push(Book {
        title: title.to_string(),
        price,
    });
    println!("Title: {}", title);
    // `price` was moved into `Book` above, so it can no longer be printed here
    // println!("Price: {}", price);
    println!("---");
}
```
5. Write the Vec to a JSON file
```rust!
let file = File::create("books.json")?;
serde_json::to_writer_pretty(file, &books)?;
println!("Data saved to books.json");
Ok(())
```

## Saving to Excel

```rust!
use xlsxwriter::Workbook;
```

```rust!
let workbook = Workbook::new("books.xlsx")?;
let mut sheet = workbook.add_worksheet(None)?;
// Row 0 holds the column headers, so data rows start at row 1
sheet.write_string(0, 0, "Title", None)?;
sheet.write_string(0, 1, "Price", None)?;
let mut row = 1;
```

```rust!
for book in document.select(&book_selector) {
    book_count += 1;
    let title_element = book
        .select(&title_selector)
        .next()
        .ok_or("Title element not found")?;
    let title = title_element
        .value()
        .attr("title")
        .ok_or("Title attribute not found")?;
    let price = book
        .select(&price_selector)
        .next()
        .ok_or("Price element not found")?
        .text()
        .collect::<String>();
    // Write one spreadsheet row per book: title in column 0, price in column 1
    sheet.write_string(row, 0, title, None)?;
    sheet.write_string(row, 1, &price, None)?;
    books.push(Book {
        title: title.to_string(),
        price,
    });
    row += 1;
}
```

## Automatically fetching data from all pages

```rust!
async fn get_total_pages(client: &reqwest::Client) -> Result<u32, Box<dyn Error>> {
    let url = "https://books.toscrape.com/index.html";
    let response = client.get(url).send().await?;
    let body = response.text().await?;
    let document = Html::parse_document(&body);
    let pager_selector = Selector::parse("ul.pager li.current")?;
    let pager_text = document
        .select(&pager_selector)
        .next()
        .ok_or("Pager element not found")?
        .text()
        .collect::<String>();
    // The last whitespace-separated word of the pager text is the total page count
    let total_pages = pager_text
        .split_whitespace()
        .last()
        .ok_or("Could not read the total page count")?
        .parse::<u32>()?;
    Ok(total_pages)
}
```

```rust!
// Get the total number of pages
let total_pages = get_total_pages(&client).await?;
println!("Total pages: {}", total_pages);
```
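
The notes stop once `total_pages` is known, so here is a minimal sketch of how that number might drive the page loop. It uses only the standard library and assumes the pager text has the form "Page 1 of 50". The helper names `parse_total_pages`, `page_url`, and `all_page_urls` are hypothetical and not part of the workshop code; the URL rule is copied from the loop in "Using a loop".

```rust
use std::error::Error;

/// Hypothetical helper: extracts the total page count from pager text.
/// Assumes the text looks like "Page 1 of 50" (the last word is the total).
fn parse_total_pages(pager_text: &str) -> Result<u32, Box<dyn Error>> {
    let total = pager_text
        .split_whitespace()
        .last()
        .ok_or("pager text is empty")?
        .parse::<u32>()?;
    Ok(total)
}

/// Hypothetical helper: builds the URL for one page, using the same rule
/// as the workshop loop (page 1 is the site root, later pages are numbered).
fn page_url(page: u32) -> String {
    if page == 1 {
        "https://books.toscrape.com".to_string()
    } else {
        format!("https://books.toscrape.com/catalogue/page-{}.html", page)
    }
}

/// Hypothetical helper: lists the URL of every page from 1 to `total_pages`.
fn all_page_urls(total_pages: u32) -> Vec<String> {
    (1..=total_pages).map(page_url).collect()
}
```

With something like this in place, the fixed `for page in 1..=2` range could become `for page in 1..=total_pages`, or the loop could iterate over `all_page_urls(total_pages)` directly.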