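---
tags: bahamut-pie, side project
---

# bahamut-pie - chapter2: web crawler🐞

What front-end developers lack most when practicing is data, and the simplest fix is to hook up open data.
Node.js lets front-end developers write a crawler in JavaScript and run it on a server.

## build a crawler
My first idea is to build an API that crawls the page and returns the data whenever it is called.
I'm not sure this is the right approach, but let's build it first and see.

### how to?
1. Use request to fetch the HTML of the page to crawl
2. Use cheerio to select elements on that page; selecting works the same way as in jQuery
3. Extract the data from the selected elements and return it as JSON

**The following is an example, not the actual code**

```javascript=
const request = require('request');
const cheerio = require('cheerio');

// 1. Fetch the HTML of the article list (url is the page to crawl)
request(url, (err, res, body) => {
  if (err) throw err;
  // 2. Got the HTML (body), load it into cheerio
  const $ = cheerio.load(body);
  // 3. Walk every .article-link on the page and rebuild each element as a plain object
  //    (.map().get() collects the results; returning from .each() would discard them)
  const articles = $('.article-link')
    .map((i, el) => ({
      title: $(el).text(),
      url: $(el).attr('href'),
    }))
    .get();
});

// 4. A single record looks like this; the traversal above collects them into an array
// { title: 'article title', url: 'https://article-url' }
```

#### note🤏
Before writing a crawler, take a look at the target site's HTML structure. Bahamut's markup is a bit ugly, but it is simple and easy to crawl.
Form pages, infinite scroll, and sites with anti-crawler measures need more complex approaches.

### turning it into an API
Front-end developers should be familiar with APIs: calling one with different parameters returns different data.
e.g. to look up the article data of handred800, add `id=handred800` to the API URL.
So we need an endpoint that accepts the parameter.

**The following is an example, not the actual code**

``` javascript
const express = require('express');
const app = express();

// Entry point of the API; an ajax call to this API starts here
app.get("/", async (req, res) => {
  // Read the user ID parameter from the request
  const userId = req.query.id;
  const url = `http://home.gamer.com.tw/?owner=${userId}`;
  // Call the crawler from the previous section
  // (assumes articleCrawler wraps it in a function that returns a Promise)
  const articles = await articleCrawler.getArticles(url);

  // Allow cross-origin requests from anywhere
  res.header('Access-Control-Allow-Origin', '*');
  res.header('Access-Control-Allow-Methods', 'GET,PUT,POST,DELETE,OPTIONS');

  // Finally, send the whole articles array back
  return res.json(articles);
});
```

If you haven't used Node.js this is probably hard to follow, so let's look at how to call the API once it is done; that should make it clearer.
Suppose my code is hosted at https://crawler-api.com

``` javascript
// ajax call to the API
fetch('https://crawler-api.com/?id=handred800')
  .then((res) => res.json()) // parse as JSON
  .then((data) => {
    console.log(data); // got the data!
  });
```

## deploy on Heroku
Deploying to a server = putting the code on some machine that runs it around the clock.
For an API, that means waiting to be called at any time.

Heroku is easy to use, and you can upload files to the server from the command line.
I linked my GitHub repo instead, so every push triggers an automatic redeploy, which is very convenient.
The free plan has usage limits and a sleep mechanism, but it is plenty for a small project.

The finished project is linked below. The readme explains how to use the API, and reading the two together makes things much easier to follow.
[**github page**](https://github.com/handred800/bahamut-home-article-cralwer)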
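
One small addition to the API example above: the `id` parameter comes straight from the query string and is pasted into the crawl URL, so it may be worth validating it first. The sketch below is only an illustration. `buildOwnerUrl` is a hypothetical helper name, and the allowed-character rule is an assumption, not Bahamut's documented ID format.

```javascript
// Hypothetical helper: builds the profile URL that the crawler fetches.
// ASSUMPTION: IDs are 1-32 letters, digits, or underscores. Adjust as needed.
function buildOwnerUrl(userId) {
  if (typeof userId !== 'string' || !/^[A-Za-z0-9_]{1,32}$/.test(userId)) {
    throw new TypeError('id must be 1-32 letters, digits, or underscores');
  }
  // The built-in URL class handles query-string encoding
  const url = new URL('http://home.gamer.com.tw/');
  url.searchParams.set('owner', userId);
  return url.toString();
}

console.log(buildOwnerUrl('handred800'));
```

Inside the route handler, a thrown `TypeError` could be caught and turned into a 400 response, rather than crawling a malformed URL.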