# 106-2 IT Club Course: Web Crawler
## This document: https://goo.gl/tsGU9p

---

## Part I Crawler - 2018/03/08
### Review
* [JavaScript](http://slides.com/hank1997/deck-1#/)
* [Repl.it](https://repl.it/)

<a id="你需要做的是:"></a>
### What you need to do:
* [Download and install node](https://nodejs.org/en/)

---

### Reference:
* ["A Bug's Life" (蟲蟲危機) slides](https://speakerdeck.com/aaaddress1/node-dot-js-chong-chong-wei-ji-shou-zuo-ni-de-di-ge-zi-dong-shang-wang-zi-liao-fen-xi-ji-qi-ren)

### Start:
1. Create a folder
2. Create a file with the `.js` extension
3. Put anything in it, e.g.:
```jsx=
console.log('1 + 1 = ', 1 + 1);
```
4. Run it with `node [filename]`

### Step 2:
1. [The request package](https://github.com/request/request)
2. Install it with `npm install request`
3. e.g.:
```javascript=
const request = require('request');

request('https://www.google.com', (error, response, body) => {
  console.log('error: ', error);
  console.log('statusCode: ', response && response.statusCode);
  console.log(body);
});
```

### Step 3:
1. Try crawling ptt: https://www.ptt.cc
2. Review tags (CSS selectors): https://flukeout.github.io/
3. How do we save what we crawled?

```javascript=
// Also requires: npm install cheerio
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

let HotBoardsJson = [];
let url = 'https://www.ptt.cc';

request(url + '/bbs/index.html', (error, response, body) => {
  console.log('error: ', error);
  console.log('statusCode: ', response && response.statusCode);
  // console.log('body: ', body);
  if (error) return; // there is no body to parse when the request failed

  const $ = cheerio.load(body);
  const boardArr = $('a.board');
  for (let i = 0; i < boardArr.length; i++) {
    const boardName = $(boardArr[i]).find('.board-name').text();
    const boardClass = $(boardArr[i]).find('.board-class').text();
    const boardLink = url + $(boardArr[i]).attr('href');
    HotBoardsJson.push({ boardName, boardClass, boardLink });
    console.log(`${boardName}\n>> ${boardLink}\n>> Category: ${boardClass}\n`);
  }
  // Write the file here!
  fs.writeFile('HotBoards.json', JSON.stringify(HotBoardsJson), function (err) {
    if (err) console.log(err);
    else console.log('File ' + 'HotBoards.json' + ' written!');
  });
  console.log('Total: ' + HotBoardsJson.length + ' boards\n');
});
```

---

## Part II Attack of the Crawler - 2018/03/14
### Before we start:
:::danger
### Please repeat the previous step!
[Install node](#%E4%BD%A0%E9%9C%80%E8%A6%81%E5%81%9A%E7%9A%84%E6%98%AF%EF%BC%9A)
:::

* How do we save what we crawled?
    1. When running the script, use `>` + a filename
        - [terminal output to a file](https://askubuntu.com/questions/420981/how-do-i-save-terminal-output-to-a-file)
    2. Use the `fs` module to write a file.
* Explaining `() => {}`, a very handy ES6 syntax: [**arrow functions**](https://eyesofkids.gitbooks.io/javascript-start-from-es6/content/part4/arrow_function.html)
* [Template literals](https://developer.mozilla.org/zh-TW/docs/Web/JavaScript/Reference/Template_literals)
    - string operations in console.log

### Example: crawl the articles on the Beauty board index page and save them as JSON:
**Give it a try!**

~~Mystery voice: I don't know what JSON is QQ~~
- See last semester's [notes](https://hackmd.io/ZFIcAv9HTXG8d1i_sBEbIA#JSON-%E6%A0%BC%E5%BC%8F)
- [An example :chestnut:](https://github.com/jayhung97724/106-2NCHUIT-web/blob/master/HotArticles.json)
- [More explanation](https://blog.wu-boy.com/2011/04/%E4%BD%A0%E4%B8%8D%E5%8F%AF%E4%B8%8D%E7%9F%A5%E7%9A%84-json-%E5%9F%BA%E6%9C%AC%E4%BB%8B%E7%B4%B9/)

```javascript=
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

let HotArticles = [];
let url = 'https://www.ptt.cc';

request(url + '/bbs/Beauty/index.html', (error, response, body) => {
  console.log('error: ', error);
  console.log('statusCode: ', response && response.statusCode);
  // console.log('body: ', body);
  if (error) return; // there is no body to parse when the request failed

  const $ = cheerio.load(body);
  const articleArr = $('div.r-ent');
  // console.log(articleArr[0]);
  // The last 5 entries are skipped (presumably the pinned posts at the bottom of the index page)
  for (let i = 0; i < articleArr.length - 5; i++) {
    const articleName = $(articleArr[i]).find('.title').find('a').text();
    const articlePush = $(articleArr[i]).find('.nrec').find('.f2').text();
    const articleDate = $(articleArr[i]).find('.meta').find('.date').text();
    const articleAuthor = $(articleArr[i]).find('.meta').find('.author').text();
    const articleLink = url +
      $(articleArr[i]).find('.title').find('a').attr('href'); // e.g. https://www.ptt.cc/bbs/Beauty/M.1520378107.A.181.html
    // Entries without a link (e.g. deleted posts) give attr('href') === undefined,
    // so the concatenated string contains "undefined"; skip those.
    if (articleLink.indexOf('undefined') === -1) {
      HotArticles.push({
        articleName: articleName,
        articleDate: articleDate,
        articleLink: articleLink,
        articleAuthor: articleAuthor,
        articlePush: (articlePush === '') ? 0 : parseInt(articlePush, 10)
      });
      console.log(`${articleName}\n>> ${articleLink}\n>> Date: ${articleDate}\n>> Author: ${articleAuthor}\n>> Pushes: ${articlePush}\n`);
    }
  }
  fs.writeFile('HotArticles.json', JSON.stringify(HotArticles), function (err) {
    if (err) console.log(err);
    else console.log('File ' + 'HotArticles.json' + ' written!');
  });
  console.log('Total: ' + HotArticles.length + ' articles\n');
});
```

### Next: how to get the image links!
#### 1. First [Inspect](https://www.ptt.cc/bbs/Beauty/M.1521033451.A.439.html) the page elements
Try one:
```javascript=
var cheerio = require('cheerio'); // Basically jQuery for node.js
var request = require('request-promise');

var options = {
  uri: 'https://www.ptt.cc/bbs/Beauty/M.1521110977.A.EE0.html',
  transform: function (body) {
    return cheerio.load(body);
  }
};

request(options)
  .then(function ($) {
    // Process html like you would with jQuery...
    console.log($('#main-content a'));
  })
  .catch(function (err) {
    // Crawling failed or Cheerio choked...
    console.log(err);
  });
```

#### 2. Handling multiple requests
Q: Why is this needed?
A: Requests take time. Writing the file only makes sense after every request has finished (all image links have been fetched); otherwise the file ends up empty.
* [Introduction to Promise](https://eyesofkids.gitbooks.io/javascript-start-es6-promise/content/contents/basic_usage.html)
* [Another Promise example](http://www.oxxostudio.tw/articles/201706/javascript-promise-settimeout.html)
* [Combining it with request-promise](https://www.npmjs.com/package/request-promise)

```javascript=
// npm install request request-promise cheerio q
const request = require('request-promise');
const cheerio = require('cheerio');
const q = require('q');
const fs = require('fs');

let articleArr = JSON.parse(fs.readFileSync('./HotArticles.json', 'utf8'));
let promises = [];
let urls = [];

for (let index = 0; index < articleArr.length; index++) {
  urls.push({
    uri: articleArr[index].articleLink,
    transform: (body) => {
      return cheerio.load(body);
    }
  }); // this is the options-object form of a request-promise call
}
console.log(urls);

let fetchPages = (urls) => {
  // forEach is much more concise, isn't it?
  urls.forEach((url) => {
    promises.push(request(url));
  });
  // q is a promise library (not a queue); q.all resolves once every promise has resolved
  return q.all(promises);
};

let writeJSON = () => {
  console.log(articleArr);
  fs.writeFile('HotBeauties.json', JSON.stringify(articleArr), function (err) {
    if (err) console.log(err);
    else console.log('File ' + 'HotBeauties.json' + ' written!');
  });
};

fetchPages(urls).then((pages) => {
  console.log(pages.length);
  // [0]('#main-content a')[0].attribs.href
  for (let pgI = 0; pgI < pages.length; pgI++) {
    let $ = pages[pgI];
    // console.log($)
    let imgLinks = [];
    const imgArr = $('#main-content a');
    // console.log(imgArr.length)
    for (let imgI = 0; imgI < imgArr.length; imgI++) {
      let imgUrl = imgArr[imgI].attribs.href;
      // keep only links that exist and contain ".jpg"
      if (imgUrl && imgUrl.indexOf('.jpg') !== -1) {
        // console.log(imgUrl)
        imgLinks.push(imgUrl);
      }
    }
    articleArr[pgI].articleImages = imgLinks;
  }
  writeJSON();
}).catch((err) => {
  console.log(err);
});
```

**Supplementary introduction to queues:**
{%youtube UpvDOm3prfI %}

## Part III The Handsome Crawler - 2018/03/22
```
Martian octopus monsters
~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~
 /|||\   /|||\   /|||\   /|||\   /|||\   /|||\
```

### Before we start:
:::success
Download your earlier work, or download it from
[My repo](https://github.com/jayhung97724/106-2NCHUIT-web)
:::

### Basic Sublime Text setup:
1. `ctrl` + `shift` + `P`
2. Type `PC`
3. Install `Package Control`
4. `ctrl` + `shift` + `P`
5. Type `PCI` (Package Control: Install Package) and install `emmet`

### Display
#### Basic list:
```htmlembedded=
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>BasicList</title>
</head>
<body>
  <ul>
    <li class='myTitle'>title on web</li>
    <ol>
      <li class='myContent'>content on web</li>
    </ol>
  </ul>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
  <!-- <script src="./scripts/basic.js"></script> -->
</body>
</html>
```

### Basic operations:
```javascript=
let article = {
  "articleName": "[神人]枇杷膏廣告女角",
  "articleDate": " 3/18",
  "articleLink": "https://www.ptt.cc/bbs/Beauty/M.1521305393.A.AD0.html",
  "articleAuthor": "vanillamint",
  "articlePush": 0,
  "articleImages": [
    "https://i.imgur.com/OJWdFT1.jpg",
    "https://i.imgur.com/NZbXl0y.jpg",
    "https://i.imgur.com/OiCnvu8.jpg",
    "https://i.imgur.com/sVkLdNZ.jpg"
  ]
};

let title = article.articleName;
let urls = article.articleImages;
console.log(urls);

let $title = $('.myTitle');
let $content = $('.myContent');
console.log($title.text(), ' + ', $title.siblings('ol').find('.myContent'));

$title.text(title);
let li = $content;
urls.forEach((url) => {
  let _li = li.clone(); // copy the existing <li>, then fill in the URL
  console.log(_li);
  _li.text(url);
  $('ol').append(_li);
});
```

## Part IV Attack of the Bug Corpses - 2018/03/29
URL: yoyohung.com/106-2NCHUIT-web/practice1.html

### Before you start:
Make sure the [sublime](#Basic-Sublime-Text-setup) packages are installed.

### Think last time's layout was too plain?
Let's introduce [Semantic UI](https://semantic-ui.com/usage/layout.html)

First, take a look at the result: https://github.com/jayhung97724/106-2NCHUIT-web/blob/master/demo.html

Include it in the `<head>`:
```htmlembedded=
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.css" />
```
:::success
See any difference!?
:::

### Layout
```htmlembedded=
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Practice 1</title>
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.css" />
  <link rel="stylesheet" href="./css/main.css">
</head>
<body>
  <div class="ui segment">
    <div class="ui centered huge header">Practice 1</div>
  </div>
  <div class="ui segment">
    <div class="ui styled accordion">
      <!-- <template id="articleTemplate"> -->
      <div class="title"><i class="dropdown icon"></i><span>title1</span></div>
      <div class="content">
        <p class="transition hidden">
          <div>
            <div class="imgBox">
              <img class="myImage" src="https://i.imgur.com/sVkLdNZ.jpg">
            </div>
          </div>
        </p>
      </div>
      <!-- </template> -->
    </div>
  </div>
  <!-- (.title>i.dropdown.icon^.content>p.transition.hidden)*3 -->

  <!-- Remote packages -->
  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.js"></script>
  <!-- Your own script -->
  <script src="./scripts/main.js"></script>
</body>
</html>
```

### It doesn't seem to do anything?
Something is missing:
```htmlmixed=
<script>
  $('.accordion').accordion();
</script>
```

### JS manipulation:
```javascript=
// Requires the <template id="articleTemplate"> wrapper in the HTML to be uncommented
let Beauties = [];
let $accordion = $('.accordion');

$.getJSON("./HotBeauties.json", (file) => {
  Beauties = file;
  console.log(Beauties);
  Beauties.forEach((b) => {
    let templateString = $('#articleTemplate').html();
    let template = $(templateString);
    // .title and .content are top-level nodes of the template, so pick them with filter()
    const title = template.filter('.title');
    const content = template.filter('.content');
    const span = title.find('span');
    span.text(b.articleName);
    const boxTag = content.find('.imgBox');
    const imgTag = content.find('img');
    b.articleImages.forEach(lnk => {
      const _imgTag = imgTag.clone();
      _imgTag.attr('src', lnk);
      boxTag.append(_imgTag);
    });
    imgTag.remove(); // drop the placeholder image that came with the template
    // console.log(title);
    // console.log(content);
    $accordion.append(title);
    $accordion.append(content);
  });
});
// $('.accordion').accordion();
```

### CSS styling
```css=
.header {
  margin-top: 32px !important;
}
.segment {
  display: flex;
  border: none !important;
  justify-content: center;
}
.imgBox {
  display: flex;
  justify-content: center;
  flex-direction: column;
}
.myImage {
  width: 70%;
  margin: 0 auto 24px;
  border-radius: 0.8rem;
  box-shadow: 4px 4px 16px -4px rgba(0, 0, 0, 0.8);
}
```
:::success
## So cute
:::

###### tags: `NCHUIT`

## Part V Hosting a web page with GitHub Pages 2018/04/12
### Step0 Create a GitHub account and submit the link
[Form link](http://bit.ly/2H0hDxk)

### Step1 Create a new repo
![](https://i.imgur.com/jTSEsgI.png)

### Step2 Enter the repo information
![](https://i.imgur.com/pC2OA42.png)

### Step3 Clone from the remote to your local machine
If your computer does not have git, [install](https://git-scm.com/book/zh/v2/%E8%B5%B7%E6%AD%A5-%E5%AE%89%E8%A3%85-Git) it first.
```shell
git clone [url]
```
![](https://i.imgur.com/CeCdkjf.png)
> You now have a completely empty repo on your local machine

![](https://i.imgur.com/VK9D6Wr.png)

### Step4 Update the page resources
> The data we worked so hard to crawl finally comes in handy
> [{For those who did not bring their code}](http://download1334.mediafire.com/8fzca5b0vcbg/be65zk3382h9s8q/hentai.zip)

![](https://i.imgur.com/CESGuJq.png)

### Step5 Upload the page resources to the remote
> Don't forget to cd into this repo!!
```bash=
git status
git add .
git commit -m "hello hentai"
git push
```
![](https://i.imgur.com/Ky2h0sx.png)

### Step6 Enable the GitHub Pages service
> Because our page resources are on the master branch
> ![](https://i.imgur.com/2oCgJyc.png)

![](https://i.imgur.com/OTKstyk.png)

### Step7 Published successfully
![](https://i.imgur.com/YkKWk07.png)

### Step8 Enjoy your Hentai page
```
Martian octopus monsters
~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~
 /|||\   /|||\   /|||\   /|||\   /|||\   /|||\
```

## Next, learn how to collaborate: send a pull request!
### Step 1. Go to the main repo you want to modify
https://github.com/jayhung97724/106-2NCHUIT-web/

Click the fork button in the upper-right corner to copy the whole project to your own account.
![](https://i.imgur.com/mUCzmJI.png)

### Step 2. Clone it to your local machine and edit
Open a terminal (CMD)
```shell=
# clone the link from the green button
git clone https://github.com/[Username]/106-2NCHUIT-web.git
# go into the folder
cd 106-2NCHUIT-web
```

### Step 3. In README.md, below Other club members’ work
Add your link using the example syntax.
```
- [Text](url)
```
But mind the order. Volunteers wanted to give it a try!

### Step 4.
```shell=
git status
git add .
git commit -m "Add so-and-so's project link"
git push
```

### Step 5. Go back to the original author's repo
Click new pull request

### Step 6. Wait for the original author to approve the merge
![](https://i.imgur.com/vnI2dcy.png)
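
## Supplement: why Part II waits for every request before writing
Part II says the file must be written only after all requests have finished, otherwise it ends up empty. The sketch below illustrates that ordering problem without touching the network. It is only an illustration under stated assumptions: `fakeFetch`, the `example.com` URLs, and the `promise-all-demo.json` file name are made up for this example, and the built-in `Promise.all` stands in for the `q.all` call used in the course code.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for a network request: resolves after a short delay.
const fakeFetch = (url, delayMs) =>
  new Promise((resolve) => {
    setTimeout(() => resolve({ url, images: [`${url}/1.jpg`] }), delayMs);
  });

// Made-up URLs, used only as labels here (nothing is actually requested).
const urls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c'
];

// Wrong order: serialize right away, before any request has resolved.
const tooEarly = [];
urls.forEach((url, i) => {
  fakeFetch(url, 10 * (i + 1)).then((page) => tooEarly.push(page));
});
const tooEarlyJson = JSON.stringify(tooEarly); // still an empty array at this point

// Right order: write only after every promise has resolved.
const outFile = path.join(os.tmpdir(), 'promise-all-demo.json');
const done = Promise.all(urls.map((url, i) => fakeFetch(url, 10 * (i + 1))))
  .then((pages) => {
    // pages keeps the same order as urls, regardless of which finished first
    fs.writeFileSync(outFile, JSON.stringify(pages));
    return pages;
  });

done.then((pages) => {
  console.log('too early:', tooEarlyJson);
  console.log(`wrote ${pages.length} pages to ${outFile}`);
});
```

The results come back in the same order as the input, which is why the Part II code can match `pages[pgI]` with `articleArr[pgI]` by index.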