# 106-2 Computer Club Course: Web Crawling
## This document: https://goo.gl/tsGU9p
---
## Part I Crawler - 2018/03/08
### Review
* [JavaScript](http://slides.com/hank1997/deck-1#/)
* [Repl.it](https://repl.it/)
### What you need to do:
* [Download and install Node.js](https://nodejs.org/en/)
---
### Reference:
* [蟲蟲危機 (A Bug's Life) slides](https://speakerdeck.com/aaaddress1/node-dot-js-chong-chong-wei-ji-shou-zuo-ni-de-di-ge-zi-dong-shang-wang-zi-liao-fen-xi-ji-qi-ren)
### Start:
1. Create a folder
2. Create a file with the `.js` extension
3. Put anything in it, e.g.:
```jsx=
console.log('1 + 1 = ', 1+1);
```
4. Run it with `node [filename]`
### Step 2:
1. [The `request` package](https://github.com/request/request)
2. Install it with `npm install request`
3. Example:
```javascript=
const request = require('request');
request('https://www.google.com', (error, response, body) => {
console.log('error: ', error);
console.log('statusCode: ', response && response.statusCode);
console.log(body);
});
```
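The callback passed to `request` follows Node's error-first convention: `error` is `null` on success, and `response` and `body` are only meaningful when there is no error. Here is a minimal sketch of that pattern without any network access. `fakeRequest` and `describeResult` are hypothetical stand-ins made up for this illustration and are not part of the `request` package.

```javascript
// fakeRequest is a hypothetical stand-in for request(); it never touches the network.
function fakeRequest(url, callback) {
  if (!url.startsWith('https://')) {
    // Failure: the first argument is an Error, the others are left undefined.
    callback(new Error('unsupported URL: ' + url));
    return;
  }
  // Success: error is null, followed by a response-like object and a body.
  callback(null, { statusCode: 200 }, '<html>hello</html>');
}

// Always check error first, then use response and body.
function describeResult(error, response, body) {
  if (error) {
    return 'failed: ' + error.message;
  }
  return response.statusCode + ' ' + body.length + ' characters';
}

let results = [];
fakeRequest('https://example.com', (error, response, body) => {
  results.push(describeResult(error, response, body));
});
fakeRequest('ftp://example.com', (error, response, body) => {
  results.push(describeResult(error, response, body));
});
console.log(results);
```

This is also why the example logs `response && response.statusCode`: when the request fails, `response` is undefined and reading `.statusCode` directly would throw.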
### Step 3:
1. Try crawling PTT: https://www.ptt.cc
2. Review tags and CSS selectors: https://flukeout.github.io/
3. How do we save what we crawled?
<!-- aaa
-->
```javascript=
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');
let HotBoardsJson = [];
let url = 'https://www.ptt.cc';
request(url + '/bbs/index.html', (error, response, body) => {
  console.log('error: ', error);
  console.log('statusCode: ', response && response.statusCode);
  // console.log('body: ', body);
  if (error) return; // nothing to parse when the request failed
  const $ = cheerio.load(body);
  const boardArr = $('a.board');
  for (let i = 0; i < boardArr.length; i++) {
    const boardName = $(boardArr[i]).find('.board-name').text();
    const boardClass = $(boardArr[i]).find('.board-class').text();
    const boardLink = url + $(boardArr[i]).attr('href');
    HotBoardsJson.push({ boardName: boardName, boardClass: boardClass, boardLink: boardLink });
    console.log(`${boardName}\n>> ${boardLink}\n>> Category: ${boardClass}\n`);
  }
  // Write the file here!
  fs.writeFile('HotBoards.json', JSON.stringify(HotBoardsJson), function (err) {
    if (err)
      console.log(err);
    else
      console.log('File ' + 'HotBoards.json' + ' written!');
  });
  console.log('Total: ' + HotBoardsJson.length + ' boards\n');
});
```
---
## Part II Attack on Crawler - 2018/03/14
### Before we start:
:::danger
### Repeat the previous step! [Install Node.js](#%E4%BD%A0%E9%9C%80%E8%A6%81%E5%81%9A%E7%9A%84%E6%98%AF%EF%BC%9A)
:::
* How do we save what we crawled?
1. Redirect the output with `>` + a filename when running the script
- [terminal output to a file](https://askubuntu.com/questions/420981/how-do-i-save-terminal-output-to-a-file)
2. Use the `fs` module to write a file.
* Explaining `() => {}`, a super handy ES6 syntax: [**arrow functions**](https://eyesofkids.gitbooks.io/javascript-start-from-es6/content/part4/arrow_function.html)
* [Template literals](https://developer.mozilla.org/zh-TW/docs/Web/JavaScript/Reference/Template_literals)
- Building strings for `console.log`
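
The two ES6 features above can be compared side by side with the older syntax. This is a small sketch; the `board` object is sample data made up for the illustration.

```javascript
// Classic function expression.
const addClassic = function (a, b) {
  return a + b;
};

// The same thing as an arrow function; a single expression is returned implicitly.
const add = (a, b) => a + b;

// String concatenation versus a template literal (note the backticks).
const board = { boardName: 'Beauty', boardLink: 'https://www.ptt.cc/bbs/Beauty/index.html' };
const withPlus = board.boardName + '\n>> ' + board.boardLink;
const withTemplate = `${board.boardName}\n>> ${board.boardLink}`;

console.log(`1 + 1 = ${add(1, 1)}`);
console.log(withTemplate);
```

Both strings are identical; the template literal is just easier to read once several values are mixed into one line.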
### Example: crawl the articles on the Beauty board's front page and save them as JSON:
**Give it a try!** ~~Voice from nowhere: I don't know what JSON is QQ~~
- See last semester's [notes](https://hackmd.io/ZFIcAv9HTXG8d1i_sBEbIA#JSON-%E6%A0%BC%E5%BC%8F)
- [An example :chestnut:](https://github.com/jayhung97724/106-2NCHUIT-web/blob/master/HotArticles.json)
- [More explanation](https://blog.wu-boy.com/2011/04/%E4%BD%A0%E4%B8%8D%E5%8F%AF%E4%B8%8D%E7%9F%A5%E7%9A%84-json-%E5%9F%BA%E6%9C%AC%E4%BB%8B%E7%B4%B9/)
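
In short, JSON is a text format for JavaScript-like values, so data can be written to a file and read back later. Here is a minimal round-trip sketch using only Node's built-in modules. The sample `articles` array and the `json-demo.json` filename are made up for this illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// A JavaScript value: an array of objects, shaped like the crawler output.
const articles = [
  { articleName: 'first post', articlePush: 3 },
  { articleName: 'second post', articlePush: 0 }
];

// JSON.stringify turns the value into a string that can be written to disk.
const text = JSON.stringify(articles, null, 2);

// Write to the temporary folder so the sketch does not clutter the project.
const file = path.join(os.tmpdir(), 'json-demo.json');
fs.writeFileSync(file, text);

// JSON.parse turns the text back into a JavaScript value.
const loaded = JSON.parse(fs.readFileSync(file, 'utf8'));
console.log(loaded[0].articleName);
console.log(typeof text, Array.isArray(loaded));
```

The extra arguments `null, 2` only pretty-print the output with two-space indentation; `JSON.stringify(articles)` alone produces the same data on a single line.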
```javascript=
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');
let HotArticles = [];
let url = 'https://www.ptt.cc';
request(url + '/bbs/Beauty/index.html', (error, response, body) => {
  console.log('error: ', error);
  console.log('statusCode: ', response && response.statusCode);
  // console.log('body: ', body);
  if (error) return; // nothing to parse when the request failed
  const $ = cheerio.load(body);
  const articleArr = $('div.r-ent');
  // console.log(articleArr[0]);
  // The last 5 entries are skipped (presumably the pinned posts at the bottom of the list)
  for (let i = 0; i < articleArr.length - 5; i++) {
    const articleName = $(articleArr[i]).find('.title').find('a').text();
    const articlePush = $(articleArr[i]).find('.nrec').find('.f2').text();
    const articleDate = $(articleArr[i]).find('.meta').find('.date').text();
    const articleAuthor = $(articleArr[i]).find('.meta').find('.author').text();
    const articleHref = $(articleArr[i]).find('.title').find('a').attr('href');
    // e.g. https://www.ptt.cc/bbs/Beauty/M.1520378107.A.181.html
    // Entries without a link (href is undefined) are skipped
    if (articleHref !== undefined) {
      const articleLink = url + articleHref;
      HotArticles.push({
        articleName: articleName,
        articleDate: articleDate,
        articleLink: articleLink,
        articleAuthor: articleAuthor,
        articlePush: (articlePush == "") ? 0 : parseInt(articlePush, 10)
      });
      console.log(`${articleName}\n>> ${articleLink}\n>> Date: ${articleDate}\n>> Author: ${articleAuthor}\n>> Pushes: ${articlePush}\n`);
    }
  }
  fs.writeFile('HotArticles.json', JSON.stringify(HotArticles), function (err) {
    if (err)
      console.log(err);
    else
      console.log('File ' + 'HotArticles.json' + ' written!');
  });
  console.log('Total: ' + HotArticles.length + ' articles\n');
});
```
### Next: how to get the image links!
#### 1. First, [Inspect](https://www.ptt.cc/bbs/Beauty/M.1521033451.A.439.html) the page elements
Try one:
```javascript=
var cheerio = require('cheerio'); // Basically jQuery for node.js
var request = require('request-promise');
var options = {
  uri: 'https://www.ptt.cc/bbs/Beauty/M.1521110977.A.EE0.html',
  // Parse the response body with cheerio before the promise resolves
  transform: function (body) {
    return cheerio.load(body);
  }
};
request(options)
  .then(function ($) {
    // Process html like you would with jQuery...
    console.log($('#main-content a'))
  })
  .catch(function (err) {
    // Crawling failed or Cheerio choked...
    console.error(err);
  });
```
#### 2. Handling multiple requests
Q: Why do we need this?
A: Requests take time. The file should only be written after every request has finished fetching its image links; otherwise the written file will not contain them.
* [Introduction to Promises](https://eyesofkids.gitbooks.io/javascript-start-es6-promise/content/contents/basic_usage.html)
* [Another Promise example](http://www.oxxostudio.tw/articles/201706/javascript-promise-settimeout.html)
* [Combining with request-promise](https://www.npmjs.com/package/request-promise)
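
The "wait for everyone to finish" idea can be tried without the network. The sketch below uses the built-in `Promise.all`, which plays the same role as `q.all`. `fakeFetch` is a hypothetical stand-in for a request that simply resolves after a delay.

```javascript
// fakeFetch is a hypothetical stand-in for a request: it resolves after `ms` milliseconds.
const fakeFetch = (name, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`${name} done`), ms));

let finished = [];

// Promise.all resolves only after every promise has resolved,
// and keeps the results in the same order as the input array.
const allDone = Promise.all([
  fakeFetch('page A', 30),
  fakeFetch('page B', 10),
  fakeFetch('page C', 20)
]).then((results) => {
  finished = results;
  console.log(results);
  return results;
});

// This line runs first: the requests above have not finished yet.
console.log('still waiting, finished =', finished);
```

Anything that depends on all the results, such as writing the JSON file, belongs inside the `.then` callback; code placed after the `Promise.all` call runs before any request has completed.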
```javascript=
const request = require('request-promise');
const cheerio = require('cheerio');
const q = require('q');
const fs = require('fs');
let articleArr = JSON.parse(fs.readFileSync('./HotArticles.json', 'utf8'));
let urls = [];
for (let index = 0; index < articleArr.length; index++) {
  // A request-promise options object: fetch the article and parse it with cheerio
  urls.push({
    uri: articleArr[index].articleLink,
    transform: (body) => {
      return cheerio.load(body);
    }
  });
}
console.log(urls);
let fetchPages = (urls) => {
  let promises = [];
  // forEach is more concise than an index-based loop
  urls.forEach((url) => {
    promises.push(request(url));
  });
  // q is a promise library (not a queue); q.all resolves once every request has finished
  return q.all(promises);
}
let writeJSON = () => {
  console.log(articleArr);
  fs.writeFile('HotBeauties.json', JSON.stringify(articleArr), function (err) {
    if (err)
      console.log(err);
    else
      console.log('File ' + 'HotBeauties.json' + ' written!');
  });
}
fetchPages(urls).then((pages) => {
  console.log(pages.length);
  // e.g. pages[0]('#main-content a')[0].attribs.href
  for (let pgI = 0; pgI < pages.length; pgI++) {
    let $ = pages[pgI];
    // console.log($)
    let imgLinks = [];
    let imgArr = $('#main-content a');
    // console.log(imgArr.length)
    for (let imgI = 0; imgI < imgArr.length; imgI++) {
      let imgUrl = imgArr[imgI].attribs.href;
      // Keep only links that contain .jpg; skip missing or empty hrefs
      if (imgUrl && imgUrl.indexOf('.jpg') !== -1) {
        // console.log(imgUrl)
        imgLinks.push(imgUrl);
      }
    }
    articleArr[pgI].articleImages = imgLinks;
  }
  writeJSON();
}).catch((err) => {
  // If any single request fails, q.all rejects and we end up here
  console.error(err);
});
```
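
The link-filtering step in the code above can also be pulled out into a small pure function, which makes it easy to try without crawling anything. This is only a sketch: `pickJpgLinks` is a hypothetical helper name, `sampleHrefs` is made-up sample data, and the check is slightly stricter than the crawler's (the link must end with `.jpg`, in any letter case, instead of merely containing it).

```javascript
// Keep only links that look like .jpg images; skip missing or empty hrefs.
const pickJpgLinks = (hrefs) =>
  hrefs.filter((href) => typeof href === 'string' && href.toLowerCase().endsWith('.jpg'));

// Made-up sample of what the href attributes on an article page might look like.
const sampleHrefs = [
  'https://i.imgur.com/OJWdFT1.jpg',
  undefined,
  'https://www.ptt.cc/bbs/Beauty/M.1521305393.A.AD0.html',
  '',
  'https://i.imgur.com/NZbXl0y.JPG'
];

console.log(pickJpgLinks(sampleHrefs));
```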
**Supplementary: an introduction to queues:**
{%youtube UpvDOm3prfI %}
<!-- aaa
-->
## Part III The Handsome Crawler - 2018/03/22
```
Martian octopus monsters
~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~
/|||\ /|||\ /|||\ /|||\ /|||\ /|||\
```
### Before we start:
:::success
Download what you made in the previous sessions,
or download from [My repo](https://github.com/jayhung97724/106-2NCHUIT-web)
:::
### Basic Sublime Text setup:
1. `ctrl` + `shift` + `P`
2. Type `PC`
3. Install `Package Control`
4. `ctrl` + `shift` + `P`
5. Type `PCI` (Package Control: Install Package) and install `emmet`
### Display
#### Basic list:
```htmlembedded=
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>BasicList</title>
</head>
<body>
<ul>
<li class='myTitle'>title on web</li>
<ol>
<li class='myContent'>content on web</li>
</ol>
</ul>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<!-- <script src="./scripts/basic.js"></script> -->
</body>
</html>
```
### Basic operations:
```javascript=
let article = {
"articleName": "[神人]枇杷膏廣告女角",
"articleDate": " 3/18",
"articleLink": "https://www.ptt.cc/bbs/Beauty/M.1521305393.A.AD0.html",
"articleAuthor": "vanillamint",
"articlePush": 0,
"articleImages": [
"https://i.imgur.com/OJWdFT1.jpg",
"https://i.imgur.com/NZbXl0y.jpg",
"https://i.imgur.com/OiCnvu8.jpg",
"https://i.imgur.com/sVkLdNZ.jpg"
]
}
let title = article.articleName;
let urls = article.articleImages;
console.log(urls);
let $title = $('.myTitle');
let $content = $('.myContent');
console.log($title.text(), ' + ',$title.siblings('ol').find('.myContent'));
$title.text(title);
let li = $content;
urls.forEach((url) => {
  // Clone the template <li>, fill in the image URL, and append it to the list
  let _li = li.clone();
  console.log(_li);
  _li.text(url);
  $('ol').append(_li);
});
```
## Part IV Attack of the Bug Corpses - 2018/03/29
URL: yoyohung.com/106-2NCHUIT-web/practice1.html
### Before you start:
Make sure the [Sublime](#基礎的-sublime-text-設定) packages are installed
### Think last time's layout was too plain?
Let's introduce [Semantic UI](https://semantic-ui.com/usage/layout.html)
Take a look at this result first:
https://github.com/jayhung97724/106-2NCHUIT-web/blob/master/demo.html
Include it in the `<head>`:
```htmlembedded=
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.css" />
```
:::success
Can you see the difference!?
:::
### Layout template
```htmlembedded=
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Practice 1</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.css" />
<link rel="stylesheet" href="./css/main.css">
</head>
<body>
<div class="ui segment">
<div class="ui centered huge header">Practice 1</div>
</div>
<div class="ui segment">
<div class="ui styled accordion">
<!-- <template id="articleTemplate"> -->
<div class="title"><i class="dropdown icon"></i><span>title1</span></div>
<div class="content">
<p class="transition hidden">
<div>
<div class="imgBox">
<img class="myImage" src="https://i.imgur.com/sVkLdNZ.jpg">
</div>
</div>
</p>
</div>
<!-- </template> -->
</div>
</div>
<!-- (.title>i.dropdown.icon^.content>p.transition.hidden)*3 -->
<!-- Remote libraries -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.3.1/semantic.min.js"></script>
<!-- Our own script -->
<script src="./scripts/main.js"></script>
</body>
</html>
```
### It doesn't seem to do anything??
Something is missing: the accordion has to be initialized.
```htmlmixed=
<script>
$('.accordion').accordion();
</script>
```
### Manipulating the page with JS:
```javascript=
let Beauties = [];
let $accordion = $('.accordion');
// Note: this needs the <template id="articleTemplate"> tags in the HTML
// to be uncommented; otherwise $('#articleTemplate') matches nothing.
$.getJSON("./HotBeauties.json", (file) => {
  Beauties = file;
  console.log(Beauties);
  Beauties.forEach((b) => {
    // Build a fresh copy of the template for each article
    let templateString = $('#articleTemplate').html();
    let template = $(templateString);
    let title = template.siblings('.title');
    let content = template.siblings('.content');
    let span = title.find('span');
    span.text(b.articleName);
    let boxTag = content.find('.imgBox');
    let imgTag = content.find('img');
    // One <img> per image link, cloned from the placeholder image
    b.articleImages.forEach(lnk => {
      let _imgTag = imgTag.clone();
      _imgTag.attr('src', lnk);
      boxTag.append(_imgTag);
    });
    // Remove the placeholder image that came with the template
    imgTag.remove();
    // console.log(title);
    // console.log(content);
    $accordion.append(title);
    $accordion.append(content);
  });
});
// $('.accordion').accordion();
```
### CSS styling
```css=
.header {
margin-top: 32px !important;
}
.segment {
display: flex;
border: none !important;
justify-content: center;
}
.imgBox {
display: flex;
justify-content: center;
flex-direction: column;
}
.myImage {
width:70%;
margin: 0 auto 24px;
border-radius: 0.8rem;
box-shadow: 4px 4px 16px -4px rgba(0, 0, 0, 0.8);
}
```
:::success
## So cute!
:::
###### tags: `NCHUIT`
## Part V Hosting a website with GitHub Pages - 2018/04/12
### Step0 Create a GitHub account and submit your link
[Form link](http://bit.ly/2H0hDxk)
### Step1 Create a new repo
![](https://i.imgur.com/jTSEsgI.png)
### Step2 Enter the repo information
![](https://i.imgur.com/pC2OA42.png)
### Step3 Clone from the remote to your local machine
If your computer doesn't have git, [install](https://git-scm.com/book/zh/v2/%E8%B5%B7%E6%AD%A5-%E5%AE%89%E8%A3%85-Git) it first
```shell
git clone [url]
```
![](https://i.imgur.com/CeCdkjf.png)
> You now have a completely empty repo on your local machine
![](https://i.imgur.com/VK9D6Wr.png)
### Step4 Add the website files
> The data we worked so hard to crawl finally comes in handy
> [{Lifesaver for those who didn't bring their code}](http://download1334.mediafire.com/8fzca5b0vcbg/be65zk3382h9s8q/hentai.zip)
![](https://i.imgur.com/CESGuJq.png)
### Step5 Push the website files to the remote
> Don't forget to `cd` into this repo!!
```bash=
git status
git add .
git commit -m "hello hentai"
git push
```
![](https://i.imgur.com/Ky2h0sx.png)
### Step6 Enable the GitHub Pages service
> Our website files are on the `master` branch, so choose it as the source
> ![](https://i.imgur.com/2oCgJyc.png)
![](https://i.imgur.com/OTKstyk.png)
### Step7 Published successfully
![](https://i.imgur.com/YkKWk07.png)
### Step8 Enjoy your Hentai web page
```
Martian octopus monsters
~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~ ~(@o@)~
/|||\ /|||\ /|||\ /|||\ /|||\ /|||\
```
## Next, learn how to collaborate by sending a pull request!
### Step 1. Go to the main repo you want to modify
https://github.com/jayhung97724/106-2NCHUIT-web/
Click the fork button in the upper right to copy the whole project to your own account.
![](https://i.imgur.com/mUCzmJI.png)
### Step 2. Clone it back to your local machine to edit
Open a terminal (CMD)
```shell=
# clone using the URL from the green button
git clone https://github.com/[Username]/106-2NCHUIT-web.git
# enter the folder
cd 106-2NCHUIT-web
```
### Step 3. In README.md
Below "Other club members' work",
add your link using the example syntax.
```
- [Text](url)
```
Mind the order, though! Looking for a volunteer to try it!
### Step 4. Commit and push your change
```shell=
git status
git add .
git commit -m "Add so-and-so's work link"
git push
```
### Step 5. Go back to the original author's repo and click new pull request
### Step 6. Wait for the original author to approve the merge
![](https://i.imgur.com/vnI2dcy.png)