# Building a LineBot Chatbot (Python + Flask + Heroku), Part 2-1: Fetching Photos from the PTT Beauty Board on Command (Scraper)
###### tags: `LineBot`
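### Packages used
* BeautifulSoup
* requests

When the bot receives a specific command, it scrapes images from the [Beauty board](https://www.ptt.cc/bbs/Beauty/index.html) and sends one back to the chat window. It is set up to crawl the images in every post on the first three pages and return one photo at random (the `get_picture` function at the end of this post walks two pages).
### Import the required modules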
```
import json
import os
import requests
import urllib.parse
from bs4 import BeautifulSoup
import random
```
## Collect the image links from every post on the current page and return one at random
### Access the Beauty board with Python
```
def looking_picture(url):
    base_url = 'https://www.ptt.cc'
    # over18=1 is the cookie that answers PTT's age-confirmation prompt
    header = {
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0",
        "Cookie": "_ga=GA1.2.1992706948.1582614518; __cfduid=d3c8d86602470d05242b056fe538d18781595137978; over18=1; __cf_bm=cc1e8fa95958d9231e1c7808f60eeebcf3a131d2-1597193458-1800-Ado72sxElw7hsP5j9giK2ykFqh8lHb109qe5PYqE/twVKIQhjxIq56FNvCHAWWT4VFF/t56l5mEbWAAY0AAHePM=; _gid=GA1.2.25046529.1597193460"
    }
    html = requests.get(url, headers=header).text
    soup = BeautifulSoup(html, 'lxml')  # Parse the page with BeautifulSoup
```
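The long `Cookie` string above was copied from a browser session. A minimal sketch, assuming `over18=1` is the only cookie the age check needs, passes just that cookie and adds a timeout and a status check. The names `PTT_HEADERS`, `PTT_COOKIES`, and `fetch_html` are hypothetical and not part of the original code.

```python
import requests

# Assumption: a generic User-Agent plus the over18 cookie is enough for the board.
PTT_HEADERS = {"User-Agent": "Mozilla/5.0"}
PTT_COOKIES = {"over18": "1"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(
        url, headers=PTT_HEADERS, cookies=PTT_COOKIES, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

With this helper, `fetch_html('https://www.ptt.cc/bbs/Beauty/index.html')` could stand in for the `requests.get(...).text` call, and no session-specific cookies end up in the source.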
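### Get the link to each post
Open the Beauty board and examine the page with the browser's Inspect Element tool.

The inspector shows that each post's link sits inside a `div` tag with `class="title"`, so the following code collects those `div` elements, whose links are relative paths.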
```
def looking_picture(url):
    ...
    # Every post entry on the index page is wrapped in <div class="title">
    pptdivs = soup.find_all('div', {'class': 'title'})
```
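To see what `find_all` returns here without hitting the network, the same selection can be run on a small HTML string. This is only a sketch: `SAMPLE_HTML` is a hypothetical, simplified imitation of the index page markup, and `extract_post_paths` is an illustrative helper, not part of the original code. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one normal post and one entry without a link.
SAMPLE_HTML = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Beauty/M.1.A.001.html">sample post</a></div>
</div>
<div class="r-ent">
  <div class="title">(post removed)</div>
</div>
"""


def extract_post_paths(html):
    """Return the relative href of every post that still has a link."""
    soup = BeautifulSoup(html, "html.parser")
    paths = []
    for div in soup.find_all("div", {"class": "title"}):
        link = div.find("a")
        if link and link.get("href"):  # entries without <a> are skipped
            paths.append(link["href"])
    return paths
```

Calling `extract_post_paths(SAMPLE_HTML)` returns only the first entry's path, which is why the scraping loop needs a check for a missing `<a>` tag.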
Visit every post and collect all the links it contains. Links that start with `https://i.imgur` are stored in the `photos` list, and `random.choice` then picks one photo to return.
```
def looking_picture(url):
    ...
    photos = []
    # Open each post and collect the image links it contains
    for post in pptdivs:
        if post.find('a'):  # skip entries without a link (e.g. deleted posts)
            post_link = base_url + post.find('a')['href']
            post_html = requests.get(post_link, headers=header).text
            post_soup = BeautifulSoup(post_html, 'lxml')
            main_container = post_soup.find_all('div', {'id': 'main-container'})
            url_photos = main_container[0].find_all('a')
            for photo in url_photos:
                href_photo = photo.get('href', '')  # some <a> tags carry no href
                if href_photo.startswith('https://i.imgur'):
                    photos.append(href_photo)
    # Note: random.choice raises IndexError if no image link was found
    the_photo = random.choice(photos)
    return the_photo
```
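The filtering and random pick can be separated from the network code, which makes the empty-list case easy to handle. The sketch below assumes that returning `None` is acceptable when no image is found. `filter_imgur_links` and `pick_photo` are hypothetical helper names, and the sample URLs are made up.

```python
import random


def filter_imgur_links(hrefs):
    """Keep only the links that point at i.imgur image URLs."""
    return [h for h in hrefs if h and h.startswith("https://i.imgur")]


def pick_photo(hrefs):
    """Return one random imgur link, or None when there is nothing to pick."""
    photos = filter_imgur_links(hrefs)
    return random.choice(photos) if photos else None


# Made-up links of the kinds a post page might contain
sample_links = [
    "https://i.imgur.com/aaa.jpg",
    "https://www.ptt.cc/bbs/Beauty/index.html",
    None,
    "https://i.imgur.com/bbb.jpg",
]
```

Here `pick_photo(sample_links)` returns one of the two imgur URLs, while `pick_photo([])` returns `None` instead of raising `IndexError`.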
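## Previous page
Scraping a single page feels a bit thin, so this function lets the program move to the previous page.
### Get the URL of the previous page

Use Inspect to find where the tag for the previous-page button sits.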
```
def ptt_last_page(url):
    base_url = 'https://www.ptt.cc'
    header = {
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0",
        "Cookie": "_ga=GA1.2.1992706948.1582614518; __cfduid=d3c8d86602470d05242b056fe538d18781595137978; over18=1; __cf_bm=cc1e8fa95958d9231e1c7808f60eeebcf3a131d2-1597193458-1800-Ado72sxElw7hsP5j9giK2ykFqh8lHb109qe5PYqE/twVKIQhjxIq56FNvCHAWWT4VFF/t56l5mEbWAAY0AAHePM=; _gid=GA1.2.25046529.1597193460"
    }
    ptthtml = requests.get(url, headers=header).text
    soup = BeautifulSoup(ptthtml, 'lxml')
    # The paging buttons live in the div with class "btn-group-paging"
    div_page = soup.find('div', 'btn-group-paging')
    # The second link in that group is the previous-page button
    last_page = div_page.find_all('a')[1]['href']
    last_page_link = base_url + last_page
    return last_page_link
```
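Concatenating `base_url + href` works because the hrefs on the board are site-relative paths. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves both relative and already-absolute links. `absolute_url` is a hypothetical helper name, and the page number in the usage note is made up.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


INDEX_URL = "https://www.ptt.cc/bbs/Beauty/index.html"
```

For example, `absolute_url(INDEX_URL, '/bbs/Beauty/index3999.html')` gives `https://www.ptt.cc/bbs/Beauty/index3999.html`, and an href that is already absolute, such as an imgur link, is returned unchanged.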
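## Main function: run the two functions above and return the selected image URL
### Run the pair once for every page you want to crawl. In testing, crawling just two pages took about 30 seconds, so the scraper's speed still needs optimizing (future work)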
```
def get_picture():
    url = 'https://www.ptt.cc/bbs/Beauty/index.html'
    gallery = []
    # One random photo from the newest page
    pic1 = looking_picture(url)
    gallery.append(pic1)
    # Move to the previous page and take one random photo from it
    url = ptt_last_page(url)
    print('=======================================================' + url)
    pic2 = looking_picture(url)
    gallery.append(pic2)
    print(len(gallery))
    the_choice_picture = random.choice(gallery)
    return the_choice_picture
```
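Repeating the two calls by hand for every extra page gets tedious, so the same pattern can be written as a loop. This is a sketch rather than the original implementation: `crawl_pages` is a hypothetical name, and the two page functions are passed in as arguments so the loop can be tried with stand-ins before any real request is made.

```python
def crawl_pages(start_url, pages, get_photo, get_previous_url):
    """Collect one photo per page, walking backwards from start_url.

    get_photo(url) returns a photo URL (or None) for one index page;
    get_previous_url(url) returns the URL of the page before it.
    """
    gallery = []
    url = start_url
    for page_index in range(pages):
        photo = get_photo(url)
        if photo:  # skip pages that produced no photo
            gallery.append(photo)
        if page_index < pages - 1:  # no need to look past the last page
            url = get_previous_url(url)
    return gallery
```

With the functions from this post, `crawl_pages('https://www.ptt.cc/bbs/Beauty/index.html', 3, looking_picture, ptt_last_page)` would visit three pages, and `random.choice` on the result picks the final photo, provided the list is not empty.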
### All that remains is to hook the returned image up to the LineBot, which is left for the next post.