# Facebook Crawler (Python + Selenium)
## Environment
:::info
VS Code
Version: 1.45.0 (user setup)
Electron: 7.2.4
Chrome: 78.0.3904.130
Node.js: 12.8.1
V8: 7.8.279.23-electron.0
OS: Windows_NT x64 10.0.18363
Python v3.7.3
:::
## Crawling Static Web Pages (Python `import requests`)
> [Static web crawling: concept overview](https://ithelp.ithome.com.tw/articles/10185864) by [iT邦幫忙, author: v123582](https://ithelp.ithome.com.tw/users/20091368/ironman)
### Setting Up the Environment
#### Install Selenium
Selenium is mainly used to automate a web browser.
> [Selenium in detail](https://learngeb-ebook.readbook.tw/intro/selenium.html)
>
```pip install selenium```
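
Because `pip install selenium` installs whatever release is current, it can help to confirm which version you ended up with before running any crawler code. The snippet below is a minimal sketch; `installed_version` is a hypothetical helper name, not part of Selenium.

```python
import importlib


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Some packages do not expose __version__; report that instead of failing.
    return getattr(module, "__version__", "unknown")


print("selenium:", installed_version("selenium"))
```

If this prints `None`, the install did not land in the interpreter you are running.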
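#### Download WebDriver
:::spoiler About WebDriver
WebDriver is an API for launching and driving a browser; a program controls the browser through it. WebDriver implementations exist for the major browsers, including Chrome, Safari, and Firefox. This note uses chromedriver as the example.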
:::
> **Step 1. Check your Google Chrome version**
>


::: success
Shortcut: type this directly into Chrome's address bar:
```chrome://settings/help```
:::

> **Step 2. Download ChromeDriver**
> [Download page](https://chromedriver.chromium.org/)
>
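
Step 1 matters because ChromeDriver releases are tied to Chrome's major version, so the driver you download should share the first number of your browser's version string. The sketch below compares two version strings; the helper names and the sample version values are illustrative assumptions, so substitute the versions you actually see.

```python
def major_version(version):
    """Return the leading number of a dotted version string such as '78.0.3904.130'."""
    return int(version.split(".")[0])


def driver_matches_browser(chrome_version, driver_version):
    """True when both version strings share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Sample values for illustration only.
print(driver_matches_browser("78.0.3904.130", "78.0.3904.105"))
print(driver_matches_browser("78.0.3904.130", "81.0.4044.69"))
```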

:::info
Remember where you saved the driver; the crawler script needs this path.
:::
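
A quick way to confirm the location you noted is to check that the file exists before handing the path to Selenium. This is a minimal sketch: `resolve_driver_path` is a hypothetical helper, and it assumes that on Windows the downloaded file may carry an `.exe` suffix that the path you wrote down omits.

```python
import os


def resolve_driver_path(path):
    """Return the existing driver file for `path`, also trying a '.exe' suffix, or None."""
    for candidate in (path, path + ".exe"):
        if os.path.isfile(candidate):
            return candidate
    return None


# The path used later in this note; replace it with your own location.
driver_path = resolve_driver_path(r"D:/SHIN/Reserch/Crawler/chromedriver")
if driver_path is None:
    print("chromedriver not found; check the path you noted down")
else:
    print("using", driver_path)
```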
### Start Crawling
#### Example: a Google Images search for a specific subject
> **Step 1. Search Google Images for "柴犬" (Shiba Inu) and copy the URL**
>
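
Instead of copying the URL by hand, the search URL can also be assembled in code. The sketch below keeps only the `q`, `tbm=isch`, and `hl` parameters that appear in the copied URL; that this shorter URL returns the same results page is an assumption, and `build_image_search_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_image_search_url(keyword, lang="zh-TW"):
    """Build a Google Images search URL; urlencode percent-encodes the keyword."""
    params = {"q": keyword, "tbm": "isch", "hl": lang}
    return "https://www.google.com.tw/search?" + urlencode(params)


url = build_image_search_url("柴犬")
print(url)
# Decoding the query string recovers the original keyword.
print(parse_qs(urlparse(url).query)["q"][0])
```

The `%E6%9F%B4%E7%8A%AC` in the copied URL is simply the percent-encoded UTF-8 form of 柴犬.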

#### Coding ([Crawler.py](https://github.com/rushshin/Craw-Learning/blob/master/crawler.py))
> **Step 2. Write the crawler**
> ::: danger
> **What you need to change**
> 1. Image save location (`local_path`)
> 2. chromedriver location (`chromeDriver`)
>
> :::
``` python
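import os
import time
import urllib.request  # `import urllib` alone does not load the request submodule

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Where to save the images
local_path = 'imgs'
os.makedirs(local_path, exist_ok=True)  # urlretrieve fails if the folder is missing
# URL of the page to crawl
url = 'https://www.google.com.tw/search?q=%E6%9F%B4%E7%8A%AC&tbm=isch&hl=zh-TW&nfpr=1&hl=zh-TW&ved=2ahUKEwjNpcmLpbLpAhWKw4sBHb21CIQQBXoECAEQKA&biw=1519&bih=722'
# XPath of the target <img> elements; it depends on the page's markup
xpath = '//div[@id="imgid"]/ul/li/a/img'
# Launch the Chrome browser
chromeDriver = r'D:/SHIN/Reserch/Crawler/chromedriver'  # where the chromedriver file lives
# Selenium 4 form; on Selenium 3 this was webdriver.Chrome(chromeDriver)
driver = webdriver.Chrome(service=Service(executable_path=chromeDriver))
# Maximize the window, because each pass only sees the images inside the viewport
driver.maximize_window()
# Image URLs already downloaded, to avoid fetching the same image twice
img_url_dic = {}
# Open the page in the browser
driver.get(url)
# Simulate scrolling the window to load more images
pos = 0
m = 0  # image counter
for i in range(100):
    pos += i * 500  # the step grows by 500 px each pass, so scrolling accelerates
    js = "document.documentElement.scrollTop=%d" % pos
    driver.execute_script(js)
    time.sleep(1)
    # Selenium 4 form; on Selenium 3 this was find_elements_by_xpath(xpath)
    for element in driver.find_elements(By.XPATH, xpath):
        try:
            img_url = element.get_attribute('src')
            # Save the image only if it has not been seen before
            if img_url is not None and img_url not in img_url_dic:
                img_url_dic[img_url] = ''
                m += 1
                ext = img_url.split('/')[-1]
                filename = str(m) + 'kerGee' + '_' + ext + '.jpg'
                print(filename)
                # Download the image into local_path
                urllib.request.urlretrieve(img_url, os.path.join(local_path, filename))
        except OSError:
            print('OSError occurred!')
            print(pos)
            break  # stop processing this batch of elements
driver.quit()  # close the browser and stop the chromedriver process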
```
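
The script builds each filename from the last path segment of the image URL. That segment may contain characters such as `?` or `:` that Windows rejects in filenames, and an inline `data:` URI has no usable segment at all, which is one plausible source of the `OSError` the script catches. The sketch below is one possible workaround rather than part of the original script; `safe_filename` and its parameters are hypothetical names.

```python
import hashlib
import re


def safe_filename(index, img_url, prefix="kerGee", max_stem=40):
    """Build '<index><prefix>_<stem>.jpg' using only filesystem-safe characters."""
    if img_url.startswith("data:"):
        # Inline data URI: there is no path to borrow a name from, so hash the content.
        stem = hashlib.md5(img_url.encode("utf-8")).hexdigest()[:12]
    else:
        last_segment = img_url.split("/")[-1]
        # Replace anything outside letters, digits, '.', '-' and '_' with '_'.
        stem = re.sub(r"[^A-Za-z0-9._-]", "_", last_segment)[:max_stem]
    return "%d%s_%s.jpg" % (index, prefix, stem)


# The example URL is made up for illustration.
print(safe_filename(1, "https://example.com/images?q=tbn:abc"))
```

In the crawler, the line that assigns `filename` could call this helper with `m` and `img_url` instead of concatenating the raw segment.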