
# Facebook Crawler (Python + Selenium)

## Environment

:::info
VS Code
Version: 1.45.0 (user setup)
Electron: 7.2.4
Chrome: 78.0.3904.130
Node.js: 12.8.1
V8: 7.8.279.23-electron.0
OS: Windows_NT x64 10.0.18363

Python v3.7.3
:::

## Scraping static pages (Python `import requests`)

> [Static web scraping concepts](https://ithelp.ithome.com.tw/articles/10185864) by [iT邦幫忙 author: v123582](https://ithelp.ithome.com.tw/users/20091368/ironman)

### Setting up the environment

#### Install Selenium

Selenium is mainly used to automate the browser.
> [Detailed introduction to Selenium](https://learngeb-ebook.readbook.tw/intro/selenium.html)
> ```
> pip install selenium
> ```

#### Download WebDriver

:::spoiler About WebDriver
WebDriver is an API for launching and operating a browser; a program controls the browser through WebDriver. Drivers exist for the mainstream browsers, including Chrome, Safari, and Firefox. This guide uses chromedriver as the example.
:::

> **Step 1. Check your Google Chrome version**
> ![](https://i.imgur.com/QM3uz7i.png)

![](https://i.imgur.com/92Sf96O.png)

::: success
Shortcut: enter ```chrome://settings/help``` directly in the Chrome address bar.
:::

![](https://i.imgur.com/Z2BeELQ.png)

> **Step 2. Download ChromeDriver**
> [Download page](https://chromedriver.chromium.org/)
> ![](https://i.imgur.com/2q5JWEK.png)

:::info
Remember where you saved it.
:::

### Start crawling

#### Example: searching Google for specific images

> **Step 1. Search Google Images for "柴犬" (Shiba Inu) and copy the URL**
> ![](https://i.imgur.com/mBEtHbL.jpg)

#### Coding ([Crawler.py](https://github.com/rushshin/Craw-Learning/blob/master/crawler.py))

> **Step 2. Write the crawler**
> ::: danger
> **Parts you need to modify**
> 1. Image save location (local_path)
> 2. chromedriver location (chromeDriver)
>
> :::

``` python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import urllib.request
import os

# Directory where images are saved (created if it does not exist yet)
local_path = 'imgs'
os.makedirs(local_path, exist_ok=True)

# URL of the page to crawl
url = 'https://www.google.com.tw/search?q=%E6%9F%B4%E7%8A%AC&tbm=isch&hl=zh-TW&nfpr=1&hl=zh-TW&ved=2ahUKEwjNpcmLpbLpAhWKw4sBHb21CIQQBXoECAEQKA&biw=1519&bih=722'

# XPath of the target elements
xpath = '//div[@id="imgid"]/ul/li/a/img'

# Launch Chrome
chromeDriver = r'D:/SHIN/Reserch/Crawler/chromedriver'  # location of the chromedriver file
# Selenium 4 takes the driver path through Service;
# Selenium 3 accepted webdriver.Chrome(chromeDriver) directly.
driver = webdriver.Chrome(service=Service(chromeDriver))

# Maximize the window, because each pass only sees the images inside the viewport
driver.maximize_window()

# Record the image URLs already downloaded, to avoid downloading them twice
img_url_dic = {}

# Open the page in the browser
driver.get(url)

# Simulate scrolling the window to load more images
pos = 0
m = 0  # image counter
for i in range(100):
    pos += 500  # scroll down 500 px each time
    js = "document.documentElement.scrollTop=%d" % pos
    driver.execute_script(js)
    time.sleep(1)

    for element in driver.find_elements(By.XPATH, xpath):
        try:
            img_url = element.get_attribute('src')

            # Save the image to the target directory
            if img_url is not None and img_url not in img_url_dic:
                img_url_dic[img_url] = ''
                m += 1
                # print(img_url)
                ext = img_url.split('/')[-1]  # last '/'-separated piece of the URL
                # print(ext)
                filename = str(m) + 'kerGee' + '_' + ext + '.jpg'
                print(filename)

                # Download and save the image
                urllib.request.urlretrieve(img_url, os.path.join(local_path, filename))

        except OSError:
            print('OSError occurred!')
            print(pos)
            break

# End the browser session
driver.quit()
```
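
Step 1 copies the search URL by hand. As a minimal sketch, the same kind of URL can be assembled from a keyword with the standard library. `build_image_search_url` is a hypothetical helper name, not part of Crawler.py. It keeps only the `q`, `tbm=isch`, and `hl` parameters from the copied URL, on the assumption that the remaining ones (`ved`, `biw`, `bih`, and so on) are tied to one browser session or window size and are not required.

``` python
from urllib.parse import urlencode


def build_image_search_url(keyword, base="https://www.google.com.tw/search"):
    """Return a Google image search URL for `keyword` (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes,
    # which is how "柴犬" becomes %E6%9F%B4%E7%8A%AC in the copied URL.
    params = {"q": keyword, "tbm": "isch", "hl": "zh-TW"}
    return base + "?" + urlencode(params)


url = build_image_search_url("柴犬")
print(url)
```

The result can be assigned to `url` in the crawler in place of the pasted string, so changing the search keyword no longer means copying a new URL from the browser.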
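
The crawler builds each filename from `img_url.split('/')[-1]`. When `src` is a long URL with a query string, or an inline `data:` URI, that piece can be very long or contain characters the file system rejects, which is one way to end up in the `except OSError` branch. The sketch below shows one possible alternative: a short name built from the image counter, a hash of the URL, and a guessed extension. `image_filename` is a hypothetical helper, and falling back to `.jpg` when no extension can be determined is an assumption, not something the original script specifies.

``` python
import hashlib
import os
from urllib.parse import urlparse


def image_filename(index, img_url, prefix="kerGee"):
    """Build a short, file-system-safe name for an image URL (hypothetical helper)."""
    if img_url.startswith("data:"):
        # data:image/png;base64,...  ->  MIME type "image/png"  ->  ".png"
        mime = img_url[5:].split(";", 1)[0].split(",", 1)[0]
        ext = ("." + mime.split("/")[-1]) if "/" in mime else ".jpg"
    else:
        # Use the extension of the URL path only, ignoring the query string
        ext = os.path.splitext(urlparse(img_url).path)[1] or ".jpg"
    # A short hash keeps names unique without embedding the URL itself
    digest = hashlib.sha256(img_url.encode("utf-8")).hexdigest()[:10]
    return f"{index}{prefix}_{digest}{ext}"


print(image_filename(1, "https://example.com/a/dog.png?size=large"))
print(image_filename(2, "data:image/jpeg;base64,/9j/4AAQ"))
```

In the crawler, `filename = image_filename(m, img_url)` would replace the two lines that compute `ext` and `filename`.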