Python Web Scraping

Introduction

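Web scraping is useful for many things, e.g. collecting stock prices or weather data.
I have recently been looking for a place to rent through Facebook groups, so I decided to try scraping them with Python.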

Key Points

You will learn how to use Python packages to scrape web pages.

Part.0

Packages Used

selenium: a toolkit for automating web browsers.
BeautifulSoup: parses HTML and XML.
requests: an HTTP library for Python.
Others… (data-processing packages such as pandas and numpy)

Part.1

Basic Setup

First, import the packages (obviously).

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests
import time

Decide which page you want to scrape.

url = 'https://www.facebook.com/groups/2427883260776511'
# Using a Taipei rental group as the example

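Some groups restrict access and require you to log in, so we first log in through webdriver.
A browser opened by webdriver starts with a clean profile, so you have to log in every time it launches. Wrapping the login steps in a function makes them reusable during the Facebook scraping stage.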

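from selenium.webdriver.chrome.service import Service

def login_fb():
    # Chrome shows a notification prompt on launch that blocks automation, so disable it
    options = webdriver.ChromeOptions()
    prefs = {'profile.default_content_setting_values': {'notifications': 2}}  # 2 = block notifications
    options.add_experimental_option('prefs', prefs)
    email = "your-email"        # Facebook account email
    password = "your-password"  # Facebook account password
    # Selenium 4 takes the chromedriver path through a Service object
    driver = webdriver.Chrome(service=Service(executable_path="./chromedriver"), options=options)
    driver.get('https://www.facebook.com/')  # open the Facebook login page
    driver.find_element('id', 'email').send_keys(email)    # type into the email field
    driver.find_element('id', 'pass').send_keys(password)  # type into the password field
    driver.find_element('name', 'login').click()  # submit the login form
    return driver  # hand the session back so the caller can keep using it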
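The login function above keeps the account details inside the script. As a minimal sketch of one alternative, the helper below reads them from environment variables instead. The names `load_credentials`, `FB_EMAIL`, and `FB_PASSWORD` are hypothetical and chosen only for this example; they are not part of selenium or Facebook.

```python
import os


def load_credentials(env=None):
    """Return (email, password) read from environment variables.

    FB_EMAIL and FB_PASSWORD are hypothetical variable names used only
    in this sketch; pick whatever names suit your setup.
    """
    env = os.environ if env is None else env
    email = env.get("FB_EMAIL")
    password = env.get("FB_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set FB_EMAIL and FB_PASSWORD before running the scraper")
    return email, password
```

Inside `login_fb()`, the two hardcoded strings could then be replaced with `email, password = load_credentials()`, which keeps the password out of the source file.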

You will then notice that the browser closes right after logging in, so add a delay with time to keep it open.

time.sleep(5)  # pause for 5 seconds
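A fixed five-second pause can be too short on a slow connection and wasteful on a fast one. The sketch below is a generic polling helper in plain Python that waits only as long as needed. `wait_until` is a hypothetical name for this example and is not a selenium function; selenium ships its own `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # short pause between checks
```

With a live driver, something like `wait_until(lambda: driver.find_elements('id', 'email') == [])` would wait until the login form's email field is no longer on the page. That condition is an assumption about how the page behaves after login, so adjust it to whatever you actually observe.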

Part.2

Extracting Page Elements

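If you are not familiar with HTML tags and attributes, I will write a separate post introducing HTML later.
In short:
HTML describes a page as elements built from tags, for example <div>this is a container tag</div>.
Press F12 in the browser to inspect a page's elements.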
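To make "tags and attributes" concrete, here is a small sketch that walks an HTML string and lists each tag that contains text, along with its attributes. It uses the standard library's `html.parser` as a stand-in rather than BeautifulSoup, so it runs without installing anything. `TagLister`, `list_elements`, and the `sample` markup are made up for illustration and do not reflect Facebook's real page structure.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, attributes, text) for every element that directly holds text.

    Simplified on purpose: void tags such as <br> are not handled.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open, innermost last
        self.found = []  # (tag, attrs dict, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, attrs = self.stack[-1]
            self.found.append((tag, attrs, text))


def list_elements(html):
    """Parse an HTML string and return the collected (tag, attrs, text) tuples."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical markup, only for demonstration
sample = '<div class="post"><p>Room for rent</p><a href="/groups/1">details</a></div>'
for tag, attrs, text in list_elements(sample):
    print(tag, attrs, text)
```

BeautifulSoup offers a much more convenient interface for the same idea, but the principle is the same: find the tag, then read its attributes and text.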

To be added.

tags: `Python`, `Web`
