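# Introduction to Scrapy

###### tags: `Scrapy`,`Crawler`,`Python`

```
Subject: Scrapy study notes
Last updated: 2022-04-07
Version: V1.0
Author: Ya-Sheng Chen (Rock)
```

---

### References

- Why do we need to learn crawling?
    - [The Best And Worst Moments Of My Journey When Crawling Websites](https://towardsdatascience.com/my-journey-of-crawling-website-bd3294322e3c)
- Resource sites:
    - Basic settings and parameter reference: [tutorialspoint](https://www.tutorialspoint.com/scrapy/index.htm)
    - Multiple IPs: [scrapy with tor](https://datawookie.dev/blog/2021/06/scrapy-rotating-tor-proxy/)
    - Crawler tutorials: [崔庆才 (Cui Qingcai) personal site, crawler tutorials](https://cuiqingcai.com/)
- Book: [Learning Scrapy, first edition, PDF (English only)](https://oiipdf.com/dl/5c73b3ed5b63937c3f8b4c77)
- Articles:
    - [Tutorial link](https://www.jianshu.com/p/6c9baeb60044)
    - [What we learned from crawling 100 billion web pages](https://juejin.im/post/5b55901df265da0fb0186fe7)
    - [Using the Scrapy framework: generic Scrapy spiders](https://juejin.im/post/5b026d53518825426b277dd5)
- Visualization: [Scrapy Shell GUI](https://blog.scrapinghub.com/building-spiders-made-easy-gui-for-your-scrapy-shell)
- Tips:
    - [5 Tips To Create A More Reliable Web Crawler](https://towardsdatascience.com/my-journey-of-crawling-website-bd3294322e3c)
    - [Prevent Getting block](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)
- Proxy:
    - [Multiple IPs (paid)](https://infatica.io/)
    - [tor](https://www.torproject.org/)
    - [Tor IP changing and web scraping](https://dm295.blogspot.com/2016/02/tor-ip-changing-and-web-scraping.html)
- Git resources:
    - [Sky (GUI web crawler)](https://github.com/kootenpv/sky)

---

# 0. Preface

Scrapy is arguably the most widely used crawler framework in the Python world, and also one of the hardest to learn: many beginners are unsure where to start. Although it is not easy to pick up, its architecture, runtime performance, and extensibility are all outstanding.

## 0.1 Advantages of Scrapy

* Provides a **built-in HTTP cache** to speed up local development.
* Provides automatic throttling, and **can be configured to obey robots.txt**.
* Lets you **limit the crawl depth**, so the crawler does not get stuck in endless link loops.
* **Keeps sessions automatically**.
* Performs **automatic HTTP basic authentication**, with no need to store state explicitly.
* Can **fill in login forms automatically**.
* Has a built-in middleware that **sets the Referer header on requests automatically**.
* **Follows 3xx redirects**, as well as HTML meta refresh.
* **Avoids getting trapped** by the **`<noscript>` meta redirects** that sites use to detect pages without JS support.
* **Uses CSS selectors** or **XPath** by default to write parsers.
* Can render JavaScript pages through **Splash** or **any other technology (e.g. Selenium)**.
* Has strong community support and a rich set of plugins to extend its features.
* **Provides generic spiders** for common formats such as sitemaps, CSV, and XML.
* Has built-in **support** for exporting scraped data in **multiple formats** (JSON, CSV, XML, JSON Lines) and storing it in multiple backends (FTP, S3, local filesystem).

## 0.2 Scrapy architecture

![Architecture diagram](https://i.imgur.com/xwdGwqb.png)

1. The **Spider** sends the initial requests (`Requests`) to the `Engine`.
2. The **Engine** schedules the `Requests` in the `Scheduler` and asks for the next `Requests` to crawl.
3. The **Scheduler** returns the next `Requests` to the `Engine`.
4. The **Engine** sends the `Requests` to the `Downloader` through the `Downloader Middlewares`.
5. Once the page finishes downloading, the **Downloader** generates a `Response` and sends it to the `Engine` through the `Downloader Middlewares`.
6. The **Engine** receives the `Response` from the `Downloader` and sends it to the `Spider` for processing through the `Spider Middlewares`.
7. The **Spider** processes the `Response` and returns the scraped items (`item`) and new `Requests` to the `Engine` through the `Spider Middlewares`.
8. The **Engine** sends the processed items (`item`) to the `Item Pipelines`, then sends the processed `Requests` to the `Scheduler` and asks for the next possible `Requests` to crawl.

## 0.3 Components

1. **Scrapy Engine**:
    > The core of the framework. It controls the data flow and events of the whole system.
2. **Scheduler**:
    > Receives requests from the `Engine`, puts them in a queue, and decides which URL to crawl next (by default it also filters out duplicate URLs).
3. **Downloader**:
    > Downloads page content (sends HTTP requests and receives HTTP responses) and returns it to the `Engine`, which passes it on to the `Spider`.
4. **Spiders**:
    > Extract the items you need from the pages.
5. **Item Pipeline**:
    > Persists items, validates them, and cleans out unwanted information. After a page is parsed, its items are sent to the item pipeline and processed by several components in a fixed order.
6. **Downloader middlewares**:
    > The layer between the `Scrapy Engine` and the `Downloader`. It processes the `Requests` and `Response` objects passing between the two.

## 0.4 Project directory layout

:::info
|---`project_folder`
|　　｜--`__init__.py`　　　*project definition*
|　　｜--`items.py`　　　　　*item definitions*
|　　｜--`pipelines.py`　　　*pipeline definitions*
|　　｜--`settings.py`　　　　*settings file*
|　　｜--`spiders`　　　　　 *spiders folder*
|　　　　|__ `__init__.py`　*default spider code file*
|__ `scrapy.cfg`　　　　　　*Scrapy run configuration file*
:::

---

# 1. To defeat the enemy, first know the enemy

Crawling means going to a website and taking its data, which makes it a lot like a game of attack and defence. How efficiently you can get the data is the key measure of a crawler. To defeat the enemy you have to know the enemy, so we need to understand how a web page is structured. There are two ways to address that structure:

1. `XPath` **(a path through the XML tree)**
2. `CSS` **(selectors based on the page's style rules)**

We usually use these two to locate the data we want to scrape. Next, let's look at how to write XPath.

---

## 1.1 Writing XPath

### Basic XPath syntax

| Expression | Description |
| ---------- | ---------------------------------------- |
| nodename | Selects all nodes named `nodename` |
| / | Selects from the root node |
| // | Selects matching nodes anywhere below the current node, at any depth |
| \. | The current node |
| \.\. | The parent node |
| \@ | Selects attributes |

`example`

| Expression | Description |
| ---------- | ---------------------------------------- |
| bookstore | Selects all nodes named `bookstore` |
| /bookstore | Selects the `root` element named `bookstore`. `Note: a path that starts with / is an absolute path` |
| bookstore/book | Selects all `book` nodes that are children of `bookstore` |
| //book | Selects all `book` nodes, no matter which level they are on |
| bookstore//book | Selects all `book` nodes that are descendants of `bookstore`, at any depth |
| //@lang | Selects all attributes named `lang` |

### Selecting by position or content

| Expression | Description |
| ------------------------------ | ------------------------------- |
| /bookstore/book[1] | Selects the first `book` child of `bookstore` |
| /bookstore/book[last()] | Selects the last `book` child of `bookstore` |
| /bookstore/book[last()-1] | Selects the second-to-last `book` child of `bookstore` |
| /bookstore/book[position()<3] | Selects the `book` children whose position is less than 3 (the first two) |
| //title[@lang] | Selects all `title` nodes that have a `lang` attribute |
| //title[@lang = 'en'] | Selects all `title` nodes whose `lang` attribute equals 'en' |
| /bookstore/book[price>35.00] | Selects the `book` nodes whose `price` is greater than 35.00 |
| /bookstore/book[price>35.00]/title | Selects the `title` of every `book` whose `price` is greater than 35.00 |

### Wildcards

| Expression | Description |
| ---------- | ----------- |
| * | Matches any element node |
| @* | Matches any attribute node |
| node() | Matches any node of any kind |

### Selecting several paths

| Expression | Description |
| ---------- | ----------- |
| //book/title \| //book/price | Selects the `title` and `price` nodes of all `book` nodes |
| //title \| //price | Selects all `title` and `price` nodes |
| /bookstore/book/title \| //price | Selects the `title` nodes of `book` under `bookstore`, plus all `price` nodes |

Resources:

- [w3schools](https://www.w3schools.com/xml/xpath_syntax.asp)
- [Detailed XPath documentation](https://hal.inria.fr/hal-01612689/document)

#### contains() & starts-with()

![](https://i.imgur.com/4yPyRvw.png)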
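To get a feel for what these expressions select, here is a minimal sketch using only Python's standard library. The sample XML and its values are made up for illustration. `xml.etree.ElementTree` only implements a subset of XPath and has no `contains()` or `starts-with()`, so those two are emulated with plain Python filters here; inside Scrapy, `response.xpath()` accepts the real functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
XML = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title lang="zh">Scrapy Notes</title><price>45.00</price></book>
  <a href="notes.txt">notes</a>
  <a href="index.html">index</a>
  <a href="todo.txt">todo</a>
</bookstore>
"""
root = ET.fromstring(XML)  # root is the <bookstore> element

# //title[@lang='en'] : titles whose lang attribute equals 'en'
english_titles = [t.text for t in root.findall(".//title[@lang='en']")]

# /bookstore/book[last()]/title : title of the last book
last_title = root.find("./book[last()]/title").text

# //a[contains(@href, ".txt")]/@href, emulated with a Python filter
txt_links = [a.get("href") for a in root.iter("a")
             if ".txt" in a.get("href", "")]

# //a[starts-with(@href, "index")]/@href, emulated the same way
index_links = [a.get("href") for a in root.iter("a")
               if a.get("href", "").startswith("index")]

print(english_titles)
print(last_title)
print(txt_links)
print(index_links)
```

`contains()` matches the substring anywhere in the attribute value, while `starts-with()` only matches at the beginning, which is why `index.html` is picked up by the second filter but not the first.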
![](https://i.imgur.com/O6pAwF5.png)

Example:

```
# Select the @href of every <a> whose @href contains ".txt"
# Note: as used here, only one matching result is taken
fe = '//a[contains(@href,".txt")]/@href'
```

# Itemloaders

Use an `ItemLoader` instead of calling `extract()` and `xpath()` by hand.

```
from scrapy.loader import ItemLoader
# Assumption: the item class lives in the project's items.py
from properties.items import PropertiesItem

def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()', re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"][1]/text()')
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()')
    l.add_xpath('image_URL', '//*[@itemprop="image"][1]/@src')
    return l.load_item()
```

`Other processors`

| Processor | Description |
| ---------------------------------------------- | -------------------------------- |
| Join() | Joins multiple results into one |
| MapCompose(str.strip) | Strips surrounding whitespace |
| MapCompose(str.strip, str.title) | Strips whitespace and capitalizes the first letter of each word |
| MapCompose(float) | Converts a string to a number |
| MapCompose(lambda x: x.replace(',',''), float) | Removes commas, then converts the string to a number |
| MapCompose(lambda x: urljoin(response.url, x)) | Joins a relative URL with `response.url` to produce an absolute URL |

Example:

```
from urllib.parse import urljoin
from itemloaders.processors import Join, MapCompose

def parse(self, response):
    link = ItemLoader(item=PropertiesItem(), response=response)
    link.add_xpath('title', '//*[@itemprop="name"][1]/text()')
    link.add_xpath('price', './/*[@itemprop="price"][1]/text()', re='[,.0-9]+')
    link.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                   MapCompose(str.strip), Join())
    link.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                   MapCompose(str.strip))
    link.add_xpath('image_URL', '//*[@itemprop="image"][1]/@src',
                   MapCompose(lambda x: urljoin(response.url, x)))
    return link.load_item()
```

# URL

Use a file as the source of URLs.

```
# Scrapy reads the spider attribute named start_urls
start_urls = [i.strip() for i in open('todo.URL.txt').readlines()]
```

---

# Item

When deciding how to store scraped data, most people think of a Python dict first. Using a dict has some drawbacks, though:

1. The scraped data is **unstructured**, so it cannot be inspected at a glance
2. Field names are not checked, so a typo by the engineer easily causes errors
3. It is inconvenient for later use (for example, API integration)

Using an Item gives the scraped content a structure.

## Item & Field

You can think of Item and Field like this:

Item = table
Field = columns

Defining a class creates the table and declares its fields (column names).

`Define the class`

```
from scrapy import Item, Field

class BookItem(Item):
    name = Field()
    price = Field()
```

```
>>> book = BookItem(name='Needful Things', price=45.0)
>>> book
{'name': 'Needful Things', 'price': 45.0}
```

## Extending an Item

Sometimes we want to reuse an existing Item structure and add new fields to it. This is done through inheritance.

`Define the class`

```
class ForeignBook(BookItem):
    translator = Field()
```

```
>>> book2 = ForeignBook(name='Needful Things', price=45.0, translator='Jant')
>>> book2
{'name': 'Needful Things', 'price': 45.0, 'translator': 'Jant'}
```

# Splash

Resources:

- [splash document](https://readthedocs.org/projects/splash/downloads/pdf/stable/)

## Introduction

Splash is a JavaScript rendering service. It is built on Twisted and QT5, which lets it take full advantage of parallel processing.

### Advantages

* Processes multiple pages in parallel
* Returns the rendered HTML or takes screenshots
* Applies Adblock Plus rules to strip ad-related images and speed up rendering
* Executes custom JavaScript in the page context
* Runs scripts written in Lua
* Lets you develop Splash Lua scripts in Splash-Jupyter Notebooks
* Returns detailed rendering information in HAR format

#### Splash compared with Selenium + WebDriver

* As a JS rendering service, Splash is a lightweight browser engine built on Twisted and QT5, and it exposes an HTTP API directly. Being fast and lightweight makes it easy to use in distributed setups.
* Splash and Scrapy integrate well and complement each other, which gives good efficiency.
* It processes pages faster than Selenium.

### Start a Splash instance with Docker to render JavaScript pages

#### Installing Splash

1. Install Splash with Docker
    `docker pull scrapinghub/splash`
2. Start the service
    `sudo docker run -it -p 8050:8050 --rm scrapinghub/splash`
3. Install scrapy-splash
    `pip install scrapy-splash`
4. Configure the Splash service (all of the following goes in `settings.py`):
    - 1) Add the Splash server address:
        ```
        SPLASH_URL = 'http://localhost:8050'
        ```
    - 2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
        ```
        DOWNLOADER_MIDDLEWARES = {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        }
        ```
    - 3) Enable SplashDeduplicateArgsMiddleware:
        ```
        SPIDER_MIDDLEWARES = {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        }
        ```
    - 4) Set a custom DUPEFILTER_CLASS:
        ```
        DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
        ```
    - 5) Set a custom cache storage backend:
        ```
        HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
        ```

### Splash Shell

With the Scrapy shell you can preview a page as rendered by Splash and test your data extraction against it.

```
scrapy shell 'http://localhost:8050/render.html?url=<url>?menu=1&nav=1&subnav=0#link_0&timeout=10&wait=0.5'
```

### Lua_script

Splash scripts are written in `lua`.

- [Scripting docs](https://splash.readthedocs.io/en/stable/scripting-tutorial.html)
- [Lua link](http://devdoc.net/python/scrapy-splash.html)

Page scrolling example:

`Scroll the page`

```
function scroll_to(splash, x, y)
  local js = string.format(
    "window.scrollTo(%s, %s);",
    tonumber(x),
    tonumber(y)
  )
  return splash:runjs(js)
end

function get_doc_height(splash)
  -- evaljs returns the value of the expression; runjs only reports success
  return splash:evaljs([[
    Math.max(
      Math.max(document.body.scrollHeight, document.documentElement.scrollHeight),
      Math.max(document.body.offsetHeight, document.documentElement.offsetHeight),
      Math.max(document.body.clientHeight, document.documentElement.clientHeight)
    )
  ]])
end

function scroll_to_bottom(splash)
  local y = get_doc_height(splash)
  return scroll_to(splash, 0, y)
end

function main(splash)
  -- e.g. http://infiniteajaxscroll.com/examples/basic/page1.html
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))
  splash:stop()
  for i=1,10 do
    scroll_to_bottom(splash)
    assert(splash:wait(0.2))
  end
  -- older Splash versions used splash:set_viewport("full")
  splash:set_viewport_full()
  return {
    html = splash:html(),
    png = splash:png{width=640},
    har = splash:har(),
  }
end
```

Login example:

```
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  local search_input = splash:select('input[name=email]')
  -- pass the argument itself, not the literal string "args.username"
  search_input:send_text(args.username)
  local search_input = splash:select('input[name=pass]')
  search_input:send_text(args.password)
  assert(splash:wait(0.5))
  local login_button = splash:select('button[name=login]')
  login_button:click()
  assert(splash:wait(3))
  return {
    url = splash:url(),
    html = splash:html(),
    -- return the cookies set by the page so the login session can be reused
    cookies = splash:get_cookies(),
    png = splash:png(),
    har = splash:har(),
  }
end
```

Login `yield` example:

```
yield scrapy_splash.SplashRequest(
    url=self.login_url,
    endpoint="execute",
    args={
        "lua_source": lua_script,
        "username": username,  # available inside the Lua script as args.username
        "password": password,
    },
    callback=self.second_auth,
)
```

---

# Item pipeline

Edit `pipelines.py`.

## Handling null values

Make sure no null data gets inserted: if an item has a null value, drop it instead of storing it.

```
from scrapy.exceptions import DropItem

class DeleteNullTitlePipeline(object):
    def process_item(self, item, spider):
        title = item['title']
        if title:
            return item
        else:
            raise DropItem('found null title %s', item)
```

Then enable the ==class== `DeleteNullTitlePipeline` by adding it to `settings.py`:

```
ITEM_PIPELINES = {
    'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
    'myFirstScrapyProject.pipelines.DeleteNullTitlePipeline': 200,
}
```

Pipelines run in order of their number: the larger the number, the later the pipeline runs.

Then run the crawler:

`scrapy crawl ptt -o ptt.json`

If an item with a null value comes through, a warning appears in the crawler log:

```
[scrapy.core.scraper] WARNING: Dropped: ('found null title %s', {'author': '-', 'date': '10/23', 'href': None, 'push': None,'title': None})
{'author': '-', 'date': '10/23', 'href': None, 'push': None, 'title': None}
```

## Handling duplicates

In this example, an item is dropped if its title has already been seen.

```
class DuplicatesTitlePipeline(object):
    def __init__(self):
        # titles seen so far in this run
        self.article = set()

    def process_item(self, item, spider):
        title = item['title']
        if title in self.article:
            raise DropItem('duplicates title found %s', item)
        self.article.add(title)
        return item
```

Add it to `settings.py`:

```
ITEM_PIPELINES = {
    'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
    'myFirstScrapyProject.pipelines.DeleteNullTitlePipeline': 200,
    'myFirstScrapyProject.pipelines.DuplicatesTitlePipeline': 400,
}
```

---

# Data storage

## Mongo

**Install the pymongo API**

```
pip install pymongo
```

**Connect to Mongo**

```
from pymongo import MongoClient

class MongoDBPipeline:
    def open_spider(self, spider):
        db_uri = spider.settings.get('MONGODB_URI', 'mongodb://localhost:27017')
        db_name = spider.settings.get('MONGODB_DB_NAME', 'ptt_scrapy')
        # connect with the URI from settings rather than a hard-coded one
        self.db_client = MongoClient(db_uri)
        self.db = self.db_client[db_name]

    def process_item(self, item, spider):
        self.insert_article(item)
        return item

    def insert_article(self, item):
        item = dict(item)
        self.db.article.insert_one(item)

    def close_spider(self, spider):
        self.db_client.close()
```

Add it to `settings.py`:

```
MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DB_NAME = 'ptt_scrapy'
ITEM_PIPELINES = {
    'myFirstScrapyProject.pipelines.MongoDBPipeline': 400,
}
```

---

# Notes

A spider named `basic` was created from the built-in template `basic`: `properties.spiders.basic`

## Troubleshooting tips

| Problem | Solution |
| -------------------------------------------------- | ------------------------------------- |
| Something specific to the site being crawled | Modify the spider |
| Modifying or storing Items for a specific domain, possibly reused across the whole project | Write an Item Pipeline |
| Modifying or dropping requests or responses for a specific domain, possibly reused across the whole project | Write a spider middleware |
| Executing requests and responses, e.g. supporting a custom login or special cookie handling | Write a downloader middleware |
| Anything else | Write an extension |

---

## Difficulties crawlers run into

- Level 1: js
- Level 2: iframe
- Level 3: no url
- Level 4: block turn page

# Adding a proxy to rotate IPs automatically

## First, build a proxy IP spider to fetch a proxy list

This example uses [US-PROXY](https://www.us-proxy.org/) to get a list of free proxies.

`Proxy_Spider.py`

```
import json
import scrapy
import pandas as pd
import share_config as shc  # the author's own module holding request headers


class Proxy_Spider(scrapy.Spider):
    name = "Proxy_Spider"
    # allowed_domains takes a list of domain names, not a URL
    allowed_domains = ['www.us-proxy.org']

    def start_requests(self):
        url = 'https://www.us-proxy.org/'
        yield scrapy.Request(
            url=url,
            headers=shc.google_headers,
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        r = response
        # Get the proxy table on the page
        xp_selector = r.xpath('/html/body/section[1]/div/div[2]/div/table')
        html_txt = xp_selector.get()
        df = pd.read_html(html_txt)[0]
        df['Scheme'] = df['Https'].apply(lambda x: 'https' if x == 'yes' else 'http')
        combine_http = lambda x: "{scheme}://{ip}:{port}".format(
            scheme=x['Scheme'], ip=x['IP Address'], port=x['Port'])
        df['Proxy'] = df[['Scheme', 'IP Address', 'Port']].apply(combine_http, axis=1)
        df = df.fillna('Null')
        df_dict = df.to_dict('records')
        df_dict = json.dumps(df_dict, indent=4)
        with open('./proxy.json', 'w') as f:
            f.write(df_dict)
```

## Build the proxy middleware

`middlewares.py`

```
import os
import json
import random
from scrapy import signals
from collections import defaultdict
from scrapy.exceptions import NotConfigured
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class RandomProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding="latin-1", proxy_list_file=None):
        if (not proxy_list_file) or (not os.path.exists(proxy_list_file)):
            raise NotConfigured
        self.auth_encoding = auth_encoding
        # scheme -> list of (credentials, proxy URL)
        self.proxies = defaultdict(list)
        with open(proxy_list_file) as f:
            proxy_list = json.load(f)
            for proxy in proxy_list:
                scheme = proxy["Scheme"]
                url = proxy["Proxy"]
                self.proxies[scheme].append(self._get_proxy(url, scheme))

    @classmethod
    def from_crawler(cls, crawler):
        auth_encoding = crawler.settings.get("HTTPPROXY_AUTH_ENCODING", "latin-1")
        proxy_list_file = crawler.settings.get("PROXY_LIST_FILE")
        return cls(auth_encoding, proxy_list_file)

    def _set_proxy(self, request, scheme):
        # pick a random proxy for this scheme on every request
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta["proxy"] = proxy
        if creds:
            # note the space after "Basic"
            request.headers["Proxy-Authorization"] = b"Basic " + creds
```

`settings.py`

```
DOWNLOADER_MIDDLEWARES = {
    'projectname.middlewares.RandomProxyMiddleware': 745,
}
# Location of the proxy file
PROXY_LIST_FILE = './proxy.json'
```

## Testing

Build a test spider.

`Test_Proxy.py`

```
import scrapy


class ProxyExampleSpider(scrapy.Spider):
    name = "Test_Proxy"
    # start_urls = ['https://httpbin.org/ip']

    def start_requests(self):
        for i in range(10):
            yield scrapy.Request('https://httpbin.org/ip', dont_filter=True)

    def parse(self, response):
        print(response.meta['proxy'])
        print(response.text)
```

- run
    ```
    scrapy crawl Test_Proxy
    ```
- result:
    <img src="https://i.imgur.com/SbTg0Xv.png" width=600>

> If you see the IP changing, it works!
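
The selection logic of `RandomProxyMiddleware` can also be tried without running a crawl. The following is a minimal standard-library sketch, not the middleware itself: `load_proxies` and `pick_proxy` are hypothetical helper names, and the proxy entries are placeholder addresses in the shape that `Proxy_Spider.py` writes to `proxy.json`.

```python
import json
import random
from collections import defaultdict

# Hypothetical entries in the shape written to proxy.json;
# the addresses are documentation-range placeholders, not real proxies.
PROXY_JSON = json.dumps([
    {"Scheme": "http", "Proxy": "http://192.0.2.10:8080"},
    {"Scheme": "https", "Proxy": "https://192.0.2.11:3128"},
    {"Scheme": "http", "Proxy": "http://192.0.2.12:80"},
])


def load_proxies(text):
    """Group proxy URLs by scheme, similar to the middleware's __init__."""
    proxies = defaultdict(list)
    for entry in json.loads(text):
        proxies[entry["Scheme"]].append(entry["Proxy"])
    return proxies


def pick_proxy(proxies, scheme, rng=random):
    """Return a random proxy for the scheme, or None if there is none."""
    candidates = proxies.get(scheme, [])
    return rng.choice(candidates) if candidates else None


proxies = load_proxies(PROXY_JSON)
print(pick_proxy(proxies, "http"))   # one of the two http proxies
print(pick_proxy(proxies, "ftp"))    # no ftp proxies in the list
```

Grouping by scheme first keeps an `https` request from being sent through an `http`-only proxy, and choosing at random on every request is what makes the IP change between the ten test requests.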