
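Scrapy Tutorial
===

## Prerequisites

Scrapy is a Python crawling framework for scraping large amounts of web data. These notes record how to crawl all articles on PTT. The PyPtt library can also be used to access PTT, but it requires logging in with an account and password and may lead to restrictions on the account, so an ordinary web crawler is used here instead.

### Environment

* Operating system: Ubuntu 22.04
* Python: 3.10.16
* Docker: used to run mongodb and mongo-express
* Virtual environment: `pipenv` rather than `conda`; either works the same
* Development environment: VS Code on Windows, connected remotely to Ubuntu

### Installing scrapy

See here: https://hackmd.io/uvrlutTrT92ohSPl8jk9zw?view#%E5%AE%89%E8%A3%9Dscrapy

## Creating the project

### Creating the scrapy project

Use the `startproject` command to create the project. This must be done from the command line, after first entering the `selenium` virtual environment. In VS Code, press <kbd>Ctrl</kbd>-<kbd>F5</kbd>.

```shell=
cd ~/workspace/selenium
pipenv shell
```

Next, create the project.

```shell=
cd ~/workspace/scrapy
scrapy startproject ptt_project
```

```
(selenium) (base) must@must:~/workspace/scrapy$ scrapy startproject ptt_project
New Scrapy project 'ptt_project', using template directory '/home/must/.local/share/virtualenvs/selenium-strGPMzh/lib/python3.10/site-packages/scrapy/templates/project', created in:
    /home/must/workspace/scrapy/ptt_project

You can start your first spider with:
    cd ptt_project
    scrapy genspider example example.com
```

### Creating the spider

Use the `genspider` command to create the spider. Note that you have to descend through two directories with the same name: the project we just created is called `ptt_project`, and `startproject` creates it as two nested levels.

```shell=
cd ptt_project/ptt_project/spiders
scrapy genspider ptt www.ptt.cc
```

The result is as follows:

```shell=
must@must:~/workspace/scrapy/ptt_project/ptt_project/spiders$ scrapy genspider ptt www.ptt.cc
Created spider 'ptt' using template 'basic' in module:
  ptt_project.spiders.ptt
```

After creation, the project structure looks like this:

```
.
├── base.yaml
├── ptt_project
│   ├── ptt_project
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── ptt.py
│   └── scrapy.cfg
└── 獲得ptt所有看版.ipynb

3 directories, 10 files
(selenium) (base) must@must:~/workspace/scrapy$
```

The main spider program is `ptt_project/ptt_project/spiders/ptt.py`.

### Running the spider

From the `spiders` directory, simply run the following command:

```shell=
scrapy crawl ptt
```

### Spider code walkthrough

`ptt.py`

```python=
import scrapy


class PttSpider(scrapy.Spider):
    name = "ptt"  # spider name, as used in `scrapy crawl ptt`
    allowed_domains = ["www.ptt.cc"]  # domain the spider is allowed to crawl
    start_urls = ["https://www.ptt.cc"]  # URL(s) requested first

    def parse(self, response):
        # Called with the response for each start URL; for now it only prints a marker.
        print("明新科技大學資管系")
        # pass
```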
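
To make the `allowed_domains` setting in the spider a little more concrete, here is a minimal sketch of the idea behind it: with Scrapy's offsite filtering enabled, requests are only followed when the host is one of the listed domains or a subdomain of one. This is a simplified approximation written with the standard library only, not Scrapy's actual implementation. The helper name `is_allowed` and the sample URLs are assumptions made up for illustration.

```python
from urllib.parse import urlparse

# Value copied from the generated spider (ptt.py).
ALLOWED_DOMAINS = ["www.ptt.cc"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration only; it is not part of Scrapy's API.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


if __name__ == "__main__":
    # Sample URLs are illustrative assumptions.
    for url in (
        "https://www.ptt.cc/bbs/index.html",
        "https://ptt.cc/",
        "https://example.com/",
    ):
        print(url, is_allowed(url))
```

Under this sketch, the bare `ptt.cc` host is not covered by `www.ptt.cc`, so if links to other PTT hosts need to be followed, listing the shorter domain in `allowed_domains` would probably be the simpler choice.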