
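# Building a Web Scraper Application

黃安聖

###### tags: `Python程式設計與網頁爬蟲應用程式實作(一)`

----

### What is a web scraper?

----

### A basic web scraper

It sends an HTTP request to a server to fetch the source of a specific page (usually ==HTML==), then extracts specific text from that HTML.

----

### pyquery

[![](https://i.imgur.com/khx6BDh.png)](https://pythonhosted.org/pyquery/)

----

### pyquery

A Python module that covers both ==fetching page source== and ==content extraction rules==.

----

#### Content extraction rules

Pyquery uses [CSS selectors](https://zh.wikipedia.org/zh-tw/CSS_%E6%BF%BE%E5%99%A8) as its rules for filtering and extracting page content. This is the same selection syntax web designers use when describing how content is styled.

----

### Importing pyquery into your project

```python=
from pyquery import PyQuery
```

----

### If you see ModuleNotFoundError

It means the pyquery module has not been installed in your environment yet.

----

To fix it, install pyquery with the command below (written for a notebook cell; drop the leading `!` when running it in a terminal).

```shell=
!pip install pyquery
```

----

### Fetching the full page source

```python=
# Replace TARGET_URL with the address of the page to scrape
html = PyQuery(url='TARGET_URL')
print(html)  # print the full page source
```

----

Most of the page source is not what we are after, so we use ==CSS selector extraction rules== to pull out the specific information we need.

----

### Basic structure of HTML

A page's content is made up of many ==HTML tags==, and each tag represents a different kind of content.

----

A tag can carry several attributes, and all of them are ==useful extraction conditions==.

```htmlmixed=
<tagname attr1="value" attr2="value">
  content
</tagname>
```

----

### Extraction rules

- Tag name: `tagname`
- class: `.classname`
- id: `#idname`
- Other attributes: `[attrname=value]`

----

```python=
from pyquery import PyQuery as pq

html = '''
<h1 class="title">Hi, I am the target text</h1>
<h1>Hi, I am just ordinary text</h1>
'''

selector = pq(html)
# Select the <h1> whose class is "title"
print(selector('h1.title'))
```

----

```python=
from pyquery import PyQuery as pq

html = '''
<h1 id="title1">Hi, I am the target text</h1>
<h1>Hi, I am just ordinary text</h1>
'''

selector = pq(html)
# Select the <h1> whose id is "title1"
print(selector('h1#title1'))
```

----

```python=
from pyquery import PyQuery as pq

html = '''
<h1 class="title" data-title="important">Hi, I am the target text</h1>
<h1 class="title">Hi, I am just ordinary text</h1>
'''

selector = pq(html)
# Both <h1> tags share the class, so the attribute narrows it down
print(selector('h1.title[data-title="important"]'))
```

----

### Extraction rules

- Tag name: `h1`
- class: `.title`
- id: `#title1`
- Other attributes: `[data-title=important]`

----

### In-class exercise

Try scraping the ==Bank of Taiwan posted exchange rates==

https://rate.bot.com.tw/xrt?Lang=zh-TW

----

## File handling (1)

Try writing the scraped content to a txt file.

```python=
# encoding='utf-8' keeps non-ASCII scraped text intact on every platform
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write('Line 1\n')
    f.write('Line 2\n')
    f.write('Line 3\n')
```

----

## File handling (2)

To write an xlsx file, first install xlsxwriter.

```
pip install xlsxwriter
```

----

## Importing xlsxwriter

```python=
import xlsxwriter
```

----

## Creating a worksheet

```python=
workbook = xlsxwriter.Workbook('filename.xlsx')
# Create a worksheet
worksheet = workbook.add_worksheet()
```

----

## Writing to cells

```python=
# Write data into a cell (row and column are zero-indexed)
# worksheet.write(row, column, "data to add")
worksheet.write(0, 0, "0.0")  # first row, first cell
worksheet.write(0, 1, "0.1")  # first row, second cell
# Close and save the file
workbook.close()
```

----

### In-class exercise

Take the data you just scraped and use Python to write it to an xlsx file.

|Currency|Cash buy|Cash sell|
|-|-|-|
|USD|30.91|30.21|
|HKD|3.120|3.27|

----

#### Scraping target ideas (1)

104 job listings

https://www.104.com.tw/jobs/search/?ro=0&order=11&asc=0&zone=16&page=1&mode=s&jobsource=2018indexpoc

----

#### Scraping target ideas (2)

Precious metals trading center

https://www.truney.com/product-category/silver/silver-coins/

----

#### Scraping target ideas (3)

科技新報 (TechNews)

https://technews.tw/

----

#### Scraping target ideas (4)

數位時代 (Business Next)

https://www.bnext.com.tw/

----

#### Final project presentation (up to 5 minutes each)

1. Demonstrate the application running (a screen recording is fine)
2. What difficulties did you run into while building it? What features would you like to add in the future?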