# Web Scraping Practice Notes

A web scraper saves us the time of browsing a site by hand: it automatically fetches specific content from a website, and it can be combined with other tools to store the data in a particular format that is easy to review.

<div class="name">By 朱禹澄 (中央資工)</div>
<style>
.name{
  text-align:right;
}
</style>

:::success
Modules we need:
- requests: downloads the web page
- beautifulsoup4: parses the page and extracts the data
- json: converts the scraped data to JSON format
- pandas: holds the data as a DataFrame (df)
- openpyxl: writes the df to an Excel spreadsheet

(If you run into any problem installing a module, you can simply ask ChatGPT.)
:::

![image](https://hackmd.io/_uploads/H1ri7rG_A.png)
<div class="name">Image source: CodeShiba程式柴</div>

---

:::spoiler Complete code (Excel version)
```python=
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = "https://www.ptt.cc/bbs/NBA/index.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.find_all("div", class_="r-ent")

data_list = []
for a in articles:
    data = {}

    title = a.find("div", class_="title")
    if title and title.a:
        title = title.a.text
    else:
        title = "沒有標題"  # "No title": the post was deleted
    data["標題"] = title  # key means "title"

    popularity = a.find("div", class_="nrec")
    if popularity and popularity.span:
        popularity = popularity.span.text
    else:
        popularity = "N/A"
    data["人氣值"] = popularity  # key means "popularity"

    date = a.find("div", class_="date")
    if date:
        date = date.text
    else:
        date = "N/A"
    data["日期"] = date  # key means "date"

    data_list.append(data)

df = pd.DataFrame(data_list)
df.to_excel("ptt_nbaa.xlsx", index=False, engine="openpyxl")
print(df)
print("資料已成功儲存在: ptt_nbaa.xlsx 中")  # "Data saved to ptt_nbaa.xlsx"
```
:::

---

## Step 1: Fetch the Page

:::info
#### 1-1 The golden six lines
:::

```python=
import requests
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15"}
url = "https://www.ptt.cc/bbs/NBA/index.html"
response = requests.get(url, headers=headers)
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
```

:::spoiler Line-by-line explanation
- First we import the requests module (run `pip install requests` in the terminal beforehand).
- To look like a regular visitor, we add a User-Agent request header so the site's anti-scraping mechanism is less likely to block us.
- Define the variable `url` as the address of the page we want to scrape.
- Define the variable `response` as the result of fetching `url` with the `requests.get` method.
- Write the contents of `response.text` to a file named `output.html`, encoded as UTF-8.

![useragent](https://hackmd.io/_uploads/SkB9Frf_A.png)
:::

---

## Step 2: Inspect the Page
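As a complement to the browser inspector used in this step, a short script can list which `div` classes occur in the downloaded HTML, which hints at where the repeated blocks of data live. The sketch below is only an illustration: it uses the standard library's `html.parser`, and the names `DivClassCounter`, `count_div_classes`, and `SAMPLE_HTML` are hypothetical. The sample markup is a simplified imitation of a PTT index page, not real scraped data.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count how often each class name appears on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.counts.update(value.split())


def count_div_classes(html):
    """Return a Counter mapping div class names to their number of occurrences."""
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical, simplified sample that imitates the structure of a PTT index page.
SAMPLE_HTML = """
<div class="r-ent">
  <div class="nrec"><span class="hl f3">12</span></div>
  <div class="title"><a href="/bbs/NBA/M.1.html">[Example] first post</a></div>
  <div class="meta"><div class="date"> 7/21</div></div>
</div>
<div class="r-ent">
  <div class="nrec"></div>
  <div class="title">(deleted post)</div>
  <div class="meta"><div class="date"> 7/21</div></div>
</div>
"""

if __name__ == "__main__":
    for class_name, count in count_div_classes(SAMPLE_HTML).most_common():
        print(f"{class_name}: {count}")
```

To try it on the real page, read the `output.html` file saved in Step 1 and pass its text to `count_div_classes`. A class that repeats once per post is a good candidate for the block to target.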
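:::info
#### 2-1 Right-click -> Inspect -> click the icon two places to the left of the Elements tab -> find which element holds the data you want to scrape
:::

:::spoiler Step-by-step screenshots
![image](https://hackmd.io/_uploads/ry2DSHzuR.png)
![image](https://hackmd.io/_uploads/Hyjcrrz_R.png)
![image](https://hackmd.io/_uploads/S1_seeAOR.png)
![image](https://hackmd.io/_uploads/Hk_0BSGd0.png)
![image](https://hackmd.io/_uploads/SJsw8BzuC.png)
:::

---

## Step 3: Parse the Page

:::info
#### 3-1 The `soup` variable: pass the `response` we just got (the scraped page data) into `BeautifulSoup()` to parse it
:::

```python=
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML in response.text with the "html.parser" parser
articles = soup.find_all("div", class_='r-ent')  # find_all collects every matching element into the list-like articles
# print(articles[0])  # try this in the terminal to check that the HTML of the first r-ent block is printed
```

---

:::info
#### 3-2 One picture explains everything
:::

```python=
soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML in response.text with the "html.parser" parser
articles = soup.find_all("div", class_='r-ent')  # find_all collects every matching element into the list-like articles
print(articles[0])
```

![image](https://hackmd.io/_uploads/Sy15vgCO0.png)

:::spoiler
### **To get one specific item, we have to find the HTML that corresponds to it**

**(for example, the title -> div, 'title', a, "article title text")**

**(it works like attribute access, e.g. title.a)**

![image](https://hackmd.io/_uploads/BkUCnBG_A.png)

---

**Notice that the whole entry row is contained in "r-ent" (including the popularity under "nrec" and the title under "title")**

![image](https://hackmd.io/_uploads/SJirGe0OC.png)
:::

---

:::info
#### 3-3 To extract every title, we can iterate over the whole articles list with a for loop
:::

:::danger
- **The loop variable `a` and the HTML `<a>` tag are different things; do not mix them up**
- **A variable's value can be overwritten again and again**
:::

![image](https://hackmd.io/_uploads/B1VsixCuC.png)

:::spoiler Three points to note
- A for loop iterates over every element that find_all put in the articles list.
- `.text` makes the output plain text rather than HTML markup.
- One of the items found (one of the class="r-ent" blocks in articles) has nothing in it. That post was deleted, so all of its content is gone and only an empty shell remains.
:::

---

**Problem: some posts may have been deleted, leaving only an empty shell, so `title.a` is None and reading `.text` from it raises an error.**

**Solution: check whether the "r-ent" entry was deleted and, if so, handle it separately to avoid the error.**

```python=
for a in articles:  # each post entry is one a
    title = a.find("div", class_="title")
    # the a in title.a is not the loop variable a; it is the HTML link tag under title (the a in <a href=...>)
    if title and title.a:  # there is a title div and it has an <a> inside
        title = title.a.text  # the post has a title, so rebind title to the title text
    else:
        title = "沒有標題"  # "No title": the post was deleted
    print(title)
```

:::spoiler Output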
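![image](https://hackmd.io/_uploads/H1GRMbRu0.png)
:::

:::spoiler **If we also want the posting date, popularity, and so on, we can do this...**
**Step 1: first find the HTML blocks that hold the popularity and the date**

![image](https://hackmd.io/_uploads/B1-nrWCOC.png)

**Step 2: extract each field with the matching lookup**

```python=
for a in articles:  # each post entry is one a
    title = a.find("div", class_="title")
    # the a in title.a is not the loop variable a; it is the HTML link tag under title (the a in <a href=...>)
    if title and title.a:  # there is a title div and it has an <a> inside
        title = title.a.text  # the post has a title, so rebind title to the title text
    else:
        title = "沒有標題"  # "No title": the post was deleted
    # print(title)

    popularity = a.find("div", class_="nrec")
    if popularity and popularity.span:
        popularity = popularity.span.text  # plain text inside the span under nrec
    else:
        popularity = "N/A"

    date = a.find("div", class_="date")
    if date:  # the date block directly contains the text we want
        date = date.text
    else:
        date = "N/A"

    print(f"標題:{title} 人氣值:{popularity} 日期:{date}")  # title, popularity, date
```

![image](https://hackmd.io/_uploads/Sy-iubAdR.png)
:::

---

## Step 4: Save the Content

#### Save it so it can be reused later

:::info
#### 4-1 Structured data: first add the data to a list as dictionaries of key/value pairs (think about what the structure should look like)
:::

Each element of data_list holds the data from one r-ent entry (that entry's title, popularity, and date).

:::danger
- When a dictionary key is repeated, the new value overwrites the old one
:::

```python=
data_list = []  # container for the data from every "r-ent" block in articles
for a in articles:
    data = {}  # holds the title, popularity, and date of a single "r-ent" block

    title = a.find("div", class_="title")
    ...  # same handling as in 3-3
    data["標題"] = title  # key: "標題" (title), value: title

    popularity = a.find("div", class_="nrec")
    ...
    data["人氣"] = popularity  # key means "popularity"

    date = a.find("div", class_="date")
    ...
    data["時間"] = date  # key means "time"

    data_list.append(data)  # [ {k11: val11, k12: val12, ...}, {k21: val21, k22: val22, ...}, ... ]

print(data_list)  # test
print(data_list[0])  # test
```

:::spoiler Output
![image](https://hackmd.io/_uploads/S1JwyUCuC.png)
:::

:::info
#### 4-2 Save the structured data as a JSON file
:::

:::success
Think about it: what is the JSON format used for? Is it the format we receive when fetching data from a server? Anything else?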
:::

```python=
import json

with open("ptt_nba_data.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps the Chinese text readable; indent=4 pretty-prints the file
    json.dump(data_list, file, ensure_ascii=False, indent=4)
print("資料已經成功儲存為: ptt_nba_data.json")  # "Data saved as ptt_nba_data.json"
```

:::spoiler Output
![image](https://hackmd.io/_uploads/SJN0yLA_C.png)
![image](https://hackmd.io/_uploads/S1qQlUC_C.png)
:::

:::info
#### 4-3 Save the structured data as an Excel file: use pandas to convert the list data_list into a DataFrame
:::

:::danger
- When naming files, prefer lowercase English letters and underscores; avoid uppercase letters, spaces, and special symbols
:::

```python=
import pandas as pd

df = pd.DataFrame(data_list)
df.to_excel("ptt_nba.xlsx", index=False, engine="openpyxl")
```

:::spoiler Detailed explanation
![image](https://hackmd.io/_uploads/rJ3I9tR_0.png)
- A DataFrame is a pandas data structure, similar to an Excel sheet or a table in a SQL database.
- `df.to_excel` -> writes the DataFrame `df` to an Excel file.
- `"ptt_nba.xlsx"` -> the name of the Excel file to save.
- `index=False` -> do not write the DataFrame's index to the Excel file. If set to True, the index becomes the first column of the file.
- `engine="openpyxl"` -> use openpyxl as the engine for writing the Excel file. openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

![image](https://hackmd.io/_uploads/BJZP9tAOR.png)
:::

:::spoiler Results
![image](https://hackmd.io/_uploads/ryq9YtC_A.png)
![image](https://hackmd.io/_uploads/By_pdt0dR.png)
:::
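
Since the point of saving is to reuse the data later, it may help to see the JSON file from 4-2 being read back. The following is a minimal sketch under stated assumptions: `sample_rows` is made-up data shaped like `data_list`, and the helper names `save_rows`, `load_rows`, and `round_trip` are hypothetical. It writes to a temporary folder so it does not touch any real file.

```python
import json
import os
import tempfile

# Hypothetical sample rows in the same shape as data_list (not real scraped data).
sample_rows = [
    {"標題": "[Example] first post", "人氣值": "12", "日期": " 7/21"},
    {"標題": "沒有標題", "人氣值": "N/A", "日期": " 7/21"},
]


def save_rows(rows, path):
    """Write a list of dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        # ensure_ascii=False stores Chinese characters as-is instead of \uXXXX escapes.
        json.dump(rows, file, ensure_ascii=False, indent=4)


def load_rows(path):
    """Read the JSON file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)


def round_trip(rows):
    """Save rows to a temporary file, load them again, and return the result."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "ptt_nba_data.json")
        save_rows(rows, path)
        return load_rows(path)


if __name__ == "__main__":
    restored = round_trip(sample_rows)
    print(restored == sample_rows)
    print(restored[0]["標題"])
```

In a later session, calling `load_rows("ptt_nba_data.json")` on the real file should give back the same list of dictionaries, which can then be passed to `pd.DataFrame` without scraping the page again.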