Web crawler in practice: scraping PTT [爆] posts

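###### tags: `web crawler`,`網路爬蟲`

# Web crawler in practice: scraping PTT [爆] posts

**Fetch and count today's [爆] posts on the PTT Gossiping board**

<font color="#f00">Information</font>

Sending `myCookies={"over18":"1"}` with every request skips the "I agree" age-confirmation page, so you do not have to click through it on each fetch.

![](https://i.imgur.com/XYirFA5.jpg)

As the screenshot below shows, a single PTT page lists only 15 posts, so this script follows the paging link and crawls multiple pages.

![](https://i.imgur.com/MbNksih.jpg)

```python
import datetime  # the datetime module handles today's date

import requests
from bs4 import BeautifulSoup

myurl = "https://www.ptt.cc/bbs/Gossiping/index.html"
myCookies = {"over18": "1"}  # marks the age confirmation as already accepted

today1 = datetime.date.today()  # e.g. 2022-06-27
# PTT prints dates as "M/DD": a single-digit month is padded with a leading
# space and the day keeps its leading zero, e.g. " 6/27" or "12/05".
today_now = f"{today1.month:>2}/{today1.day:02d}"

times = 100  # budget of non-today posts; keeps pinned admin announcements from ending the crawl early
hot = 0      # number of [爆] posts today
tot = 0      # number of all posts today

while True:
    rq1 = requests.get(myurl, cookies=myCookies).text  # send the over18 cookie with the request
    soup = BeautifulSoup(rq1, "html5lib")  # requires the html5lib package
    for mysoup in soup.find_all("div", "r-ent"):
        if mysoup.find("div", "date").text != today_now:  # post is not from today
            times = times - 1                             # so use up one unit of the budget
        else:
            tot = tot + 1
            try:
                if mysoup.find("div", "nrec").text == "爆":  # push count is at the [爆] level
                    hot = hot + 1
                    print("Date: ", mysoup.find("div", "date").text)
                    print(mysoup.find("div", "title").text.strip())  # title
                    print("https://www.ptt.cc" + mysoup.find("div", "title").a["href"])  # link
                    print("Author: ", mysoup.find("div", "author").text)
                    print("=================================================")
            except (AttributeError, TypeError):
                # an expected tag is missing (e.g. an entry without a link); skip it
                continue
    if times <= 0:  # budget used up: leave the loop
        break
    # the second link in the paging group points to the previous (older) page
    myurl = "https://www.ptt.cc" + soup.find("div", "btn-group btn-group-paging").find_all("a")[1]["href"]

print("Today is: ", today1)
print("Total posts today: ", tot)
print("Total [爆] posts today:", hot)
```

**Spyder's variable explorer (handy for beginners who want to check the type and format of each value)**

![](https://i.imgur.com/te6Bj0d.jpg)

**Result**

![](https://i.imgur.com/gAH5h8P.jpg)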
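
The per-page counting and the `times` stopping rule can also be tried without touching the network. The sketch below is only an illustration under stated assumptions: `tally_page` and `sample_rows` are hypothetical names with made-up data, not part of the script above, and each row is assumed to be a `(date, nrec)` pair already pulled out of one `div.r-ent` block.

```python
from typing import Iterable, Tuple


def tally_page(
    rows: Iterable[Tuple[str, str]], today_label: str, budget: int
) -> Tuple[int, int, int]:
    """Count today's posts and [爆] posts on one page (hypothetical helper).

    Returns (total_today, hot_today, remaining_budget); `budget` plays the
    role of the `times` variable in the crawler.
    """
    total = 0
    hot = 0
    for date_text, nrec_text in rows:
        if date_text != today_label:
            budget -= 1  # not from today: use up one unit of the budget
            continue
        total += 1
        if nrec_text == "爆":
            hot += 1
    return total, hot, budget


# Made-up rows for illustration only.
sample_rows = [
    (" 6/27", "爆"),
    (" 6/27", "12"),
    (" 6/26", "爆"),  # older post: not counted, costs one unit of budget
    (" 6/27", ""),
]
result = tally_page(sample_rows, " 6/27", budget=100)
print(result)  # (3, 1, 99)
```

Keeping the tally in a pure function like this makes it easy to check how the budget shrinks as older posts appear, before running it against the live site.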