# Python Web Crawler Marathon - Final Project

## AGENDA

* Project summary
* Implementation
* Results
* Conclusion

---

### 1. Project Summary

1. Topic
   * Pick one category of the Cupoy news service (e.g. trending, technology, business, ...) and use the crawling techniques covered in the course to collect the first 500 articles.
2. Goals
   * Basic:
     * Use the browser developer tools to observe how the site renders its news feed: is it a dynamic or a static site, and is there an API we can send requests to directly?
     * Based on that observation, crawl the site with requests / BeautifulSoup / Selenium as appropriate.
     * Organize the results into a pandas.DataFrame, compute summary statistics, and visualize them with matplotlib.pyplot.
   * Advanced:
     * Crawl the article bodies and segment them with the jieba package.
     * Compute term frequencies and extract keywords with TF-IDF.
     * Filter out stop words and rank the remaining terms by frequency.
     * Present the result as a word cloud with the WordCloud package.

---

### 2. Implementation

First, import the required packages:

```
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd
```

Because the news feed loads more items as you scroll, we use webdriver to scroll the page down repeatedly while collecting article links:

```
N = 500  # number of articles to collect

# Open the browser and load the target page
browser = webdriver.Chrome(executable_path='../chromedriver')
browser.get("https://www.cupoy.com/newsfeed/topicgrp/tech_tw")

start_time = time.time()
count = 0
articles_info = []
print('Articles so far: ', end='')
while count < N:
    html_source = browser.page_source
    soup = BeautifulSoup(html_source, 'html5lib')
    target = soup.find_all('a', class_='sc-jxGEyO')
    dummy = 0  # number of duplicates to subtract
    for d in target:
        article = {}
        article['title'] = d['title']
        article['url'] = d['href']
        article['origin'] = '/'.join(d['href'].split('/', 3)[:-1])  # home page of the news source
        if article not in articles_info:  # skip duplicates caused by re-parsing already-loaded items
            articles_info.append(article)
        else:
            dummy += 1
    count = count + len(target) - dummy  # corrected count of unique articles
    print(count, end=' ')
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # scroll down every five seconds until 500 article URLs are collected

browser.quit()

if len(articles_info) > N:  # keep only the first 500 entries
    articles_info = articles_info[:N]

end_time = time.time()
print('take time:', end_time - start_time, 's')
```
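The fixed five-second pause works, but it wastes time when a batch loads quickly and can miss content when it loads slowly. A more robust variant (a minimal sketch, not part of the original project; the function name and parameters are made up for illustration) stops when `document.body.scrollHeight` no longer grows:

```
import time

def scroll_until_stable(browser, pause=2, max_rounds=100):
    """Scroll an infinite-scroll page until its height stops growing."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the feed a moment to append the next batch
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was appended; stop scrolling
            break
        last_height = new_height
```

Combined with the collection loop above, this could replace the unconditional `time.sleep(5)` after every scroll.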
Check that 500 articles were collected, list their titles, and save everything to a CSV file:

```
for i, a in enumerate(articles_info, start=1):
    print(i, ' ', a['title'])

# Convert to a dataframe and save it
import os
os.getcwd()

df = pd.DataFrame(articles_info)
df.to_csv('C:/Users/vincentLee1231995/OneDrive/Documents/Personal/Crawling-in-60Days/Homework/final project/news_info.csv', index=False)

# Read the csv back
import pandas as pd
news_info = pd.read_csv('news_info.csv')
news_info.head()
```

Next, crawl the body of each article. We start with a single-threaded crawler (the multithreaded version appears in the "Problem Solving" section at the end) and wrap the code in a function:

```
content = []  # final results

def analysis(url_list, content):
    import requests
    from bs4 import BeautifulSoup
    import time
    import re
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'}
    for url in url_list:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        single_content = []  # cleaned-up text of one article
        paragraphs = []      # raw tags scraped from one article
        if soup.find('p'):
            paragraphs.append(soup.find_all('p'))
        if soup.find('section'):
            paragraphs.append(soup.find_all('section'))
        for pars in paragraphs:  # pars is the set of <p> tags, then the set of <section> tags
            for par in pars:
                text = re.sub(r'[\W]+', ' ', par.text)  # replace special characters (\n, \t, ...) with spaces
                single_content.append(text)
        content.append(single_content)
```

Call the function with the target URLs:

```
target = news_info['url']  # target urls
print(len(target))

import time
start = time.time()
analysis(target, content)  # call the function
end = time.time()
print('Done!', end - start, 's')
```

Store the results in a new list, add it as a new column of the original dataframe, and save:

```
news = []
for i in content:
    news.append(i)

news_info['content'] = news
news_info.to_csv('C:/Users/vincentLee1231995/OneDrive/Documents/Personal/Crawling-in-60Days/Homework/final project/news_contents.csv', index=False)
```

Next, reload the dataframe and run some basic statistics:

```
data = pd.read_csv('./news_contents.csv')
data.head()
```

Count articles by source site:

```
origin_state = data['origin'].value_counts().reset_index()
origin_state.columns = ['origin', 'count']
origin_state
```

Some source sites appear only a handful of times, so we group them under "Others":

```
def mapping_new_origin(origin, count):
    if count >= 5:
        return origin
    else:
        return 'Others'

new_origin = []
for i in range(len(origin_state)):
    new_origin.append(mapping_new_origin(origin_state.iloc[i, 0], origin_state.iloc[i, 1]))

origin_state['new_origin'] = new_origin  # store the regrouped result as a new column
origin_state
```

Recount with the new grouping and save the result as a new dataframe:

```
new_origin_state = origin_state.groupby(by='new_origin').sum().reset_index().sort_values(by='count', ascending=False).reset_index(drop=True)
new_origin_state
new_origin_state.to_csv('C:/Users/vincentLee1231995/OneDrive/Documents/Personal/Crawling-in-60Days/Homework/final project/new_origin_state.csv', index=False)
```

A pie chart shows the distribution more clearly:

```
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(25, 10))
plt.pie(new_origin_state['count'], labels=new_origin_state['new_origin'], autopct='%0.1f%%')
#plt.rcParams['font.sans-serif']=['FangSong']
plt.legend()
plt.title('新聞來源網站分布')
plt.show()
```

![](https://i.imgur.com/OFQNwxj.png)

Next, segment the article text and build a word cloud from the keywords:

```
import pandas as pd

data = pd.read_csv('./news_contents.csv')
data.head()

news = list(data['content'].values)
print(len(news), type(news))
```

Install and import the relevant packages:

```
!pip install jieba

import jieba
import jieba.analyse
```

Prepare the stop word list:

```
stopwords = []
with open('cn_stopwords.txt', 'r', encoding='utf-8') as f:  # stop word list downloaded from the web
    for line in f.readlines():
        stopwords.append(line.strip())
len(stopwords)
```

Segment with jieba:

```
print('Jobs just begin!')

remained_news = []
startTime = time.time()
for n in news:
    seg = jieba.cut(n, cut_all=False)  # default-mode segmentation; drop stop words, then join back into one string
    try:
        remained_news.append(''.join(list(filter(lambda a: a not in stopwords and a != '\n', seg))))
    except Exception as e:
        print(e)              # a 'nan' entry raises an error here
        print(news.index(n))  # print the index so the problematic article can be inspected
endTime = time.time()
print('Take time: ', endTime - startTime, 's')
print('All jobs Done!')
```

Keyword analysis (jieba's `extract_tags` ranks terms by TF-IDF internally):

```
keywords = []
startTime = time.time()
for n in remained_news:
    keywords.append(jieba.analyse.extract_tags(n, topK=20, withWeight=False))
for j in keywords:
    print(j, '\n')
endTime = time.time()
print('Take time: ', endTime - startTime, 's')
print('All jobs Done!')
```
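`extract_tags` scores terms against jieba's built-in IDF table. If you want TF-IDF weights computed over this specific 500-article corpus instead, a minimal sketch with scikit-learn (an extra dependency, not used in the original project) could look like this:

```
# Sketch: corpus-specific TF-IDF keywords with scikit-learn.
# Assumes `remained_news` is the list of segmented article strings from above;
# we tokenize with jieba and let TfidfVectorizer compute IDF over our own corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf = vectorizer.fit_transform(remained_news)  # shape: (n_articles, n_terms)
terms = vectorizer.get_feature_names_out()

# Top 20 terms of the first article, ranked by TF-IDF weight
row = tfidf[0].toarray().ravel()
top20 = row.argsort()[::-1][:20]
print([(terms[i], round(row[i], 3)) for i in top20 if row[i] > 0])
```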
Segmentation and keyword analysis give us two text sources. First, build a word cloud from the segmented text; as before, wrap the code in a function:

```
def plt_wordcloud(content):
    from wordcloud import WordCloud
    import jieba
    import matplotlib.pyplot as plt
    from PIL import Image
    import numpy as np
    %matplotlib inline

    words = jieba.cut(content, cut_all=False)
    all_words = ''  # the text passed to WordCloud
    for word in words:
        all_words += ' ' + word

    wcloud = WordCloud(width=500, height=500, background_color='white', mask=None,
                       min_font_size=8, font_path='C:/Windows/Fonts/kaiu.ttf').generate(all_words)
    plt.figure(figsize=(20, 10), facecolor=None)
    plt.imshow(wcloud)
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
```

Concatenate the segmented articles and call the function:

```
allnews = ''
for n in remained_news:
    allnews += n

plt_wordcloud(allnews)
```

![](https://i.imgur.com/GlXlAFw.png)

The word cloud renders successfully, but its content does not look closely related to the "technology" category, so we rebuild it from the extracted keywords instead:

```
keyword_text = ''
for lst in keywords:
    for item in lst:
        keyword_text += ' ' + item
print(len(keyword_text))
```

Wrap a simpler function (the imports are at module level this time, since the earlier ones were local to `plt_wordcloud`):

```
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def wordCloud(text):
    wcloud = WordCloud(width=500, height=500, background_color='white', mask=None,
                       min_font_size=8, font_path='C:/Windows/Fonts/msjh.ttc').generate(text)
    plt.figure(figsize=(20, 10), facecolor=None)
    plt.imshow(wcloud)
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
```

Call it:

```
wordCloud(keyword_text)
```

![](https://i.imgur.com/csf1Rnp.png)

---

### 3. Results

[Gist](https://gist.github.com/Steph958/e82b4e2eafdee1a95b98e1ac60909c3c)

---

### 4. Problem Solving

1. chromedriver version compatibility:

   > selenium error: `Message: This version of ChromeDriver only supports Chrome version xx`
   > * Solution:
   >   https://blog.csdn.net/qq_41605934/article/details/116330227
   >   http://npm.taobao.org/mirrors/chromedriver/

2. Collecting the article bodies single-threaded takes a long time, so we try a multithreaded version. Compared with the single-threaded crawler, the function needs two small adjustments:

   > * Add a thread index `num` plus `start`/`end` parameters, so each thread works on its own slice of the URL list.
   > * Requests that come too frequently can raise `requests.exceptions.ConnectionError`; adding `time.sleep(5)` at the end of each iteration throttles the crawler. See: https://blog.csdn.net/wancongconga/article/details/111030335

```
content = []

def analysis(url_list, content, num, start, end):
    import requests
    from bs4 import BeautifulSoup
    import time
    import re
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'}
    for url in url_list[start:end]:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        single_content = []
        paragraphs = []
        if soup.find('p'):
            paragraphs.append(soup.find_all('p'))
        if soup.find('section'):
            paragraphs.append(soup.find_all('section'))
        for pars in paragraphs:  # pars is the set of <p> tags, then the set of <section> tags
            for par in pars:
                text = re.sub(r'[\W]+', ' ', par.text)  # replace special characters with spaces
                single_content.append(text)
        content.append(single_content)
        time.sleep(5)  # throttle to avoid ConnectionError
```

Set the target URLs:

```
N = 500
n_thread = 10
target = news_info['url']
```

Set up the threads:

```
import threading

# Create 10 worker threads
threads = []
startTime = time.time()
for i in range(n_thread):
    start = int(N / n_thread) * i      # first index this job is responsible for
    end = int(N / n_thread) * (i + 1)  # one past the last index
    threads.append(threading.Thread(target=analysis, args=(target, content, i, start, end)))
    threads[i].start()

# Wait for all worker threads to finish
for i in range(n_thread):
    threads[i].join()

endTime = time.time()
print("Done.")
print('Take time: ', endTime - startTime, 's')
```

This saves roughly 130 seconds compared with the single-threaded run. Note that because all threads append to the shared `content` list, the results are no longer guaranteed to be in the same order as `target`.

3. Notes on the jieba package:

   > * jieba offers three segmentation modes: the default (accurate) mode, full mode, and search-engine mode; a short demo appears at the end of this note.
   > * The shape of the word cloud can be customized (for example via WordCloud's `mask` argument).

### 5. Conclusion

---

### 6. Author Information

1. Personal GitHub: [Github](https://github.com/Steph958)
2. Display name in the 100-day marathon program: Vincent_1231995
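As mentioned in the "Problem Solving" notes, jieba has three segmentation modes. A quick demo (a minimal sketch; the sample sentence is arbitrary):

```
# Demo of jieba's three segmentation modes.
import jieba

sentence = '我来到北京清华大学'

print('Default (accurate):', '/'.join(jieba.cut(sentence, cut_all=False)))
print('Full mode:         ', '/'.join(jieba.cut(sentence, cut_all=True)))
print('Search engine mode:', '/'.join(jieba.cut_for_search(sentence)))
```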
