# Python - Day 9
###### tags: `Python`
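## Web Crawler
### Part 1
1. Connect to a specific URL and fetch the data.
2. Parse the data and extract the part you actually want.
#### HTML format
1. Elements are identified by their `class` attribute (here, `class="list"`).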
```htmlembedded=
<html>
    <head>
        <title>HTML format</title>
    </head>
    <body>
        <div class="list">
            <span>Tree structure</span>
            <span>Tree structure</span>
        </div>
    </body>
</html>
```
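To make the tree structure concrete, here is a minimal sketch that walks a page like the one above and collects the text of each `<span>` inside `<div class="list">`. It uses the standard library's `html.parser` rather than BeautifulSoup, so it runs without installing anything. The class name `SpanCollector` and the sample text are made up for illustration, and the sketch does not handle nested `<div>` elements.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<html>
<head><title>HTML format</title></head>
<body>
<div class="list">
<span>first item</span>
<span>second item</span>
</div>
</body>
</html>
"""


class SpanCollector(HTMLParser):
    """Collect the text of every <span> found inside <div class="list">."""

    def __init__(self):
        super().__init__()
        self.in_list_div = False
        self.in_span = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "list") in attrs:
            self.in_list_div = True
        elif tag == "span" and self.in_list_div:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_list_div = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.items.append(data.strip())


collector = SpanCollector()
collector.feed(SAMPLE_HTML)
print(collector.items)
```

The parser only sees the spans because it first entered the `div` with the right class, which is the same parent-then-child lookup a crawler performs.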
#### Importing the external module BeautifulSoup4
1. Install the BeautifulSoup package.
```CommandLine=
pip install beautifulsoup4
```
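#### Simulating a real user (User-Agent, UA)
1. Websites use the User-Agent header to judge whether a request comes from a real person using a browser.
2. `root` holds the parsed HTML source of the page.
3. The `find` and `find_all` functions search the parsed page for specific content.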
```python=
import ssl
import urllib.request as req

import bs4

# Workaround for local certificate errors:
# this turns off HTTPS certificate verification for every urllib request.
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://www.ptt.cc/bbs/movie/index.html"
# Build a Request object that carries the request headers
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
})
# Pass the Request object (not the bare URL string) to urlopen
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")

root = bs4.BeautifulSoup(data, "html.parser")
titles = root.find_all("div", class_="title")  # every <div class="title">
for title in titles:
    if title.a is not None:  # deleted posts have no <a> tag, so skip them
        print(title.a.string)
```
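The reason the header matters can be seen without making any connection. When no User-Agent is supplied, `urllib` announces itself as `Python-urllib/<version>`, which a site can easily refuse. The sketch below compares that default with a browser-like value. The `example.com` URL is a placeholder, not a site from these notes.

```python
import urllib.request as req

# Headers urllib attaches by default when the caller supplies none.
default_headers = dict(req.build_opener().addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# Placeholder URL; this sketch never opens a connection.
url = "https://example.com/"
browser_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
)
request = req.Request(url, headers={"User-Agent": browser_ua})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```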
### Part 2
#### Cookie
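[What is a cookie? The key element that makes browsing smoother (article in Chinese)](https://www.waca.net/support/id/445)
1. A website stores a small piece of data in the user's browser.
2. On later connections, the browser sends it back in the request headers (the `Cookie` header).
3. It is kept in the browser as a small text file.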
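As a small illustration of what travels in that header, the sketch below parses a cookie string into name/value pairs and rebuilds it with the standard library's `http.cookies`. The header value is made up: `over18` is the cookie used for PTT later in these notes, and `theme` is purely hypothetical.

```python
from http.cookies import SimpleCookie

# Made-up Cookie header value: pairs are separated by "; ".
raw_header = "over18=1; theme=dark"

cookie = SimpleCookie()
cookie.load(raw_header)

# Each entry is a Morsel object; .value holds the cookie's value.
parsed = {name: morsel.value for name, morsel in cookie.items()}
print(parsed)

# Rebuild a single header value from the pairs.
header_value = "; ".join(f"{name}={value}" for name, value in parsed.items())
print(header_value)
```

Because all cookies share one header, several cookies are sent as one string rather than as several separate `Cookie` entries.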
#### Following page links
1. The `find` function in BS4 returns the first match only.
2. The `find_all` function in BS4 returns every match.
```python=
# Fragment: assumes `root` is the BeautifulSoup object for one page.
titles = root.find_all("div", class_="title")  # every <div class="title">
for title in titles:
    if title.a is not None:  # deleted posts have no <a> tag, so skip them
        print(title.a.string)
# The <a> tag whose text is "‹ 上頁" is the link to the previous page
nextLink = root.find("a", string="‹ 上頁")
print(nextLink["href"])  # href is the attribute that holds the link target
```
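The `href` taken from the link is a path, not a full address, so it has to be combined with the site address before it can be requested. One way to do that is `urllib.parse.urljoin`, sketched below. The page number in `href` is made up.

```python
from urllib.parse import urljoin

current_page = "https://www.ptt.cc/bbs/movie/index.html"
# Made-up href of the kind an <a> tag carries; the real page number will differ.
href = "/bbs/movie/index1234.html"

# urljoin resolves the path against the address of the page it came from.
next_url = urljoin(current_page, href)
print(next_url)  # https://www.ptt.cc/bbs/movie/index1234.html
```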
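#### Crawling consecutive pages
1. Wrap the page-reading code in a function that takes the URL as its parameter.
2. Call it repeatedly from the main program with a for/while loop.
3. Have the function `return` the link it found.
4. Pass that new URL back into the function.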
```python=
import ssl
import urllib.request as req

import bs4

# Workaround for local certificate errors:
# this turns off HTTPS certificate verification for every urllib request.
ssl._create_default_https_context = ssl._create_unverified_context


# Fetch one page, print its titles, and return the link to the previous page
def getData(url):
    # Build a Request object that carries the request headers
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
        "cookie": "over18=1",  # send the cookie that confirms the visitor is over 18
    })
    # Pass the Request object (not the bare URL string) to urlopen
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    titles = root.find_all("div", class_="title")  # every <div class="title">
    for title in titles:
        if title.a is not None:  # deleted posts have no <a> tag, so skip them
            print(title.a.string)
    # The <a> tag whose text is "‹ 上頁" is the link to the previous page
    nextLink = root.find("a", string="‹ 上頁")
    return nextLink["href"]  # href is the attribute that holds the link target


# Print the titles of 10 consecutive pages, starting from the newest one
pageURL = "https://www.ptt.cc/bbs/Gossiping/index.html"
for i in range(10):
    pageURL = "https://www.ptt.cc" + getData(pageURL)
```
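The four steps can also be sketched without any network access, which makes the loop itself easier to test. In the sketch below, `crawl`, `fetch_page`, and `fake_site` are hypothetical names: `fetch_page` stands in for a function like `getData`, and the loop also stops early when no further link is found, a case the code above does not handle.

```python
def crawl(start_url, fetch_page, max_pages):
    """Follow links for up to max_pages pages.

    fetch_page(url) must return the next URL, or None when there is none.
    Returns the list of URLs visited, in order.
    """
    visited = []
    url = start_url
    for _ in range(max_pages):
        visited.append(url)
        url = fetch_page(url)  # the returned link becomes the next input
        if url is None:  # no link found: stop instead of crashing
            break
    return visited


# Stand-in for a real site: each page maps to the page before it.
fake_site = {"page3": "page2", "page2": "page1", "page1": None}

all_pages = crawl("page3", fake_site.get, max_pages=10)
first_two = crawl("page3", fake_site.get, max_pages=2)
print(all_pages)
print(first_two)
```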
### Part 3
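#### Regular websites
1. The server returns HTML that already contains the data, all at once.
2. The page appears instantly.
#### AJAX
1. A front-end JavaScript technique.
2. The first response is an HTML page without the data; the data arrives in a second request.
3. The frame of the page appears first, and the data shows up after a short pause.
#### Checking how a site works
1. The whole page appears at once: a regular website.
2. The frame appears first and the data follows after a short pause: an AJAX website.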
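The same check can be made in code: if text you can see in the browser is missing from the HTML returned by the first request, the data is probably loaded later by JavaScript. This is only a rough heuristic, and both sample responses below are made up.

```python
def data_in_initial_html(html, visible_text):
    """Return True if text seen in the browser is already in the raw HTML."""
    return visible_text in html


# Made-up responses: one with the data inside, one that is only a frame.
server_rendered = "<html><body><div class='title'>Movie review</div></body></html>"
ajax_shell = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"

regular_result = data_in_initial_html(server_rendered, "Movie review")
ajax_result = data_in_initial_html(ajax_shell, "Movie review")
print(regular_result)  # True
print(ajax_result)  # False
```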
```python=
import json
import ssl
import urllib.request as req

# Workaround for local certificate errors:
# this turns off HTTPS certificate verification for every urllib request.
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://medium.com/_/api/home_feed"
# Build a Request object that carries the request headers
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
    # A dict keeps only one value per key, so all cookies go in one header, separated by "; "
    "cookie": "vary=enable_medium_app; _dd_s=rum=0&expire=1644246317617; __cfruid=3f75a8755b6ff096cdedc0fee3e51500debf7f2f-1644239763",
})
# Pass the Request object (not the bare URL string) to urlopen
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")

# Remove the non-JSON prefix first, then parse the rest into a dictionary
data = data.replace("])}while(1);</x>", "")
data = json.loads(data)
posts = data["payload"]["references"]["Post"]  # dictionary that holds the posts
for key in posts:  # loop over every post key
    post = posts[key]  # the post stored under that key
    print(post["title"])  # print the post's title
```
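Because the exact prefix in front of the JSON can change, a slightly more tolerant approach is to skip everything before the first `{` instead of replacing a fixed string. The sketch below shows that idea on a made-up response that only imitates the shape used above. `parse_prefixed_json` is a hypothetical helper name.

```python
import json


def parse_prefixed_json(text):
    """Drop anything before the first '{' and parse the rest as JSON."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in the response")
    return json.loads(text[start:])


# Made-up response: a guard prefix followed by a small JSON object.
sample = '])}while(1);</x>{"payload": {"references": {"Post": {"a1": {"title": "Hello"}}}}}'

data = parse_prefixed_json(sample)
titles = [post["title"] for post in data["payload"]["references"]["Post"].values()]
print(titles)
```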
### Part 4
#### Advanced AJAX approach: Request Data
1. Send some data along with the request.
2. The server only returns the content when it receives that data.
```python=
import json
import ssl
import urllib.request as req

# Workaround for local certificate errors:
# this turns off HTTPS certificate verification for every urllib request.
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://medium.com/_/graphql"
requestData = [{"operationName":"TopicHandlerHomeFeed","variables":{"topicSlug":"editors-picks","feedPagingOptions":{"limit":25,"to":"1644340994836"}},"query":"query TopicHandlerHomeFeed($topicSlug: ID!, $feedPagingOptions: PagingOptions) {\n topic(slug: $topicSlug) {\n ...TopicHandlerHomeFeed_topic\n __typename\n }\n}\n\nfragment TopicHandlerHomeFeed_topic on Topic {\n id\n name\n latestPosts(paging: $feedPagingOptions) {\n postPreviews {\n ...TopicHandlerHomeFeed_postPreview\n __typename\n }\n pagingInfo {\n next {\n limit\n to\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment TopicHandlerHomeFeed_postPreview on PostPreview {\n postId\n post {\n id\n ...HomeFeedItem_post\n __typename\n }\n __typename\n}\n\nfragment HomeFeedItem_post on Post {\n __typename\n id\n title\n firstPublishedAt\n mediumUrl\n collection {\n id\n name\n domain\n logo {\n id\n __typename\n }\n __typename\n }\n creator {\n id\n name\n username\n imageId\n mediumMemberAt\n __typename\n }\n previewImage {\n id\n __typename\n }\n previewContent {\n subtitle\n __typename\n }\n readingTime\n tags {\n ...TopicPill_tag\n __typename\n }\n ...BookmarkButton_post\n ...OverflowMenuButtonWithNegativeSignal_post\n ...PostPresentationTracker_post\n ...PostPreviewAvatar_post\n}\n\nfragment TopicPill_tag on Tag {\n __typename\n id\n displayTitle\n}\n\nfragment BookmarkButton_post on Post {\n visibility\n ...SusiClickable_post\n ...AddToCatalogBookmarkButton_post\n __typename\n id\n}\n\nfragment SusiClickable_post on Post {\n id\n mediumUrl\n ...SusiContainer_post\n __typename\n}\n\nfragment SusiContainer_post on Post {\n id\n __typename\n}\n\nfragment AddToCatalogBookmarkButton_post on Post {\n ...AddToCatalogBase_post\n __typename\n id\n}\n\nfragment AddToCatalogBase_post on Post {\n id\n viewerEdge {\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n catalogItemIds\n __typename\n }\n predefinedContainingThis {\n catalogId\n predefined\n 
catalogItemIds\n __typename\n }\n __typename\n }\n ...editCatalogItemsMutation_postViewerEdge\n ...useAddItemToPredefinedCatalog_postViewerEdge\n __typename\n id\n }\n ...WithToggleInsideCatalog_post\n __typename\n}\n\nfragment editCatalogItemsMutation_postViewerEdge on PostViewerEdge {\n id\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n version\n catalogItemIds\n __typename\n }\n predefinedContainingThis {\n catalogId\n predefined\n version\n catalogItemIds\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment useAddItemToPredefinedCatalog_postViewerEdge on PostViewerEdge {\n id\n catalogsConnection {\n predefinedContainingThis {\n catalogId\n version\n predefined\n catalogItemIds\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment WithToggleInsideCatalog_post on Post {\n id\n viewerEdge {\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n __typename\n }\n predefinedContainingThis {\n predefined\n __typename\n }\n __typename\n }\n __typename\n id\n }\n __typename\n}\n\nfragment OverflowMenuButtonWithNegativeSignal_post on Post {\n id\n ...OverflowMenuWithNegativeSignal_post\n ...CreatorActionOverflowPopover_post\n __typename\n}\n\nfragment OverflowMenuWithNegativeSignal_post on Post {\n id\n creator {\n id\n __typename\n }\n collection {\n id\n __typename\n }\n ...OverflowMenuItemUndoClaps_post\n __typename\n}\n\nfragment OverflowMenuItemUndoClaps_post on Post {\n id\n clapCount\n ...ClapMutation_post\n __typename\n}\n\nfragment ClapMutation_post on Post {\n __typename\n id\n clapCount\n ...MultiVoteCount_post\n}\n\nfragment MultiVoteCount_post on Post {\n id\n ...PostVotersNetwork_post\n __typename\n}\n\nfragment PostVotersNetwork_post on Post {\n id\n voterCount\n recommenders {\n name\n __typename\n }\n __typename\n}\n\nfragment CreatorActionOverflowPopover_post on Post {\n allowResponses\n id\n statusForCollection\n isLocked\n isPublished\n clapCount\n mediumUrl\n pinnedAt\n 
pinnedByCreatorAt\n curationEligibleAt\n mediumUrl\n responseDistribution\n visibility\n ...useIsPinnedInContext_post\n pendingCollection {\n id\n name\n creator {\n id\n __typename\n }\n avatar {\n id\n __typename\n }\n domain\n slug\n __typename\n }\n creator {\n id\n ...MutePopoverOptions_creator\n ...auroraHooks_publisher\n __typename\n }\n collection {\n id\n name\n creator {\n id\n __typename\n }\n avatar {\n id\n __typename\n }\n domain\n slug\n ...MutePopoverOptions_collection\n ...auroraHooks_publisher\n __typename\n }\n ...NewsletterV3EmailToSubscribersMenuItem_post\n ...OverflowMenuItemUndoClaps_post\n __typename\n}\n\nfragment useIsPinnedInContext_post on Post {\n id\n collection {\n id\n __typename\n }\n pendingCollection {\n id\n __typename\n }\n pinnedAt\n pinnedByCreatorAt\n __typename\n}\n\nfragment MutePopoverOptions_creator on User {\n id\n __typename\n}\n\nfragment auroraHooks_publisher on Publisher {\n __typename\n ... on Collection {\n isAuroraEligible\n isAuroraVisible\n viewerEdge {\n id\n isEditor\n __typename\n }\n __typename\n id\n }\n ... 
on User {\n isAuroraVisible\n __typename\n id\n }\n}\n\nfragment MutePopoverOptions_collection on Collection {\n id\n __typename\n}\n\nfragment NewsletterV3EmailToSubscribersMenuItem_post on Post {\n id\n creator {\n id\n newsletterV3 {\n id\n subscribersCount\n __typename\n }\n __typename\n }\n isNewsletter\n isAuthorNewsletter\n __typename\n}\n\nfragment PostPresentationTracker_post on Post {\n id\n visibility\n previewContent {\n isFullContent\n __typename\n }\n collection {\n id\n slug\n __typename\n }\n __typename\n}\n\nfragment PostPreviewAvatar_post on Post {\n __typename\n id\n collection {\n id\n name\n ...CollectionAvatar_collection\n __typename\n }\n creator {\n id\n username\n name\n ...UserAvatar_user\n ...userUrl_user\n __typename\n }\n}\n\nfragment CollectionAvatar_collection on Collection {\n name\n avatar {\n id\n __typename\n }\n ...collectionUrl_collection\n __typename\n id\n}\n\nfragment collectionUrl_collection on Collection {\n id\n domain\n slug\n __typename\n}\n\nfragment UserAvatar_user on User {\n __typename\n id\n imageId\n mediumMemberAt\n name\n username\n ...userUrl_user\n}\n\nfragment userUrl_user on User {\n __typename\n id\n customDomainState {\n live {\n domain\n __typename\n }\n __typename\n }\n hasSubdomain\n username\n}\n"}]
# Send data along with the request.
# requestData is a list holding one dictionary; it must be converted
# to JSON text and then encoded as UTF-8 bytes.
request = req.Request(url, headers={
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36"
}, data=json.dumps(requestData).encode("utf-8"))
with req.urlopen(request) as response:
    result = response.read().decode("utf-8")
# Parse the response string into Python lists and dictionaries
result = json.loads(result)
# Follow the fixed path down to the list of post previews
items = result[0]["data"]["topic"]["latestPosts"]["postPreviews"]
# Print the title of every returned post (the query asks for up to 25)
for item in items:
    print(item["post"]["title"])
```
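What attaching data does to the request can be checked without contacting any server. In the sketch below, the URL and the payload are placeholders, not Medium's real query. It shows that the data must be bytes, and that `urllib` switches the request from GET to POST once data is present.

```python
import json
import urllib.request as req

# Placeholder endpoint and payload; this sketch never opens a connection.
url = "https://example.com/graphql"
payload = [{"operationName": "Example", "variables": {"limit": 25}}]

# Python list/dict -> JSON text -> UTF-8 bytes
body = json.dumps(payload).encode("utf-8")
post_request = req.Request(url, data=body, headers={"Content-Type": "application/json"})
get_request = req.Request(url)

print(get_request.get_method())   # GET: no data attached
print(post_request.get_method())  # POST: data attached

# Decoding the body gives back the original structure.
round_trip = json.loads(post_request.data.decode("utf-8"))
print(round_trip == payload)
```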