# Python - Day 9

###### tags: `Python`

## Web Crawler

### Part 1

1. Connect to a specific URL and fetch the data.
2. Parse the data and extract the part you actually want.

#### HTML format

1. Tags are nested in a tree, and an element can be identified by its `class` attribute (for example, the `div` whose class is `list`).

```htmlembedded=
<html>
    <head>
        <title>HTML format</title>
    </head>
    <body>
        <div class="list">
            <span>Tree structure</span>
            <span>Tree structure</span>
        </div>
    </body>
</html>
```

#### Importing an external module: BeautifulSoup4

1. Install the BeautifulSoup package.

```CommandLine=
pip install beautifulsoup4
```

#### Imitating a real user (User-Agent, UA)

1. Websites use the User-Agent header to judge whether a request comes from a real person using a browser.
2. `root` holds the parsed HTML source of the page.
3. The `find` functions search the parsed tree for specific content.

```python=
import urllib.request as req
import ssl
import bs4

# Skip HTTPS certificate verification (a workaround for local certificate errors; it turns off a security check)
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://www.ptt.cc/bbs/movie/index.html"

# Build a Request object and attach the Request Headers
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
})

# Pass the Request object, not the bare url, to urlopen
with req.urlopen(request) as response:
    data = response.read().decode("UTF-8")

root = bs4.BeautifulSoup(data, "html.parser")
titles = root.find_all("div", class_="title")  # find the div tags whose class is "title"
for title in titles:
    if title.a is not None:  # the title has an a tag (the post was not deleted), so print it
        print(title.a.string)
```

### Part 2

#### Cookie

[What is a cookie? The key element that makes browsing smoother](https://www.waca.net/support/id/445)

1. The website stores a small piece of content in the user's browser.
2. On each connection, that content is sent along in the Request Headers.
3. It is a small text file kept in the browser.

#### Following page links

1. The `find` function in BS4 returns a single match.
2. The `find_all` function in BS4 returns every match.

```python=
# Excerpt from the body of getData(), which is shown in full in the next section.
# It prints the titles of one page and returns the link to the previous page.
# root is the BeautifulSoup object for the current page.
titles = root.find_all("div", class_="title")  # find the div tags whose class is "title"
for title in titles:
    if title.a is not None:  # the title has an a tag (the post was not deleted), so print it
        print(title.a.string)

# Find the a tag whose text is "‹ 上頁" (the link to the previous page)
nextLink = root.find("a", string="‹ 上頁")
return nextLink["href"]  # href is the attribute name
```

#### Crawling pages in sequence

1. Wrap the page-reading code in a function that takes the URL as its parameter.
2. The main program calls it repeatedly in a for/while loop.
3. The function hands back the link to follow with `return`.
4. Pass the new URL into the function again.

```python=
import urllib.request as req
import ssl
import bs4

# Skip HTTPS certificate verification (a workaround for local certificate errors; it turns off a security check)
ssl._create_default_https_context = ssl._create_unverified_context


def getData(url):
    """Print the titles on one page and return the href of the previous page."""
    # Build a Request object and attach the Request Headers
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
        "cookie": "over18=1"  # send the cookie that answers the "are you over 18" check
    })

    # Pass the Request object, not the bare url, to urlopen
    with req.urlopen(request) as response:
        data = response.read().decode("UTF-8")

    root = bs4.BeautifulSoup(data, "html.parser")
    titles = root.find_all("div", class_="title")  # find the div tags whose class is "title"
    for title in titles:
        if title.a is not None:  # the title has an a tag (the post was not deleted), so print it
            print(title.a.string)

    # Find the a tag whose text is "‹ 上頁" (the link to the previous page)
    nextLink = root.find("a", string="‹ 上頁")
    return nextLink["href"]  # href is the attribute name


pageURL = "https://www.ptt.cc/bbs/Gossiping/index.html"
for i in range(10):
    # The returned href is relative, so prepend the site root
    pageURL = "https://www.ptt.cc" + getData(pageURL)
```

### Part 3

#### Regular websites

1. The server returns HTML that already contains the data in a single response.
2. The page appears all at once.

#### AJAX

1. A front-end JavaScript technique.
2. The first response is an HTML page without the data; the data only arrives with a second request.
3. The page frame appears first, and after a short pause the data shows up.

#### Checking how a site works

1. The page appears all at once: a regular website.
2. The frame appears first, and after a short pause the data shows up: AJAX.

```python=
import urllib.request as req
import json
import ssl

# Skip HTTPS certificate verification (a workaround for local certificate errors; it turns off a security check)
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://medium.com/_/api/home_feed"

# Build a Request object and attach the Request Headers
request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
    # A dict keeps only the last value of a repeated key, so all cookies go into one header
    "cookie": "; ".join([
        "vary=enable_medium_app",
        "_dd_s=rum=0&expire=1644246317617",
        "__cfruid=3f75a8755b6ff096cdedc0fee3e51500debf7f2f-1644239763",
    ])
})

# Pass the Request object, not the bare url, to urlopen
with req.urlopen(request) as response:
    data = response.read().decode("UTF-8")

# Strip the non-JSON prefix first; json.loads fails if it is still there
data = data.replace("])}while(1);</x>", "")
data = json.loads(data)

posts = data["payload"]["references"]["Post"]  # the dict that holds the posts
for key in posts:  # loop over every post
    post = posts[key]  # take out one post
    print(post["title"])  # print the title field of that post
```

### Part 4

#### Advanced AJAX approach: Request Data

1. Send some data along with the request.
2. The server only returns its response when it receives that data.

```python=
import urllib.request as req
import json
import ssl

# Skip HTTPS certificate verification (a workaround for local certificate errors; it turns off a security check)
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://medium.com/_/graphql"
requestData = [{"operationName":"TopicHandlerHomeFeed","variables":{"topicSlug":"editors-picks","feedPagingOptions":{"limit":25,"to":"1644340994836"}},"query":"query TopicHandlerHomeFeed($topicSlug: ID!, $feedPagingOptions: PagingOptions) {\n topic(slug: $topicSlug) {\n ...TopicHandlerHomeFeed_topic\n __typename\n }\n}\n\nfragment TopicHandlerHomeFeed_topic on Topic {\n id\n name\n latestPosts(paging: $feedPagingOptions) {\n postPreviews {\n ...TopicHandlerHomeFeed_postPreview\n __typename\n }\n pagingInfo {\n next {\n limit\n to\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment TopicHandlerHomeFeed_postPreview on PostPreview {\n postId\n post {\n id\n ...HomeFeedItem_post\n __typename\n }\n __typename\n}\n\nfragment HomeFeedItem_post on Post {\n __typename\n id\n title\n firstPublishedAt\n mediumUrl\n collection {\n id\n name\n domain\n logo {\n id\n __typename\n }\n __typename\n }\n creator {\n id\n name\n username\n imageId\n mediumMemberAt\n __typename\n }\n previewImage {\n id\n __typename\n }\n previewContent {\n subtitle\n __typename\n }\n readingTime\n tags {\n ...TopicPill_tag\n __typename\n }\n ...BookmarkButton_post\n ...OverflowMenuButtonWithNegativeSignal_post\n ...PostPresentationTracker_post\n ...PostPreviewAvatar_post\n}\n\nfragment TopicPill_tag on Tag {\n __typename\n id\n displayTitle\n}\n\nfragment BookmarkButton_post on Post {\n visibility\n ...SusiClickable_post\n ...AddToCatalogBookmarkButton_post\n __typename\n id\n}\n\nfragment SusiClickable_post on Post {\n id\n mediumUrl\n ...SusiContainer_post\n __typename\n}\n\nfragment SusiContainer_post on Post {\n id\n __typename\n}\n\nfragment AddToCatalogBookmarkButton_post on Post {\n ...AddToCatalogBase_post\n __typename\n id\n}\n\nfragment AddToCatalogBase_post on Post {\n id\n viewerEdge {\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n catalogItemIds\n __typename\n }\n predefinedContainingThis {\n catalogId\n predefined\n catalogItemIds\n __typename\n }\n __typename\n }\n ...editCatalogItemsMutation_postViewerEdge\n ...useAddItemToPredefinedCatalog_postViewerEdge\n __typename\n id\n }\n ...WithToggleInsideCatalog_post\n __typename\n}\n\nfragment editCatalogItemsMutation_postViewerEdge on PostViewerEdge {\n id\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n version\n catalogItemIds\n __typename\n }\n predefinedContainingThis {\n catalogId\n predefined\n version\n catalogItemIds\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment useAddItemToPredefinedCatalog_postViewerEdge on PostViewerEdge {\n id\n catalogsConnection {\n predefinedContainingThis {\n catalogId\n version\n predefined\n catalogItemIds\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment WithToggleInsideCatalog_post on Post {\n id\n viewerEdge {\n catalogsConnection {\n catalogsContainingThis(type: LISTS) {\n catalogId\n __typename\n }\n predefinedContainingThis {\n predefined\n __typename\n }\n __typename\n }\n __typename\n id\n }\n __typename\n}\n\nfragment OverflowMenuButtonWithNegativeSignal_post on Post {\n id\n ...OverflowMenuWithNegativeSignal_post\n ...CreatorActionOverflowPopover_post\n __typename\n}\n\nfragment OverflowMenuWithNegativeSignal_post on Post {\n id\n creator {\n id\n __typename\n }\n collection {\n id\n __typename\n }\n ...OverflowMenuItemUndoClaps_post\n __typename\n}\n\nfragment OverflowMenuItemUndoClaps_post on Post {\n id\n clapCount\n ...ClapMutation_post\n __typename\n}\n\nfragment ClapMutation_post on Post {\n __typename\n id\n clapCount\n ...MultiVoteCount_post\n}\n\nfragment MultiVoteCount_post on Post {\n id\n ...PostVotersNetwork_post\n __typename\n}\n\nfragment PostVotersNetwork_post on Post {\n id\n voterCount\n recommenders {\n name\n __typename\n }\n __typename\n}\n\nfragment CreatorActionOverflowPopover_post on Post {\n allowResponses\n id\n statusForCollection\n isLocked\n isPublished\n clapCount\n mediumUrl\n pinnedAt\n pinnedByCreatorAt\n curationEligibleAt\n mediumUrl\n responseDistribution\n visibility\n ...useIsPinnedInContext_post\n pendingCollection {\n id\n name\n creator {\n id\n __typename\n }\n avatar {\n id\n __typename\n }\n domain\n slug\n __typename\n }\n creator {\n id\n ...MutePopoverOptions_creator\n ...auroraHooks_publisher\n __typename\n }\n collection {\n id\n name\n creator {\n id\n __typename\n }\n avatar {\n id\n __typename\n }\n domain\n slug\n ...MutePopoverOptions_collection\n ...auroraHooks_publisher\n __typename\n }\n ...NewsletterV3EmailToSubscribersMenuItem_post\n ...OverflowMenuItemUndoClaps_post\n __typename\n}\n\nfragment useIsPinnedInContext_post on Post {\n id\n collection {\n id\n __typename\n }\n pendingCollection {\n id\n __typename\n }\n pinnedAt\n pinnedByCreatorAt\n __typename\n}\n\nfragment MutePopoverOptions_creator on User {\n id\n __typename\n}\n\nfragment auroraHooks_publisher on Publisher {\n __typename\n ... on Collection {\n isAuroraEligible\n isAuroraVisible\n viewerEdge {\n id\n isEditor\n __typename\n }\n __typename\n id\n }\n ... on User {\n isAuroraVisible\n __typename\n id\n }\n}\n\nfragment MutePopoverOptions_collection on Collection {\n id\n __typename\n}\n\nfragment NewsletterV3EmailToSubscribersMenuItem_post on Post {\n id\n creator {\n id\n newsletterV3 {\n id\n subscribersCount\n __typename\n }\n __typename\n }\n isNewsletter\n isAuthorNewsletter\n __typename\n}\n\nfragment PostPresentationTracker_post on Post {\n id\n visibility\n previewContent {\n isFullContent\n __typename\n }\n collection {\n id\n slug\n __typename\n }\n __typename\n}\n\nfragment PostPreviewAvatar_post on Post {\n __typename\n id\n collection {\n id\n name\n ...CollectionAvatar_collection\n __typename\n }\n creator {\n id\n username\n name\n ...UserAvatar_user\n ...userUrl_user\n __typename\n }\n}\n\nfragment CollectionAvatar_collection on Collection {\n name\n avatar {\n id\n __typename\n }\n ...collectionUrl_collection\n __typename\n id\n}\n\nfragment collectionUrl_collection on Collection {\n id\n domain\n slug\n __typename\n}\n\nfragment UserAvatar_user on User {\n __typename\n id\n imageId\n mediumMemberAt\n name\n username\n ...userUrl_user\n}\n\nfragment userUrl_user on User {\n __typename\n id\n customDomainState {\n live {\n domain\n __typename\n }\n __typename\n }\n hasSubdomain\n username\n}\n"}]

# Send the data to the server along with the request.
# requestData is a list holding one dict; it must be serialized to JSON and encoded as UTF-8.
request = req.Request(url, headers={
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36"
}, data=json.dumps(requestData).encode("UTF-8"))

with req.urlopen(request) as response:
    result = response.read().decode("UTF-8")

result = json.loads(result)  # parse the string result into Python lists and dicts with the json module
items = result[0]["data"]["topic"]["latestPosts"]["postPreviews"]  # the fixed path down to the list of post previews
for item in items:
    print(item["post"]["title"])  # print the title of each post in the list
```
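
The two steps around the network call in Part 4 (serializing the request data into the request body, then walking the decoded response down to the titles) can be tried offline. The following is only a minimal sketch under assumptions: `buildJsonRequest`, `extractTitles`, the `example.com` URL, and the sample request and response are hypothetical and do not come from the notes above; only the key path `[0]["data"]["topic"]["latestPosts"]["postPreviews"]` mirrors the Part 4 code.

```python
import json
import urllib.request as req


def buildJsonRequest(url, requestData, userAgent):
    """Build a request whose body is requestData serialized as UTF-8 JSON."""
    body = json.dumps(requestData).encode("UTF-8")
    return req.Request(
        url,
        headers={"Content-Type": "application/json", "User-Agent": userAgent},
        data=body,
    )


def extractTitles(result):
    """Follow the fixed key path to the post previews and collect the titles."""
    items = result[0]["data"]["topic"]["latestPosts"]["postPreviews"]
    return [item["post"]["title"] for item in items]


# Hypothetical request data and response text, made up for this offline check.
sampleRequestData = [{"operationName": "Example", "variables": {"limit": 2}}]
sampleResponse = json.dumps([{"data": {"topic": {"latestPosts": {"postPreviews": [
    {"post": {"title": "First post"}},
    {"post": {"title": "Second post"}},
]}}}}])

request = buildJsonRequest("https://example.com/graphql", sampleRequestData, "Mozilla/5.0")
print(request.get_method())  # POST, because the request carries a data body
print(json.loads(request.data.decode("UTF-8")) == sampleRequestData)  # True

titles = extractTitles(json.loads(sampleResponse))
print(titles)  # ['First post', 'Second post']
```

Nothing is sent here, because `urlopen` is never called. Keeping the request building and the parsing in separate functions makes it possible to check each step without depending on the live site.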