爬蟲 - HackMD

# 爬蟲 ![](https://i.imgur.com/RxzJfCd.jpg) ```python= import requests import webbrowser address=input('請輸入地址:') webbrowser.open('http://www.google.com.tw/maps/search/'+address) ``` 請輸入地址:高雄火車站 True request→get(取得網頁) webbrowser→open(開啟網頁) urlopen→open(取得網頁) ```python= #GET請求 #普通單傳的網頁,只需要用GET就可以直接下載 #透過GET取得伺服器狀態回應碼 import requests #使用GET下載 r=requests.get('http://www.google.com.tw/') print(r) #伺服器的回應碼 print(r.status_code) #檢查伺服器狀態 if r.status_code==requests.codes.ok: print('ok') ``` <Response [200]> 200 ok ```python= import requests r=requests.get('http://www.grandtech.info') print(type(r)) ``` <class 'requests.models.Response'> ```python= import requests r=requests.get('http://www.grandtech.info') print(type(r)) if r.status_code==requests.codes.ok: print('取得網頁內容成功') else: print('取的網業類容失敗') ``` <class 'requests.models.Response'> 取得網頁內容成功 ```python= import requests print('開啟搜尋網頁') param={'wd':'yahoo購物'}#搜尋的信息"網址?wd=" r=requests.get('https://www.baidu.com/s',params=param) webbrowser.open(r.url) ``` 開啟搜尋網頁 True 這個是百度的 ```python= import requests print('開啟搜尋網頁') param={'q':'yahoo購物'}#搜尋信息"網址?q=" r=requests.get('https://www.google.com/search',params=param) webbrowser.open(r.url) ``` 開啟搜尋網頁 True 這個是GOOGLE的 ```python= from urllib.request import urlopen from bs4 import BeautifulSoup html=urlopen('http://www.pythonscraping.com/pages/page1.html') bsObj=BeautifulSoup(html) print(type(bsObj)) print(bsObj.h1) print(bsObj.h1.text) ``` <class 'bs4.BeautifulSoup'> <h1>An Interesting Title</h1> An Interesting Title ```python= from bs4 import BeautifulSoup from urllib.request import urlopen html=urlopen('https://morvanzhou.github.io/static/scraping/basic-structure.html').read().decode('utf-8') soup=BeautifulSoup(html,'html.parser') print(type(soup)) print(soup.h1.text) print('\n',soup.p.text) #透過find_all抓取連結 all_href=soup.find_all('a') print(all_href) for i in all_href: print(i['href']) ``` ![](https://i.imgur.com/IsJvZ9Y.png) ```python= from bs4 import BeautifulSoup from urllib.request import urlopen html=urlopen('http://www.pythonscraping.com/pages/warandpeace.html') bsObj=BeautifulSoup(html,'html.parser') #找出為紅色字的 namelist1=bsObj.find_all('span',{"class":"red"}) print(type(namelist1)) for name in namelist1: print(name.get_text()) ``` <class 'bs4.element.ResultSet'> Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don't tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist- I really believe he is Antichrist- I will have nothing more to do with you and you are no longer my friend, no longer my 'faithful slave,' as you call yourself! But how do you do? I see I have frightened you- sit down and tell me all the news. If you have nothing better to do, Count [or Prince], and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10- Annette Scherer. Heavens! what a virulent attack! First of all, dear friend, tell me how you are. Set your friend's mind at rest, Can one be well while suffering morally? Can one be calm in times like these if one has any feeling? You are staying the whole evening, I hope? And the fete at the English ambassador's? Today is Wednesday. I must put in an appearance there, My daughter is coming for me to take me there. I thought today's fete had been canceled. I confess all these festivities and fireworks are becoming wearisome. If they had known that you wished it, the entertainment would have been put off, Don't tease! Well, and what has been decided about Novosiltsev's dispatch? You know everything. What can one say about it? What has been decided? They have decided that Buonaparte has burnt his boats, and I believe that we are ready to burn ours. Oh, don't speak to me of Austria. Perhaps I don't understand things, but Austria never has wished, and does not wish, for war. She is betraying us! Russia alone must save Europe. Our gracious sovereign recognizes his high vocation and will be true to it. That is the one thing I have faith in! Our good and wonderful sovereign has to perform the noblest role on earth, and he is so virtuous and noble that God will not forsake him. He will fulfill his vocation and crush the hydra of revolution, which has become more terrible than ever in the person of this murderer and villain! We alone must avenge the blood of the just one.... Whom, I ask you, can we rely on?... England with her commercial spirit will not and cannot understand the Emperor Alexander's loftiness of soul. She has refused to evacuate Malta. She wanted to find, and still seeks, some secret motive in our actions. What answer did Novosiltsev get? None. The English have not understood and cannot understand the self-abnegation of our Emperor who wants nothing for himself, but only desires the good of mankind. And what have they promised? Nothing! And what little they have promised they will not perform! Prussia has always declared that Buonaparte is invincible, and that all Europe is powerless before him.... And I don't believe a word that Hardenburg says, or Haugwitz either. This famous Prussian neutrality is just a trap. I have faith only in God and the lofty destiny of our adored monarch. He will save Europe! I think, that if you had been sent instead of our dear Wintzingerode you would have captured the King of Prussia's consent by assault. You are so eloquent. Will you give me a cup of tea? In a moment. A propos, I am expecting two very interesting men tonight, le Vicomte de Mortemart, who is connected with the Montmorencys through the Rohans, one of the best French families. He is one of the genuine emigres, the good ones. And also the Abbe Morio. Do you know that profound thinker? He has been received by the Emperor. Had you heard? I shall be delighted to meet them, But tell me, is it true that the Dowager Empress wants Baron Funke to be appointed first secretary at Vienna? The baron by all accounts is a poor creature. Baron Funke has been recommended to the Dowager Empress by her sister, Now about your family. Do you know that since your daughter came out everyone has been enraptured by her? They say she is amazingly beautiful. I often think, I often think how unfairly sometimes the joys of life are distributed. Why has fate given you two such splendid children? I don't speak of Anatole, your youngest. I don't like him, Two such charming children. And really you appreciate them less than anyone, and so you don't deserve to have them. I can't help it, Lavater would have said I lack the bump of paternity. Don't joke; I mean to have a serious talk with you. Do you know I am dissatisfied with your younger son? Between ourselves he was mentioned at Her Majesty's and you were pitied.... What would you have me do? You know I did all a father could for their education, and they have both turned out fools. Hippolyte is at least a quiet fool, but Anatole is an active one. That is the only difference between them. And why are children born to such men as you? If you were not a father there would be nothing I could reproach you with, I am your faithful slave and to you alone I can confess that my children are the bane of my life. It is the cross I have to bear. That is how I explain it to myself. It can't be helped! ```python= #找出為綠色字的 namelist2=bsObj.find_all('span',{'class':'green'}) for i in namelist2: print(i.get_text()) ``` Anna Pavlovna Scherer Empress Marya Fedorovna Prince Vasili Kuragin Anna Pavlovna St. Petersburg the prince Anna Pavlovna Anna Pavlovna the prince the prince the prince Prince Vasili Anna Pavlovna Anna Pavlovna the prince Wintzingerode King of Prussia le Vicomte de Mortemart Montmorencys Rohans Abbe Morio the Emperor the prince Prince Vasili Dowager Empress Marya Fedorovna the baron Anna Pavlovna the Empress the Empress Anna Pavlovna's Her Majesty Baron Funke The prince Anna Pavlovna the Empress The prince Anatole the prince The prince Anna Pavlovna Anna Pavlovna ```python= #印出h1,h2 t=bsObj.find_all({'h1','h2'}) print(t) ``` [<h1>War and Peace</h1>, <h2>Chapter 1</h2>] ```python= #找出紅色與綠色的字 s=bsObj.find_all('span',{'class':['red','green']}) s ``` find_all抓取參數的方式 ![](https://i.imgur.com/2faXQZ1.jpg) ```python= #以樂透網頁作範例 import bs4,requests url='https://www.taiwanlottery.com.tw/' html=requests.get(url) #用get方式取網頁 print(html.raise_for_status()) #驗證網頁是否回應200 objSoup=BeautifulSoup(html.text,'html.parser') #透過select()去選出.contents_box02 dataTag=objSoup.select('.contents_box02') print('串列長度:',len(dataTag)) #列印出contents_box02的串列 for i in range(len(dataTag)): print(dataTag[i]) ``` None 串列長度: 4 <div class="contents_box02"> <div id="contents_logo_02"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/9 第109000020期 </span><span class="font_red14"><a href="Result_all.aspx#01">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序<br/>第二區</div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">35 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">35 </div><div class="ball_red">01 </div> </div> <div class="contents_box02"> <div id="contents_logo_03"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/9 第109000020期 </span><span class="font_red14"><a href="Result_all.aspx#07">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序</div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">35 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">35 </div> </div> <div class="contents_box02"> <div id="contents_logo_04"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/10 第109000028期 </span><span class="font_red14"><a href="Result_all.aspx#02">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序<br/>特別號</div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_red">42 </div> </div> <div class="contents_box02"> <div id="contents_logo_05"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/10 第109000028期 </span><span class="font_red14"><a href="Result_all.aspx#08">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序</div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div> </div> ```python= #找出開出的順序 #綠球 balls=dataTag[0].find_all('div',{'class':'ball_tx ball_green'}) print('開出順序',end=' ') for i in range(6): print(balls[i].text,end=' ') ``` 開出順序 27 18 30 19 22 35 ```python= #找出開出的順序 #黃球 balls=dataTag[2].find_all('div',{'class':'ball_tx ball_yellow'}) print('開出順序',end=' ') for i in range(6): print(balls[i].text,end=' ') ``` 開出順序 01 12 34 37 04 25 ```python= #找出大小順序 #綠球 balls=dataTag[0].find_all('div',{'class':'ball_tx ball_green'}) print('大小順序',end=' ') for i in range(6,len(balls)): print(balls[i].text,end=' ') ``` 大小順序 18 19 22 27 30 35 ```python= #找出大小順序 #黃球 balls=dataTag[2].find_all('div',{'class':'ball_tx ball_yellow'}) print('大小順序',end=' ') for i in range(6,len(balls)): print(balls[i].text,end=' ') ``` 大小順序 01 04 12 25 34 37 ```python= #威力彩第二區 special=dataTag[0].find_all('div',{'class':'ball_red'}) print('第二區:',special[0].text,end=' ') ``` 第二區: 01 ```python= #大樂透特別號 special=dataTag[2].find_all('div',{'class':'ball_red'}) print('特別號:',special[0].text,end=' ') ``` 特別號: 42 ```python= from bs4 import BeautifulSoup from urllib.request import urlopen url='https://morvanzhou.github.io/static/scraping/list.html' html=requests.get(url) soup=BeautifulSoup(html.text,'html.parser') month=soup.find_all('li',{'class':'month'}) for i in month: print(i.text,end=' ') ``` 一月二月三月四月五月 ```python= from bs4 import BeautifulSoup from urllib.request import urlopen html=urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon').read().decode('utf-8') bsObj=BeautifulSoup(html,features='lxml') network=bsObj.find_all('a') for i in network: print(i.get_text()) ``` ![](https://i.imgur.com/vXStqa0.png) ```python= from bs4 import BeautifulSoup from urllib.request import urlopen import re html=urlopen('https://morvanzhou.github.io/static/scraping/table.html').read().decode('utf-8') soup=BeautifulSoup(html,'html.parser') img_links=soup.find_all('img',{'src':re.compile('.*?\.jpg')}) for link in img_links: print(link['src']) print('\n') course_link=soup.find_all('a',{'href':re.compile('https://morvanzhou.*')}) for link in course_link: print(link['href']) ``` ![](https://i.imgur.com/RLjUeaH.png) ```python= import requests import re from bs4 import BeautifulSoup def main(): resp=requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/blog/blog.html') soup=BeautifulSoup(resp.text,'html.parser') #找出所有'h'開頭的標題文字 titles=soup.find_all(['h1','h2','h3','h4','h5','h6']) for title in titles: print(title.text.strip()) #用正則表達式找出所有'h'開頭 for title in soup.find_all(re.compile('h[1-6]')): print(title.text.strip()) #找出所有.png結尾的圖片 imgs=soup.find_all('img') for img in imgs: if 'src' in img.attrs: if img['src'].endswith('.png'): print(img['src']) #透過政則表達式 for img in soup.find_all('img',{'src':re.compile('\.png$')}): print(img['src']) if __name__=='__main__': main() ``` Python教學文章開發環境設定 Mac使用者資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析 Python教學文章開發環境設定 Mac使用者資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析資料科學給初學者的 Python 網頁爬蟲與資料分析 static/python-for-beginners.png static/python_crawler.png static/python_crawler.png static/python_crawler.png static/python_crawler.png static/python_crawler.png static/python-for-beginners.png static/python_crawler.png static/python_crawler.png static/python_crawler.png static/python_crawler.png static/python_crawler.png ```python= from bs4 import BeautifulSoup import requests import re url='http://www.deepstone.com.tw/' try: htmlfile=requests.get(url) print('下載成功') except Exception as error: #error是系統自訂的錯誤訊息 print('網頁原始碼下載失敗:%s'%error) #儲存檢視的內容 fn='output.txt' with open(fn,'wb') as fileObj: #以二進位儲存 for diskStorage in htmlfile.iter_content(10240):#Response物件處理 size=fileObj.write(diskStorage)#Response物件寫入 print(size)#列出每次寫入大小 print('以%s儲存網頁成功'%fn) ``` 下載成功 10240 10240 7961 以output.txt儲存網頁成功 ```python= from bs4 import BeautifulSoup import requests url='https://taqm.epa.gov.tw/pm25/tw/PM25A.aspx?area=1' html=requests.get(url) sp=BeautifulSoup(html.text,'html.parser') print(sp.select('title')[0].text.strip())#在body裡面網頁的標題 print(sp.find('span',{'id':'ctl08_labText1'}).text.strip()) print(sp.find('a',{'href':'HourlyData.aspx'}).get('title').strip()) rs=sp.find_all('tr',{'align':'center','style':'border-width:1px;border-style:Solid;'}) for r in rs: name=r.find('a').text.strip() pm25=r.find_all('span') print(name,end=' ') for p in pm25: print(p.text.strip(),end=' ') ``` 行政院環保署-細懸浮微粒發布時間：2020/03/17 21:00 資料下載富貴角 16 18 萬里 20 14 淡水 20 21 林口 28 24 三重 31 32 菜寮 23 22 汐止 24 26 新莊 30 27 永和 27 26 板橋 32 26 土城 37 26 新店 29 29 陽明 15 18 士林 26 23 大同 29 26 中山 32 32 松山 27 24 萬華 25 21 古亭 27 23 基隆 22 21 大園 25 21 觀音 20 19 桃園 33 28 平鎮 36 30 中壢 36 26 龍潭 43 34 ```python= from requests import get from bs4 import BeautifulSoup import pandas as pd from time import sleep from random import randint from time import time url='https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1' respone=get(url) print(respone.text[:300]) html_soup= BeautifulSoup(respone.text,'html.parser') type(html_soup) movie_containers=html_soup.find_all('div',class_='lister-item mode-advanced') movie_containers=html_soup.find_all('div',{'class':'lister-item mode-advanced'}) print(type(movie_containers)) print(len(movie_containers)) first_movie=movie_containers[0] print('電影名稱:',first_movie.h3.a.text) first_year=first_movie.h3.find('span',{'class':'lister-item-year text-muted unbold'}) first_year=first_year.text[1:5] print('發行年分:',first_year) first_imDb=float(first_movie.strong.text) print(type(first_imDb)) print('電影評分:',first_imDb) first_metascore=first_movie.find('span',class_='metascore favorable') first_metascore=int(first_metascore.text) print('Metascore評分:',first_metascore) first_votes=first_movie.find('span',attrs={'name':'nv'}) print(first_votes.text) print('投票數:'+first_votes['data-value'])#int ``` ![](https://i.imgur.com/e2nCCd0.png) ![](https://i.imgur.com/lNZuPZG.png) 電影名稱: 羅根發行年分: 2017 <class 'float'> 電影評分: 8.1 Metascore評分: 77 604,359 投票數:604359 ```python= names=[] years=[] imdb_ratings=[] metascores=[] votes=[] for container in movie_containers: #內容具Meatscore時才會提取 if container.find('div',class_='ratings-metascore')is not None: #The name name=container.h3.a.text names.append(name) #The year year=container.h3.find('span',class_='lister-item-year').text years.append(year) #The IMDB rating imdb=float(container.strong.text) imdb_ratings.append(imdb) #The metascore m_score=container.find('span',class_='metascore').text metascores.append(int(m_score)) #The number of votes vote=container.find('span',attrs={'name':'nv'})['data-value'] votes.append(vote) test_df=pd.DataFrame({'movie':names,'year':years,'imdb':imdb_ratings,'metascore':metascores,'votes':votes}) print(test_df.info) test_df['year'].unique() test_df.loc[:,'year']=test_df['year'].str[-5:-1].astype(int) test_df.to_csv('movie_ratings.csv',encoding='utf=8-sig') ``` ```python= import requests from bs4 import BeautifulSoup def main(): print('蘋果今日焦點') dom=requests.get('https://tw.appledaily.com/hot/daily').text soup=BeautifulSoup(dom,'html.parser') for ele in soup.find('ul','all').find_all('li'): print(ele.find('div','aht_title_num').text,ele.find('div','aht_title').text) if __name__=='__main__': main() ``` 蘋果今日焦點 01 最年幼 4歲童確診再爆26例累計直逼200「離封城還非常遠」 02 44歲劉真天上起舞搶救45天不治辛龍吻別「撕心裂肺」 03 劉真後事龍巖董座出手幫辛龍淚覓塔位 04 港媒踢爆中國確診數造假逾4萬人無症狀感染竟剔除治癒者復陽就醫也遭拒收 05 劉真粉紅美裳飛仙雙雙節留永恆印記辛龍慟約來世夫妻 06 真辛夫妻6年沒吵過「我回來了」甜嗓絕響 07 肺炎改名去歧視？國際合作才有解（張傳賢） 08 德國才宣布新規禁2人以上集會接觸中鏢醫師梅克爾採檢陰性 09 《女人我最大》結緣藍心湄大哭中斷錄影 10 【Video Talk】主動脈瓣膜狹窄存活率2年不到5成 11 劉真共舞留口紅印華仔遺憾緣滅 12 明起第二輪網購口罩縮短為1周就可取貨 13 辛龍寵妻給驚喜飛法國買包當生日禮 14 紐約疫情慘重確診逾萬死亡近百 15 Ella同理人母放不下蕭亞軒打氣辛龍 16 劉真撒手對岸也心碎 10億觸擊熱搜首位 17 劉真留憾來不及陪4歲愛女長大 18 大Ｓ鬥鞋惺惺相惜難忘搶贏舞后5分鐘 19 20年國標搭檔像兄妹李志堯「美麗永停留」 20 林熙蕾泣未見奇蹟關穎同擁4歲女鼻酸 21 動刀置換主動脈瓣膜葉克膜開腦仍救不回 22 中研院群聚擴大德籍男傳女友 23 南韓驚爆性剝削聊天室主嫌遭起底為學霸付費會員達26萬逾70名女性淪性奴 24 曾馨瑩Pizza廣告交心哭嘆劉真幸福太短 25 蘋中信：病毒改變了我們的未來生活（王浩威） 26 【疫情最前線】美12歲童染疫靠機器呼吸家人po警世文 27 人渣文本專欄：你相信中國的病例數？（周偉航） 28 吳宗憲和霓霓通話哽咽憂辛龍走不出 29 鞋痴想開鞋子博物館時尚咖不手軟「保值傳家」 30 【45天病程解析】定情日變忌日！北榮證實劉真腦壓過高 22日22時22分離世 ```python= import requests from bs4 import BeautifulSoup resp=requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/table/table.html') soup=BeautifulSoup(resp.text,'html.parser') prices=[] rows=soup.find('table','table').tbody.find_all('tr') print(len(rows)) for row in rows: price=row.find_all('td')[2].text prices.append(int(price)) print('均價:',sum(prices)/len(prices)) ``` 6 均價: 1823.3333333333333 ```python= rows=soup.find('table','table').tbody.find_all('tr') for row in rows: all_tds=row.find_all('td') if 'href' in all_tds[3].a.attrs: href=all_tds[3].a['href'] else: href=None print(all_tds[0].text,all_tds[1].text,all_tds[2].text,href,all_tds[3].a.img['src']) ``` 初心者 - Python入門初學者 1490 http://www.pycone.com img/python-logo.png Python 網頁爬蟲入門實戰有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png Python 機器學習入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png Python 資料科學入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png Python 資料視覺化入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png Python 網站架設入門實戰 (預計) 有程式基礎的初學者 1890 None img/python-logo.png ```python= #attrs屬性 html='<div data-name="123">foo</div>' soup=BeautifulSoup(html,'html.parser') data_tag=soup.find(attrs={"data-name":"123"}) print(data_tag) ``` <div data-name="123">foo</div> ```python= import requests import re from bs4 import BeautifulSoup Y_MOVIE_URL='https://movies.yahoo.com.tw/movie_thisweek.html' def get_web_page(url): resp=requests.get(url) if resp.status_code!=200: print('Invalid url',resp.url) return None else: return resp.text def get_date(date_str): #e.g的上映日期:2017-03-23 pattern='\d+-\d+-\d+' match=re.search(pattern,date_str) if match!=None: return date_str else: return match.group(0) def get_movie_id(url): try: movie_id=url.split('.html')[0].split('-')[-1] except: movie_id=url return movie_id def get_movies(dom): soup=BeautifulSoup(dom,'html.parser') movies=[] rows=soup.find_all('div','release_info_text') for row in rows: movie=dict() movie['expectation']=row.find('div','leveltext').span.text.strip() movie['ch_name']=row.find('div','release_movie_name').a.text.strip() movie['eng_name']=row.find('div','release_movie_name').find('div','en').a.text.strip() movie['movie-id']=get_movie_id(row.find('div','release_movie_name').a['href']) movie['poster_url']=row.parent.find_previous_sibling('div','release_foto').a.img['src'] movie['release_date']=get_date(row.find('div','release_movie_time').text) movies.append(movie) return movies if __name__=='__main__': page=get_web_page(Y_MOVIE_URL) movies=get_movies(page) for movie in movies: print(movie) #回家作業:如何從裡面撈資料出來?????? ``` ![](https://i.imgur.com/pgjhJkr.png) ```python= names = [] years = [] imdb_ratings = [] metascores = [] votes = [] pages=[str(i) for i in range(1,5)] years_url=[str(i) for i in range(2010,2019)] requests=0 for year_url in years_url: for page in pages: response = get('http://www.imdb.com/search/title?release_date=' + year_url + '&sort=num_votes,desc&page=' + page) sleep(randint(3,5)) requests+=1 if requests>72: warn('!!!') break page_html=BeautifulSoup(response.text,'html.parser') mv_containers=page_html.find_all('div',class_='lister-item mode-advanced') for container in mv_containers: if container.find('div',class_='ratings-metascore') is not None: name=container.h3.a.text names.append(name) year=container.h3.find('span',class_='lister-item-year').text years.append(year) imdb=float(container.strong.text) imdb_ratings.append(imdb) m_score=container.find('span',class_='metascore').text metascores.append(m_score) vote=container.find('span',attrs={'name':'nv'})['data-value'] votes.append(int(vote)) movie_ratings=pd.DataFrame({'movie':names,'year':years,'imdb':imdb_ratings,'metascore':metascores,'votes':votes}) print(movie_ratings.info()) movie_ratings.head(10) ``` ![](https://i.imgur.com/CWRXhxN.png) ```python= movie_ratings['year'].unique() ``` array(['(2010)', '(I) (2010)', '(2011)', '(I) (2011)', '(2012)', '(I) (2012)', '(2013)', '(I) (2013)', '(2014)', '(I) (2014)', '(II) (2014)', '(2015)', '(I) (2015)', '(II) (2015)', '(2016)', '(II) (2016)', '(I) (2016)', '(IX) (2016)', '(2017)', '(I) (2017)', '(2018)', '(I) (2018)', '(III) (2018)'], dtype=object) ```python= movie_ratings.loc[:,'year']=movie_ratings['year'].str[-5:-1].astype(int) movie_ratings['year'].head(3) ``` 0 2010 1 2010 2 2010 Name: year, dtype: int32 ```python= movie_ratings.to_csv('movie_ratings') movie_ratings.head(10) ``` ![](https://i.imgur.com/7wbzmFV.png) ```python= import requests from bs4 import BeautifulSoup def append_list_pm25(): url = 'https://taqm.epa.gov.tw/pm25/tw/PM25A.aspx?area=1' html = requests.get(url) sp = BeautifulSoup(html.text, 'html.parser') rs = sp.find_all("tr", {"align": "center", "style": "border-width:1px;border-style:Solid;"}) for r in rs: name = r.find('a') pm25 = r.find_all('span') dic = {} dic.setdefault('name', name.text.strip()) dic.setdefault('pm25', pm25[0].text.strip()) dic.setdefault('pm25_1', pm25[1].text.strip()) list.append(dic) def get_pm25(name): for d in list: if d.get('name') == name: return d list=[] append_list_pm25() print(list) name=input('請輸入地區?(例如:林口,桃園):') d=get_pm25(name) print(d) print(d.get('pm25')) ``` [{'name': '富貴角', 'pm25': '15', 'pm25_1': '18'}, {'name': '萬里', 'pm25': '', 'pm25_1': '21'}, {'name': '淡水', 'pm25': '21', 'pm25_1': '23'}, {'name': '林口', 'pm25': '36', 'pm25_1': '30'}, {'name': '三重', 'pm25': '29', 'pm25_1': '29'}, {'name': '菜寮', 'pm25': '26', 'pm25_1': '25'}, {'name': '汐止', 'pm25': '21', 'pm25_1': '22'}, {'name': '新莊', 'pm25': '28', 'pm25_1': '27'}, {'name': '永和', 'pm25': '24', 'pm25_1': '24'}, {'name': '板橋', 'pm25': '18', 'pm25_1': '23'}, {'name': '土城', 'pm25': '34', 'pm25_1': '20'}, {'name': '新店', 'pm25': '21', 'pm25_1': '25'}, {'name': '陽明', 'pm25': '28', 'pm25_1': '26'}, {'name': '士林', 'pm25': '17', 'pm25_1': '24'}, {'name': '大同', 'pm25': '21', 'pm25_1': '22'}, {'name': '中山', 'pm25': '26', 'pm25_1': '29'}, {'name': '松山', 'pm25': '28', 'pm25_1': '25'}, {'name': '萬華', 'pm25': '23', 'pm25_1': '30'}, {'name': '古亭', 'pm25': '29', 'pm25_1': '26'}, {'name': '基隆', 'pm25': '16', 'pm25_1': '13'}, {'name': '大園', 'pm25': '27', 'pm25_1': '31'}, {'name': '觀音', 'pm25': '19', 'pm25_1': '22'}, {'name': '桃園', 'pm25': '37', 'pm25_1': '33'}, {'name': '平鎮', 'pm25': '37', 'pm25_1': '35'}, {'name': '中壢', 'pm25': '36', 'pm25_1': '42'}, {'name': '龍潭', 'pm25': '34', 'pm25_1': '40'}] 請輸入地區?(例如:林口,桃園):三重 {'name': '三重', 'pm25': '29', 'pm25_1': '29'} 29 ```python= import requests from bs4 import BeautifulSoup F_FINANCE_URL='https://www.google.com/search?q=' def get_web_page(url,stock_id): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) ' 'Chrome/66.0.3359.181 Safari/537.36'} resp=requests.get(url+stock_id,headers=headers) if resp.status_code!=200: print('找不到網頁:',resp.url) return None else: return resp.text ``` ```python= import requests from bs4 import BeautifulSoup G_FINANCE_URL='https://www.google.com/search?q=' def get_web_page(url,stock_id): #瀏覽器請求,有些網路多家請求會報錯 headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) ' 'Chrome/66.0.3359.181 Safari/537.36'} resp=requests.get(url+stock_id,headers=headers) if resp.status_code!=200: print('網頁錯誤') return None else: return resp.text def get_stock_info(dom): soup=BeautifulSoup(dom,'html.parser') stock=dict() sections=soup.find_all('g-card-section') stock['name']=sections[1].div.text spans=sections[1].find_all('div',recursive=False)[1].find_all('span',recursive=False) stock['current_price']=spans[0].text stock['current_change']=spans[1].text for table in sections[3].find_all('table'): for tr in table.find_all('tr')[:3]: key=tr.find_all('td')[0].text.lower().strip() value=tr.find_all('td')[1].text.strip() stock[key]=value return stock if __name__=='__main__': page=get_web_page(G_FINANCE_URL,'TPE:2330') if page: stock=get_stock_info(page) for k,v in stock.items(): print(k,v) ``` name 台灣積體電路製造TPE: 2330 current_price 已收盤: 3月31日下午1:30 [GMT+8] · current_change 免責聲明開盤 273.00 最高 274.00 最低 269.50 殖利率 3.47% 上次收盤價 267.50 52 週高點 346.00 https://acg.gamer.com.tw/billboard.php?p=ANIME&t=3&period=all 回家作業 ```python= import requests import re import random from bs4 import BeautifulSoup target_url='https://gas.goodlife.tw/' rs=requests.session() res=rs.get(target_url,verify=False) res.encoding='utf-8' soup=BeautifulSoup(res.text,'html.parser') title=soup.select('#main')[0].text.replace('\n','').split('(')[0] gas_price=soup.select('#gas-price')[0].text.replace('\n\n\n','').replace(' ','') cpc=soup.select('#cpc')[0].text.replace(' ','') corrtent='{}\n{}{}'.format(title,gas_price,cpc) print(corrtent) ``` 最後更新時間: 2020-03-31 20:20 柴油預計調整: -1.1元下週一2020年04月06日起,預計汽油每公升: 降0.9元今日中油油價 92: 18.2 95油價: 19.7 98: 21.7 柴油: 15.4 C:\ProgramData\Anaconda3\envs\work\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'gas.goodlife.tw'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning, ```python= import requests import re import random from bs4 import BeautifulSoup target_url='https://disp.cc/b/PttHot' rs=requests.session() res=rs.get(target_url,verify=False) soup=BeautifulSoup(res.text,'html.parser') content='' for data in soup.select('#list div.row2 div span.listTitle'): title=data.text link='http://disp.cc/b/'+data.find('a')['href'] content+='{}\n{}\n\n'.format(title,link) print(content) ``` ■ [新聞] 博恩「我從國小到高中都被強X！」　重訓健身背後藏瘡疤： http://disp.cc/b/796-cdVP ■ [政治] 川普兒子發明「責任輪盤」? http://disp.cc/b/796-cdVS ■ [新聞] 中國真的零確診？「黃安躲在台灣」打臉中共 http://disp.cc/b/796-cdVX ■ Re: [影音] 博恩被強姦的故事 http://disp.cc/b/796-cdWf ■ [問卦] 網咖櫃檯穿這樣我怎麼專心發廢文 http://disp.cc/b/796-cdYj ■ [新聞] 讚大陸防疫得宜！WHO專家：全世界都欠武漢人民一次 http://disp.cc/b/796-c86x ■ [問卦] 同事為了月子中心跟老婆吵架 http://disp.cc/b/796-cdF4 ■ [新聞] 無視禁足令！21歲金髮妹炫耀「我才不會感染」　2天後中… http://disp.cc/b/796-cdGk ■ Re: [新聞] 快訊／美股崩盤！道瓊暴跌近900點 http://disp.cc/b/796-cdKO ■ [新聞] 海邊情侶搭帳篷作愛全都錄里長證實在漁 http://disp.cc/b/796-cdWU ■ [新聞] 不只博恩！呱吉揭童年慘事「小4被強暴」　地點對象全曝光 http://disp.cc/b/796-cdYd ■ [新聞] 開槍不慎打死女列被告民眾打爆電話力挺警察 http://disp.cc/b/796-cdYI ■ 日本宣佈「大鎖國」 http://disp.cc/b/796-ce3b ■ [新聞] 王子首相都確診！英國超氣譙中國「蓋牌」：疫情後一定算帳 http://disp.cc/b/796-ce3o ■ [問卦] 志村健在日本地位相當於台灣哪位藝人？ http://disp.cc/b/796-ce3q ■ [新聞] 啦啦隊女神爬山穿這樣！深U內衣迸出超兇碗公奶 http://disp.cc/b/796-ce3Z ■ [爆卦] 湖南火車出軌最新現場畫面 http://disp.cc/b/796-ce4K ■ [政治] 郝伯村三總逝世 http://disp.cc/b/796-ce55 ■ [自動轉寄] [爆卦] 掛記者電話的官員被WHO從網站除名 http://disp.cc/b/796-ce83 □ [公告] PTT 熱門文章板開板成功 http://disp.cc/b/796-59l9 C:\ProgramData\Anaconda3\envs\work\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'disp.cc'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning, ```python= import requests from bs4 import BeautifulSoup headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'} url='https://www.youtube.com/' resp=requests.get(url,headers=headers) soup=BeautifulSoup(resp.text,'html.parser') target=soup.find_all('a') txt=open('video-title.txt','w',encoding='utf-8') for i in target: f=i.get_text().strip() txt.write(f) txt.write('\n') txt.close() ``` ![](https://i.imgur.com/7HmMxjS.png) ```python= import requests from bs4 import BeautifulSoup url='https://tw.yahoo.com/' resp=requests.get(url) #確認下載是否成功 if resp.status_code==requests.codes.ok: soup=BeautifulSoup(resp.text,'html.parser') storiers=soup.find_all('a',class_='story-title') #print(storiers) for s in storiers: #新聞標題 print('標題:'+s.text) #網址 print('網址:'+s.get('href')) ``` 標題:境外病例不斷「陸恐打延長賽」網址:https://tw.news.yahoo.com/%E5%96%AE%E6%97%A5%E7%A2%BA%E8%A8%BA%E7%A0%B4%E7%99%BE-%E9%99%B8%E6%81%90%E6%89%93%E5%BB%B6%E9%95%B7%E8%B3%BD-165116884.html 標題:華航改名？招牌就值10億美元網址:https://tw.news.yahoo.com/%E8%8F%AF%E8%88%AA%E6%8B%9B%E7%89%8C%E5%80%BC10%E5%84%84%E7%BE%8E%E5%85%83-%E6%94%B9%E5%90%8D%E8%88%AA%E6%AC%8A%E8%AE%8A%E6%95%B8%E5%A4%A7-225614359.html 標題:聚餐返家一夜未醒上校猝死網址:https://tw.news.yahoo.com/%E5%BE%8C%E6%8C%87%E9%83%A8%E5%89%AF%E6%97%85%E9%95%B7%E8%81%9A%E9%A4%90%E8%BF%94%E5%AE%B6%E5%BE%8C%E7%8C%9D%E6%AD%BB%E9%81%B2%E6%9C%AA%E8%B5%B7%E5%BA%8A%E5%AE%B6%E5%B1%AC%E7%99%BC%E7%8F%BE%E8%BA%AB%E9%AB%94%E5%B7%B2%E5%86%B0%E5%86%B7-084556825.html 標題:台商拒繳百萬罰金 9土地全查封網址:https://tw.news.yahoo.com/%E5%8F%B0%E5%95%86%E9%81%95%E5%8F%8D%E6%AA%A2%E7%96%AB%E6%8B%92%E7%B9%B3100%E8%90%AC-%E9%81%AD%E6%9F%A5%E5%B0%819%E7%AD%86%E5%9C%9F%E5%9C%B0%E9%99%90%E5%88%B6%E5%87%BA%E5%A2%83-075701529.html 標題:2道禁令看教部怎輕賤大學自治網址:https://tw.news.yahoo.com/%E5%BE%9E%E5%85%A9%E9%81%93%E7%A6%81%E4%BB%A4%E7%9C%8B%E6%95%99%E8%82%B2%E9%83%A8%E5%A6%82%E4%BD%95%E8%BC%95%E8%B3%A4%E5%A4%A7%E5%AD%B8%E8%87%AA%E6%B2%BB-024840204.html 標題:日職史上最強洋將台灣人上榜網址:https://tw.news.yahoo.com/%E6%97%A5%E8%81%B7-12%E7%90%83%E5%9C%98%E5%8F%B2%E4%B8%8A%E6%9C%80%E5%BC%B7%E6%B4%8B%E5%B0%87-%E9%83%AD%E6%BA%90%E6%B2%BB%E5%94%AF-%E5%8F%B0%E7%81%A3%E4%BB%A3%E8%A1%A8-070130874.html 標題:難忘對決柯瑞弟：守哥哥很怪網址:https://tw.news.yahoo.com/nba-%E8%87%AA%E6%9B%9D%E7%94%9F%E6%B6%AF%E6%9C%80%E9%9B%A3%E5%BF%98%E6%99%82%E5%88%BB-%E6%9F%AF%E7%91%9E%E5%BC%9F-%E9%98%B2%E5%AE%88%E5%93%A5%E5%93%A5%E7%9C%9F%E7%9A%84%E5%BE%88%E6%80%AA-040106396.html 標題:還沒開打韓職先談美國轉播權網址:https://tw.sports.yahoo.com/news/%E9%9F%93%E8%81%B7-%E9%82%84%E6%B2%92%E9%96%8B%E6%89%93-%E5%82%B3espn%E6%8E%A5%E6%B4%BD%E6%B5%B7%E5%A4%96%E8%BD%89%E6%92%AD%E6%AC%8A-034855692.html 標題:老爹好暖心送永久季票給醫護網址:https://tw.sports.yahoo.com/news/mlb-%E6%AD%90%E6%8F%90%E8%8C%B2%E9%A9%9A%E5%96%9C%E5%AE%A2%E4%B8%B2-%E4%BB%A3%E8%A1%A8%E7%B4%85%E8%A5%AA%E9%80%81%E6%B0%B8%E4%B9%85%E5%AD%A3%E7%A5%A8%E7%B5%A6%E9%86%AB%E8%AD%B7-031551366.html 標題:難啟齒毛病媽媽靠這3招來救網址:https://tw.sports.yahoo.com/news/%E9%9B%A3%E4%BB%A5%E5%95%9F%E9%BD%92-%E5%AA%BD%E5%AA%BD%E5%80%91%E6%9C%80%E5%B0%B7%E5%B0%AC%E7%9A%84%E6%AF%9B%E7%97%85-%E9%9D%A0%E9%80%993%E6%8B%9B%E5%BE%92%E6%89%8B%E9%81%8B%E5%8B%95%E6%94%B9%E5%96%84-044242146.html 標題:新戲剛殺青金秀賢宣布當媽了網址:https://tw.news.yahoo.com/35%E6%AD%B2%E9%87%91%E7%A7%80%E8%B3%A2-%E5%81%9A%E4%BA%BA%E6%88%90%E5%8A%9F-%E6%87%B7%E5%AD%9515%E9%80%B1%E5%95%A6-062548460.html 標題:曾入圍金馬高盟傑涉毒被起訴網址:https://tw.news.yahoo.com/%E9%AB%98%E7%9B%9F%E5%82%91%E8%B2%A9%E6%AF%92%E5%B9%B4%E5%88%9D%E7%8D%B2%E4%B8%8D%E8%B5%B7%E8%A8%B4-%E7%BA%8C%E6%9F%A5%E9%80%86%E8%BD%89%E6%81%90%E9%9D%A2%E5%B0%8D%E7%84%A1%E6%9C%9F%E5%BE%92%E5%88%91%E9%87%8D%E5%88%91-062751731.html 標題:大兒子露臉林志穎全家顏值高網址:https://tw.news.yahoo.com/%E6%9E%97%E5%BF%97%E7%A9%8E-%E5%AE%B6%E5%85%A8%E9%83%BD%E9%AB%98%E9%A1%8F%E5%80%BC-%E5%A4%A7%E5%85%92%E5%AD%90kimi%E7%BD%95%E8%A6%8B%E9%9C%B2%E6%AD%A3%E8%87%89-050848260.html 標題:染疫痊癒湯姆漢克斯平頭復工網址:https://movies.yahoo.com.tw/article/%E6%B9%AF%E5%A7%86%E6%BC%A2%E5%85%8B%E6%96%AF%E8%82%BA%E7%82%8E%E7%97%8A%E7%99%92-%E5%B9%B3%E9%A0%AD%E5%BE%A9%E5%B7%A5%E5%B9%BD%E9%BB%98%E4%B8%BB%E6%8C%81-130927398.html 標題:防疫宅在家最夯影集前5曝光網址:https://movies.yahoo.com.tw/article/%E7%AC%AC%E4%B8%80%E5%90%8D%E4%B8%8D%E6%98%AF%E9%99%B0%E5%B1%8D%E8%B7%AF%E9%98%B2%E7%96%AB%E6%9C%9F%E9%96%93%E6%9C%80%E5%A4%AF%E5%BD%B1%E9%9B%86-top-5-%E6%A6%9C%E5%96%AE%E5%85%AC%E9%96%8B-113057654.html 標題:背50kg一包狂奔他爽拿3.8萬網址:https://tw.news.yahoo.com/%E8%8B%B1%E5%9C%8B%E6%80%AA%E6%AF%94%E8%B3%BD-%E6%8F%B9%E7%85%A4%E7%82%AD%E8%B7%AF%E8%B7%91%E7%B4%AF%E5%A3%9E%E5%8F%83%E8%B3%BD%E8%80%85-084241809.html 標題:CP值最低科系法律系竟然上榜網址:https://tw.news.yahoo.com/%E5%93%AA%E5%80%8B%E7%A7%91%E7%B3%BBcp%E5%80%BC%E6%9C%80%E4%BD%8E-%E7%B6%B2%E7%8B%82%E6%8E%A8%E9%80%99%E7%B3%BB-%E6%88%90%E6%9C%AC%E9%AB%98%E9%9B%A3%E5%9B%9E%E6%94%B6-083700140.html 標題:空拍見詭異1棟廢棄30年原因曝網址:https://tw.travel.yahoo.com/news/%E6%88%91%E6%98%AF%E8%AA%B0%E6%88%91%E5%9C%A8%E5%93%AA%E8%AE%93%E4%BA%BA%E5%BD%B7%E5%BD%BF%E7%BD%AE%E8%BA%AB%E5%B9%B3%E8%A1%8C%E5%AE%87%E5%AE%99%E7%9A%84%E5%9C%B0%E9%BB%9E-040024452.html 標題:排隊人龍沒停過名氣傳到日本網址:https://tw.travel.yahoo.com/news/%E6%99%AF%E7%BE%8E%E4%BA%BA%E6%B0%A3%E6%8E%92%E9%9A%8A%E6%97%A9%E9%A4%90%E5%BA%97%E5%A4%A7%E4%BB%BD%E9%87%8F%E7%87%92%E9%A4%85%E6%B2%B9%E6%A2%9D%E8%83%A1%E6%A4%92%E9%A4%85%E8%9B%8B%E9%A4%85%E9%83%BD%E5%A5%BD%E5%90%83-152519874.html 標題:醫護包超美口罩萬人歪樓暴動網址:https://tw.news.yahoo.com/%E9%86%AB%E8%AD%B7%E5%8C%85%E8%B6%85%E7%BE%8E-%E4%B8%B9%E5%AF%A7%E8%89%B2%E5%8F%A3%E7%BD%A9-1-5%E8%90%AC%E4%BA%BA%E6%AD%AA%E6%A8%93%E6%9A%B4%E5%8B%95-%E7%AB%8B%E5%88%BB%E5%8E%BB%E6%8E%92-232001283.html 標題:尪破產欠14億！劉濤花4年還清網址:https://tw.style.yahoo.com/%E8%80%81%E5%85%AC%E7%A0%B4%E7%94%A2%E7%A9%8D%E6%AC%A0%E4%B8%8A%E5%84%84-%E5%8A%89%E6%BF%A44%E5%B9%B425%E9%83%A8%E6%88%B2%E9%82%84%E5%82%B5%E5%8B%99-%E4%B8%88%E5%A4%AB%E5%BF%83%E7%96%BC-%E5%AB%81%E5%85%A5%E8%B1%AA%E9%96%80%E5%8D%BB%E6%B2%92%E4%BA%AB%E7%A6%8F-000000099.html 標題:木村嫂17年前母女照網大讚了網址:https://tw.style.yahoo.com/%E6%9C%A8%E6%9D%91%E5%85%89%E5%B8%8C%E7%99%BC-17-%E5%B9%B4%E5%89%8D%E8%88%87%E6%AF%8D%E5%90%8C%E6%A1%86%E8%80%81%E7%85%A7%E7%89%87%E7%B6%B2%E6%8F%AD%E5%B7%A5%E8%97%A4%E9%9D%9C%E9%A6%99%E6%9C%80%E5%A4%A7%E8%AE%8A%E5%8C%96-015020820.html 標題:22歲身家300億全球最年輕富豪網址:https://tw.style.yahoo.com/22%E6%AD%B2-%E7%A4%BE%E7%BE%A4%E5%A5%B3%E7%8E%8B-%E5%87%B1%E8%8E%89%E7%8F%8D%E5%A8%9C%E9%80%A3%E5%85%A9%E5%B9%B4%E7%99%BB%E4%B8%8A-%E5%85%A8%E7%90%83%E6%9C%80%E5%B9%B4%E8%BC%95%E5%AF%8C%E8%B1%AA-%E5%AF%B6%E5%BA%A7-010300859.html 標題:瘦15kg減重專家：伸展消贅肉網址:https://bit.ly/2JZeAVM 標題:15秒賣1瓶驚人紀錄 MIT獲好評網址:https://tw.style.yahoo.com/%E4%B9%BE%E8%82%8C%E4%BA%BA%E6%AF%9B%E5%AD%94%E7%B2%97%E5%A4%A7%E5%BF%85%E8%B2%B7drwu%E8%B6%85%E5%88%92%E7%AE%97%E7%B5%84%E5%90%88%E5%B9%AB%E4%BD%A0%E7%95%AB%E5%A5%BD%E9%87%8D%E9%BB%9E%E5%95%A6-030522190.html 標題:夫子曝弱點「天空城2」一杯倒網址:https://wetv.info/jctcc1#0414wtt 標題:法醫系撩妹「小美滿」搞笑片段網址:https://wetv.info/smm8#0414wt 標題:將皇后打冷宮！宋仁宗愛上她網址:https://wetv.info/qcb6#0414wt 標題:吳奇隆劇中護地下情身世揭密網址:https://www.litv.tv/vod/drama/content.do?content_id=VOD00165496&sponsorName=eWFob28=&autoPlay=1#0414hh 標題:你有危機意識唐老師心測4選1 網址:https://tw.tv.yahoo.com/ent-jessetang/%E6%9B%B8%E5%81%89%E5%8D%B1%E6%A9%9F%E6%84%8F%E8%AD%98%E8%B6%85%E4%BD%8E-%E5%94%90%E8%80%81%E5%B8%AB%E5%8B%B8%E4%B8%96%E5%BF%83%E5%A4%A7%E7%88%86%E7%99%BC-080212366.html#0414hh ```python= import requests import time from bs4 import BeautifulSoup import os import re import urllib.request import json PTT_URL = 'https://www.ptt.cc' #取得網頁文件的function def get_web_page(url): time.sleep(0.5) # 每次爬取前暫停 0.5 秒以免被 PTT 網站判定為大量惡意爬取 resp = requests.get( url=url, cookies={'over18': '1'} ) if resp.status_code != 200: print('Invalid url:', resp.url) return None else: return resp.text def get_articles(dom, date): soup = BeautifulSoup(dom, 'html.parser') # 取得上一頁的連結 paging_div = soup.find('div', 'btn-group btn-group-paging') prev_url = paging_div.find_all('a')[1]['href'] articles = [] # 儲存取得的文章資料 divs = soup.find_all('div', 'r-ent') for d in divs: if d.find('div', 'date').string.strip() == date: # 發文日期正確 # 取得推文數 push_count = 0 if d.find('div', 'nrec').string: try: push_count = int(d.find('div', 'nrec').string) # 轉換字串為數字 except ValueError: # 若轉換失敗，不做任何事，push_count 保持為 0 pass # 取得文章連結及標題 if d.find('a'): # 有超連結，表示文章存在，未被刪除 href = d.find('a')['href'] title = d.find('a').string articles.append({ 'title': title, 'href': href, 'push_count': push_count }) return articles, prev_url#回傳這一頁的文章和上一頁 def parse(dom): soup = BeautifulSoup(dom, 'html.parser') links = soup.find(id='main-content').find_all('a') img_urls = [] for link in links: if re.match(r'^https?://(i.)?(m.)?imgur.com', link['href']): img_urls.append(link['href']) return img_urls def save(img_urls, title): if img_urls: try: dname = title.strip() # 用 strip() 去除字串前後的空白 os.makedirs(dname) for img_url in img_urls: if img_url.split('//')[1].startswith('m.'): img_url = img_url.replace('//m.', '//i.') if not img_url.split('//')[1].startswith('i.'): img_url = img_url.split('//')[0] + '//i.' + img_url.split('//')[1] if not img_url.endswith('.jpg'): img_url += '.jpg' fname = img_url.split('/')[-1] urllib.request.urlretrieve(img_url, os.path.join(dname, fname)) except Exception as e: print(e) if __name__ == '__main__': current_page = get_web_page(PTT_URL + '/bbs/Beauty/index.html') if current_page: articles = [] # 全部的今日文章 date = time.strftime("%m/%d").lstrip('0') # 今天日期, 去掉開頭的 '0' 以符合 PTT 網站格式 current_articles, prev_url = get_articles(current_page, date) # 目前頁面的今日文章 while current_articles: # 若目前頁面有今日文章則加入 articles，並回到上一頁繼續尋找是否有今日文章 articles += current_articles current_page = get_web_page(PTT_URL + prev_url) current_articles, prev_url = get_articles(current_page, date) # 已取得文章列表，開始進入各文章讀圖 for article in articles: print('Processing', article) page = get_web_page(PTT_URL + article['href']) if page: img_urls = parse(page) save(img_urls, article['title']) article['num_image'] = len(img_urls) # 儲存文章資訊 with open('data.json', 'w', encoding='utf-8') as f: json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False) ``` Processing {'title': '[正妹] 健身正妹兇', 'href': '/bbs/Beauty/M.1586837739.A.E4B.html', 'push_count': 0} Processing {'title': '[神人] 那些年的無名網紅現況 ?', 'href': '/bbs/Beauty/M.1586838094.A.8F5.html', 'push_count': 16} [WinError 123] 檔案名稱、目錄名稱或磁碟區標籤語法錯誤。: '[神人] 那些年的無名網紅現況 ?' Processing {'title': '[正妹] 兇屁', 'href': '/bbs/Beauty/M.1586844061.A.508.html', 'push_count': 16} Processing {'title': '[廣告] 絕世美少女明里紬(明里つむぎ) ', 'href': '/bbs/Beauty/M.1586845504.A.7BD.html', 'push_count': 27} Processing {'title': '[正妹] 衣服很好看', 'href': '/bbs/Beauty/M.1586848942.A.70C.html', 'push_count': 4} Processing {'title': '[正妹] 包口罩', 'href': '/bbs/Beauty/M.1586852305.A.69A.html', 'push_count': 4} Processing {'title': '[正妹] 正', 'href': '/bbs/Beauty/M.1586852589.A.6CA.html', 'push_count': 2} Processing {'title': '[正妹] 你是不是想……', 'href': '/bbs/Beauty/M.1586853720.A.516.html', 'push_count': 23} Processing {'title': '[正妹] 道歉時露出胸部常識吧', 'href': '/bbs/Beauty/M.1586858766.A.E7B.html', 'push_count': 31} Processing {'title': '[神人] 業配妹子', 'href': '/bbs/Beauty/M.1586863419.A.447.html', 'push_count': 0} Processing {'title': '[新聞]府院齊轟譚德塞!"宜蘭女孩"林薇發聲破百萬', 'href': '/bbs/Beauty/M.1586864350.A.BE9.html', 'push_count': 0} Processing {'title': '[帥哥] 你猜的沒錯，又是這張帥臉!', 'href': '/bbs/Beauty/M.1586796404.A.25D.html', 'push_count': 0} Processing {'title': '[正妹] 音樂老師', 'href': '/bbs/Beauty/M.1586817728.A.88B.html', 'push_count': 18} Processing {'title': '[正妹] 清水あいり', 'href': '/bbs/Beauty/M.1586823997.A.8B2.html', 'push_count': 55} Processing {'title': '[正妹] 自由潛水', 'href': '/bbs/Beauty/M.1586835882.A.6EE.html', 'push_count': 28} # Assertion斷言程式進行到某的時間點,斷定其必然是某種狀態,具體而言,就是斷定這個時間點上Python要進行斷言測試,可以使用assert ```python= class Account: def __init__(self,number,name): self.number=number self.name=name self.balance=0 def deposit(self,amount): assert amount>0,'必須是大於0的正數' self.balance+=amount def withdraw(self,anoumt): assert amount>0,'必須是大於0的正數' if amount<=self.balance: self.balance-=amount else: raise RuntimeError('blance no enouge') a=Account('E122',"Moninca") a.deposit(0) ``` AssertionError Traceback (most recent call last) <ipython-input-6-31eeed44e7f0> in <module> 14 raise RuntimeError('blance no enouge') 15 a=Account('E122',"Moninca") ---> 16 a.deposit(0) <ipython-input-6-31eeed44e7f0> in deposit(self, amount) 5 self.balance=0 6 def deposit(self,amount): ----> 7 assert amount>0,'必須是大於0的正數' 8 self.balance+=amount 9 def withdraw(self,anoumt): AssertionError: 必須是大於0的正數 # 定位元素法 find_element_by_id find_element_by_name find_element_by_xpath find_element_by_Link_text find_element_by_tag_name find_element_by_class_name find_element_by_class_selector ```python= from selenium import webdriver from selenium.webdriver.common.keys import Keys # brower=webdriver.Chrome('/Users/ASUS/chromedriver') url='http://www.python.org' # brower.get(url) assert 'Python' in brower.title elem=brower.find_element_by_name('q') elem.clear() elem.send_keys('pycon') elem.send_keys(Keys.RETURN) print(brower.page_source) ``` 自動幫你在網頁搜尋 ```python= from selenium import webdriver from selenium.webdriver.common.keys import Keys brower=webdriver.Chrome('/Users/ASUS/chromedriver') url='https://www.google.com/' brower.get(url) assert 'Google' in brower.title elem=brower.find_element_by_name('q') elem.clear() elem.send_keys('word') elem.send_keys(Keys.RETURN) print(brower.page_source) ``` 自動在Google搜尋word ```python= from selenium import webdriver from selenium.webdriver.common.keys import Keys brower=webdriver.Chrome('/Users/ASUS/chromedriver') url='https://www.baidu.com/' brower.get(url) assert '百度一下，你就知道' in brower.title elem=brower.find_element_by_id('kw') elem.clear() elem.send_keys('python') elem.send_keys(Keys.RETURN) print(brower.page_source) ``` 自動在百度搜尋python ```python= from selenium.webdriver.common.desired_capabilities import DesiredCapabilities import time #開啟browser def openChrome(): option=webdriver.Chrome('/Users/ASUS/chromedriver') return option #授權操作 def operationAuth(driver): url='https://www.baidu.com/' driver.get(url) # elem=driver.find_element_by_id('kw') elem.send_keys('selenium') driver.find_element_by_xpath("//*[@id='su']").click()#click()是幫你按搜尋鍵等同於ENTER print('查詢操作完畢') if __name__=='__main__': driver=openChrome() operationAuth(driver) ``` 查詢操作完畢 xpath 1.絕對路徑 (/home/a/...) 2.相對路徑 (//div[@class='red']/a) 3.依標籤屬性定位 (//div[@class='red']) # 滑鼠操作 context_click(elem)右鍵點選元素elem double_click(elem)雙擊元素,地圖web可實現放大能力 drag_and_drop(source,targer)移動滑鼠,按下左鍵移動到目標 move_to_element(elem)滑鼠移動到一個元素上 click_and_hold(elem)按下左鍵在任一元素上 ```python= from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.action_chains import ActionChains from selenium.webdriver.common.keys import Keys import pyautogui from time import sleep driver=webdriver.Chrome('/Users/ASUS/chromedriver') driver.get('https://www.jianshu.com/') wait=WebDriverWait(driver,10) img=wait.until(EC.element_to_be_clickable((By.TAG_NAME,'img'))) actions=ActionChains(driver) actions.context_click(img) actions.perform() pyautogui.typewrite(['down','down','down','down','down','down','down','enter','enter']) sleep(1) pyautogui.typewrite(['enter']) ``` 會幫你下載網頁圖片 ![](https://i.imgur.com/y3Oc7ij.png) ```python= from selenium import webdriver from selenium.webdriver.common.keys import Keys Browser = webdriver.Chrome('/Users/ASUS/chromedriver') LoginUrl= ('https://member.ithome.com.tw/login') UserName= ('fv744850') UserPass= ('rjrs791120') Browser.get(LoginUrl) Browser.find_element_by_id('account').send_keys(UserName) Browser.find_element_by_id('password').send_keys(UserPass) Browser.find_element_by_id('password').send_keys(Keys.ENTER) Browser.save_screenshot('test.png') Browser.quit() ``` # 鍵盤操作在WebdriverKeys中包含了鍵盤操作 send_keys(keys.ENTER) send_keys(keys.TAB) send_keys(keys.SPACE)#空白鍵 send_keys(keys.ESPACE)#ESC鍵 send_keys(keys.BACK_SPACE)#倒退鍵 send_keys(keys.SHIFT) send_keys(keys.CONTROL)#ctrl send_keys(keys.ARROW_DOWN)#向下按鍵 send_keys(keys.CONTRIL,'a')#ctrl+a send_keys(keys.CONTRIL,'c')#ctrl+c send_keys(keys.CONTRIL,'x')#ctrl+x send_keys(keys.CONTRIL,'v')#ctrl+v ```python= import time from selenium import webdriver from selenium.webdriver.common.keys import Keys driver=webdriver.Chrome('/Users/ASUS/chromedriver') driver.get('http://www.baidu.com') #輸入框輸入內容 elem=driver.find_element_by_id('kw') elem.send_keys('Eatment CSON') time.sleep(3) #刪除一個字元 elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) time.sleep(3) #輸入空格+'部落格' elem.send_keys(Keys.SPACE) elem.send_keys(u'部落格') time.sleep(3) #ctrl+a 權選輸入框內容 elem.send_keys(Keys.CONTROL,'a') time.sleep(3) #ctrl+x elem.send_keys(Keys.CONTROL,'x') time.sleep(3) #ctrl+v 輸入礦重新輸入,搜尋 elem.send_keys(Keys.CONTROL,'v') time.sleep(3) driver.find_element_by_id('su').send_keys(Keys.ENTER) time.sleep(3) driver.quit() ``` 幫你在百度搜尋並離開視窗 ```python= import time from selenium import webdriver from selenium.webdriver.common.keys import Keys driver=webdriver.Chrome('/Users/ASUS/chromedriver') driver.get('https://www.google.com/') elem=driver.find_element_by_name('q') elem.send_keys(u'聯成電腦') time.sleep(3) elem.send_keys(Keys.SPACE) elem.send_keys(u'Java') time.sleep(3) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) time.sleep(3) elem.send_keys(u'Python 聯成電腦') time.sleep(3) elem.send_keys(Keys.CONTROL,'a') time.sleep(1) elem.send_keys(Keys.CONTROL,'x') time.sleep(1) elem.send_keys(Keys.CONTROL,'v') time.sleep(3) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) elem.send_keys(Keys.BACK_SPACE) time.sleep(3) elem.send_keys(Keys.ENTER) time.sleep(3) driver.quit() ``` 利用google去搜尋 # Scrapy套件 scrapy自帶提取資料的機制,選擇由xpath或css 表示式指定HTML文件的某些部分 Xpath是因於XML中選擇節點,也可以與HTML一起使用 CSS是一種將樣式應用於HTML Scrapy建構在LXML庫上,速度和解析精銳度都優於BeautifulSoup 使用Selector選擇器 1.Selector 2.Response類別用於下載HTTP回傳訊息的類別,他的子類別有TextResponse,HtmlResponse(可以自動發現編輯方式通過HTTP wetahttp-equiv),XmlRespons scrapy利用xpath和css來定位元素,他有5個基本方法 1.xpath()選擇節點 2.CSS()使用CSS語法選擇節點 3.extract()傳回選擇元素的unicode字串 4.extract_first()傳回第一個比對到的unicode字串(SelectorList專有) 5.re()透過正則表達式提取unicode字串 6.re_first()(SelectorList專有) ```python= from scrapy.selector import Selector from scrapy.http import HtmlResponse body='<html><body><span>good</span></body></html>' Selector(text=body).xpath('//span/text()').extract() ``` ['good'] ```python= from scrapy.selector import Selector from scrapy.http import HtmlResponse body='''<html> <title>123</title> <body> <h1>Hello World</h1> <h1>Hello Python</h1> <b>Hello Jave</b> <ul> <li>C++</li> <li>C#</li> <li>Python</li> </ul> </body> </html>''' response=HtmlResponse(url='http://example.com',body=body,encoding='utf-8') selector=Selector(response=response) response.selector.xpath('//title/text()') ``` [<Selector xpath='//title/text()' data='123'>] ```python= a=Selector(response=response).response.selector.xpath('//h1/text()').extract() for i in a: print(i) ``` Hello World Hello Python ```python= selector_list=selector.xpath('//h1') for sel in selector_list: print(sel.xpath('./text()')) ``` [<Selector xpath='./text()' data='Hello World'>] [<Selector xpath='./text()' data='Hello Python'>] ```python= from scrapy.selector import Selector from scrapy.http import HtmlResponse text='''<ul> <li>Python 學習手冊 <b>價格:99.00元</b></li> <li>Python 大數據分析 <b>價格:88.00元</b></li> <li>Spark 數據分析 <b>價格:97.00元</b></li> </ul>''' select=Selector(text=text) print(select.xpath('//li/b/text()')) print(select.xpath('//li/b/text()').extract()) print(select.xpath('//li/b/text()').extract_first()) print(select.xpath('//li/b/text()').re('\d+\.\d+')) print(select.xpath('//li/b/text()').re_first('\d+\.\d+')) ``` [<Selector xpath='//li/b/text()' data='價格:99.00元'>, <Selector xpath='//li/b/text()' data='價格:88.00元'>, <Selector xpath='//li/b/text()' data='價格:97.00元'>] ['價格:99.00元', '價格:88.00元', '價格:97.00元'] 價格:99.00元 ['99.00', '88.00', '97.00'] 99.00 ```python= from scrapy.selector import Selector from scrapy.http import HtmlResponse body = ''' <html> <head> <base href='http://example.com/' /> <title>Example website</title> </head> <body> <div id='images'> <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'> <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'> <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'> <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'> <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'> </div> </body> </html>''' select=Selector(text=body).xpath('//a/text()').extract() for i in select: print(i) ``` Name: Image 1 Name: Image 2 Name: Image 3 Name: Image 4 Name: Image 5 ```python= from scrapy.selector import Selector from scrapy.http import HtmlResponse body = ''' <html> <head> <base href='http://example.com/' /> <title>Example website</title> </head> <body> <div id='images'> <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'> <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'> <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'> <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'> <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'> </div> </body> </html>''' response=HtmlResponse(url='http://example.com',body=body,encoding='utf-8') selector=Selector(response=response) a=response.selector.xpath('//a/text()').extract() for i in a: print(i) ``` Name: Image 1 Name: Image 2 Name: Image 3 Name: Image 4 Name: Image 5 ```python= sel=response.xpath('//img/@src').extract() for s in sel: print(s) ``` image1.jpg image2.jpg image3.jpg image4.jpg image5.jpg ```python= sel=response.xpath('//a/@href').extract() for s in sel: print(s) ``` image1.html image2.html image3.html image4.html image5.html ```python= print(response.xpath('//a[3]/@href').extract_first()) ``` image3.html ```python= #使用HtmlResponse from scrapy.http import HtmlResponse html1=open('example1.html',encoding='utf-8').read() html2=open('example2.html',encoding='utf-8').read() response1=HtmlResponse(url='http://example1.com',body=html1,encoding='utf-8') response2=HtmlResponse(url='http://example2.com',body=html2,encoding='utf-8') print(response1) print(response2) ``` <200 http://example1.com> <200 http://example2.com> ```python= #LinkExtractor選擇器 from scrapy.linkextractors import LinkExtractor le=LinkExtractor() links=le.extract_links(response1) [link.url for link in links] ``` ['http://example1.com/intro/install.html', 'http://example1.com/intro/tutorial.html', 'http://example1.com/examples.html', 'http://stackoverflow.com/tags/scrapy/info', 'https://github.com/scrapy/scrapy'] ```python= #allow,接收一個正則表達式或正則表達式列表選取link.如果沒有參數,就提取全部 from scrapy.linkextractors import LinkExtractor pattern='/intro/.+\.html$' le=LinkExtractor(allow=pattern) links=le.extract_links(response1) [link.url for link in links] ``` ['http://example1.com/intro/install.html', 'http://example1.com/intro/tutorial.html'] ```python= #deny與allow相反,只選取規則外的 from scrapy.linkextractors import LinkExtractor from urllib.parse import urlparse pattern='^'+urlparse(response1.url).geturl() pattern ``` '^http://example1.com' # Scrapy startproject ![](https://i.imgur.com/rsuul3K.jpg) ```python= import scrapy class BooksSpider(scrapy.Spider): #每一個爬蟲的唯一標籤 name='books' #定一爬蟲爬取的起始點,起始點可以有多個 start_urls=['http://books.toscrape.com/'] def parse(self,response): #提供數據 #每一本書的訊息在<article class="prodct_pod">中 #使用css方法找到所有這樣的article for book in response.css('article.product_pod'): name=book.xpath('./h3/a/@title').extract_first() price=book.css('p.price_color::text').extract_first() yield{'name':name,'price':price,} #提取link #下一頁的url在ul.pager>li.next>a next_url=response.css('ul.pager li.next a::attr(href)').extract_first() if next_url: #如果找到下一頁的url絕對路徑,構選新的request對象 next_url=response.urljoin(next_url) yield scrapy.Request(next_url,callback=self.parse) ``` ![](https://i.imgur.com/c014Z2Y.jpg) 使用VSCODE撰寫 Scrapy框架提供了兩個item Pipline專門用於下載文件和圖片 FilePipeline & ImagesPipeline(可下載文件有PDF程式檔案) ![](https://i.imgur.com/tIv35sv.jpg)