# Web Scraping

```python=
import requests
import webbrowser
address=input('請輸入地址:')
webbrowser.open('http://www.google.com.tw/maps/search/'+address)
```
請輸入地址:高雄火車站
True
requests→get() (fetch a web page)
webbrowser→open() (open a web page in the browser)
urllib.request→urlopen() (fetch a web page)
```python=
# GET request
# A plain, simple page can be downloaded directly with GET
# GET also returns the server's status (response) code
import requests
# download with GET
r=requests.get('http://www.google.com.tw/')
print(r)
# the server's response code
print(r.status_code)
# check the server status
if r.status_code==requests.codes.ok:
    print('ok')
```
<Response [200]>
200
ok
```python=
import requests
r=requests.get('http://www.grandtech.info')
print(type(r))
```
<class 'requests.models.Response'>
```python=
import requests
r=requests.get('http://www.grandtech.info')
print(type(r))
if r.status_code==requests.codes.ok:
    print('取得網頁內容成功')
else:
    print('取得網頁內容失敗')
```
<class 'requests.models.Response'>
取得網頁內容成功
```python=
import requests
import webbrowser
print('開啟搜尋網頁')
param={'wd':'yahoo購物'}  # the query is appended to the URL as "?wd="
r=requests.get('https://www.baidu.com/s',params=param)
webbrowser.open(r.url)
```
開啟搜尋網頁
True
This one searches Baidu.
```python=
import requests
import webbrowser
print('開啟搜尋網頁')
param={'q':'yahoo購物'}  # the query is appended to the URL as "?q="
r=requests.get('https://www.google.com/search',params=param)
webbrowser.open(r.url)
```
開啟搜尋網頁
True
This one searches Google.
```python=
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen('http://www.pythonscraping.com/pages/page1.html')
bsObj=BeautifulSoup(html,'html.parser')
print(type(bsObj))
print(bsObj.h1)
print(bsObj.h1.text)
```
<class 'bs4.BeautifulSoup'>
<h1>An Interesting Title</h1>
An Interesting Title
```python=
from bs4 import BeautifulSoup
from urllib.request import urlopen
html=urlopen('https://morvanzhou.github.io/static/scraping/basic-structure.html').read().decode('utf-8')
soup=BeautifulSoup(html,'html.parser')
print(type(soup))
print(soup.h1.text)
print('\n',soup.p.text)
# grab all the links with find_all
all_href=soup.find_all('a')
print(all_href)
for i in all_href:
print(i['href'])
```

```python=
from bs4 import BeautifulSoup
from urllib.request import urlopen
html=urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj=BeautifulSoup(html,'html.parser')
# find the text shown in red
namelist1=bsObj.find_all('span',{"class":"red"})
print(type(namelist1))
for name in namelist1:
print(name.get_text())
```
<class 'bs4.element.ResultSet'>
Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?
You are
staying the whole evening, I hope?
And the fete at the English ambassador's? Today is Wednesday. I
must put in an appearance there,
My daughter is
coming for me to take me there.
I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome.
If they had known that you wished it, the entertainment would
have been put off,
Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything.
What can one say about it?
What has been decided? They have decided that
Buonaparte has burnt his boats, and I believe that we are ready to
burn ours.
Oh, don't speak to me of Austria. Perhaps I don't understand
things, but Austria never has wished, and does not wish, for war.
She is betraying us! Russia alone must save Europe. Our gracious
sovereign recognizes his high vocation and will be true to it. That is
the one thing I have faith in! Our good and wonderful sovereign has to
perform the noblest role on earth, and he is so virtuous and noble
that God will not forsake him. He will fulfill his vocation and
crush the hydra of revolution, which has become more terrible than
ever in the person of this murderer and villain! We alone must
avenge the blood of the just one.... Whom, I ask you, can we rely
on?... England with her commercial spirit will not and cannot
understand the Emperor Alexander's loftiness of soul. She has
refused to evacuate Malta. She wanted to find, and still seeks, some
secret motive in our actions. What answer did Novosiltsev get? None.
The English have not understood and cannot understand the
self-abnegation of our Emperor who wants nothing for himself, but only
desires the good of mankind. And what have they promised? Nothing! And
what little they have promised they will not perform! Prussia has
always declared that Buonaparte is invincible, and that all Europe
is powerless before him.... And I don't believe a word that Hardenburg
says, or Haugwitz either. This famous Prussian neutrality is just a
trap. I have faith only in God and the lofty destiny of our adored
monarch. He will save Europe!
I think,
that if you had been
sent instead of our dear Wintzingerode you would have captured the
King of Prussia's consent by assault. You are so eloquent. Will you
give me a cup of tea?
In a moment. A propos,
I am
expecting two very interesting men tonight, le Vicomte de Mortemart,
who is connected with the Montmorencys through the Rohans, one of
the best French families. He is one of the genuine emigres, the good
ones. And also the Abbe Morio. Do you know that profound thinker? He
has been received by the Emperor. Had you heard?
I shall be delighted to meet them,
But tell me,
is it true that the Dowager Empress wants Baron Funke
to be appointed first secretary at Vienna? The baron by all accounts
is a poor creature.
Baron Funke has been recommended to the Dowager Empress by her
sister,
Now about your family. Do you know that since your daughter came
out everyone has been enraptured by her? They say she is amazingly
beautiful.
I often think,
I often think how unfairly sometimes the
joys of life are distributed. Why has fate given you two such splendid
children? I don't speak of Anatole, your youngest. I don't like
him,
Two such charming children. And really you appreciate
them less than anyone, and so you don't deserve to have them.
I can't help it,
Lavater would have said I
lack the bump of paternity.
Don't joke; I mean to have a serious talk with you. Do you know I
am dissatisfied with your younger son? Between ourselves
he was mentioned at Her
Majesty's and you were pitied....
What would you have me do?
You know I did all
a father could for their education, and they have both turned out
fools. Hippolyte is at least a quiet fool, but Anatole is an active
one. That is the only difference between them.
And why are children born to such men as you? If you were not a
father there would be nothing I could reproach you with,
I am your faithful slave and to you alone I can confess that my
children are the bane of my life. It is the cross I have to bear. That
is how I explain it to myself. It can't be helped!
```python=
# find the text shown in green
namelist2=bsObj.find_all('span',{'class':'green'})
for i in namelist2:
print(i.get_text())
```
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
```python=
# print the h1 and h2 headings
t=bsObj.find_all({'h1','h2'})
print(t)
```
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
```python=
# find both the red and the green text
s=bsObj.find_all('span',{'class':['red','green']})
s
```
Ways of passing filters to find_all (a short sketch follows):
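A quick sketch of the different filters find_all() accepts: a tag name, an attribute dict, a list of values, a compiled regular expression, and a result limit. The HTML string here is made up purely for illustration:
```python=
import re
from bs4 import BeautifulSoup
html='''<div><span class="red">a</span><span class="green">b</span>
<h1>t1</h1><h2>t2</h2><img src="p.jpg"/></div>'''
soup=BeautifulSoup(html,'html.parser')
print(soup.find_all('span'))                              # by tag name
print(soup.find_all('span',{'class':'red'}))              # by attribute dict
print(soup.find_all('span',{'class':['red','green']}))    # attribute with a list of values
print(soup.find_all(['h1','h2']))                         # a list of tag names
print(soup.find_all('img',{'src':re.compile(r'\.jpg$')})) # regular expression on an attribute
print(soup.find_all('span',limit=1))                      # limit the number of results
```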

```python=
# Example: the Taiwan Lottery home page
import requests
from bs4 import BeautifulSoup
url='https://www.taiwanlottery.com.tw/'
html=requests.get(url) # fetch the page with GET
print(html.raise_for_status()) # raises if the status is not 200; prints None on success
objSoup=BeautifulSoup(html.text,'html.parser')
# use select() to pick every element with class contents_box02
dataTag=objSoup.select('.contents_box02')
print('串列長度:',len(dataTag))
# print each contents_box02 element in the list
for i in range(len(dataTag)):
    print(dataTag[i])
```
None
串列長度: 4
<div class="contents_box02">
<div id="contents_logo_02"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/9 第109000020期 </span><span class="font_red14"><a href="Result_all.aspx#01">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序<br/>第二區</div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">35 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">35 </div><div class="ball_red">01 </div>
</div>
<div class="contents_box02">
<div id="contents_logo_03"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/9 第109000020期 </span><span class="font_red14"><a href="Result_all.aspx#07">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序</div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">35 </div><div class="ball_tx ball_green">18 </div><div class="ball_tx ball_green">19 </div><div class="ball_tx ball_green">22 </div><div class="ball_tx ball_green">27 </div><div class="ball_tx ball_green">30 </div><div class="ball_tx ball_green">35 </div>
</div>
<div class="contents_box02">
<div id="contents_logo_04"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/10 第109000028期 </span><span class="font_red14"><a href="Result_all.aspx#02">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序<br/>特別號</div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_red">42 </div>
</div>
<div class="contents_box02">
<div id="contents_logo_05"></div><div class="contents_mine_tx02"><span class="font_black15">109/3/10 第109000028期 </span><span class="font_red14"><a href="Result_all.aspx#08">開獎結果</a></span></div><div class="contents_mine_tx04">開出順序<br/>大小順序</div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">01 </div><div class="ball_tx ball_yellow">04 </div><div class="ball_tx ball_yellow">12 </div><div class="ball_tx ball_yellow">25 </div><div class="ball_tx ball_yellow">34 </div><div class="ball_tx ball_yellow">37 </div>
</div>
```python=
# balls in the order they were drawn
# green balls (威力彩)
balls=dataTag[0].find_all('div',{'class':'ball_tx ball_green'})
print('開出順序',end=' ')
for i in range(6):
print(balls[i].text,end=' ')
```
開出順序 27 18 30 19 22 35
```python=
# balls in the order they were drawn
# yellow balls (大樂透)
balls=dataTag[2].find_all('div',{'class':'ball_tx ball_yellow'})
print('開出順序',end=' ')
for i in range(6):
print(balls[i].text,end=' ')
```
開出順序 01 12 34 37 04 25
```python=
# balls sorted from small to large
# green balls
balls=dataTag[0].find_all('div',{'class':'ball_tx ball_green'})
print('大小順序',end=' ')
for i in range(6,len(balls)):
print(balls[i].text,end=' ')
```
大小順序 18 19 22 27 30 35
```python=
# balls sorted from small to large
# yellow balls
balls=dataTag[2].find_all('div',{'class':'ball_tx ball_yellow'})
print('大小順序',end=' ')
for i in range(6,len(balls)):
print(balls[i].text,end=' ')
```
大小順序 01 04 12 25 34 37
```python=
# Super Lotto (威力彩) second-zone number
special=dataTag[0].find_all('div',{'class':'ball_red'})
print('第二區:',special[0].text,end=' ')
```
第二區: 01
```python=
# Lotto (大樂透) special number
special=dataTag[2].find_all('div',{'class':'ball_red'})
print('特別號:',special[0].text,end=' ')
```
特別號: 42
```python=
from bs4 import BeautifulSoup
import requests
url='https://morvanzhou.github.io/static/scraping/list.html'
html=requests.get(url)
soup=BeautifulSoup(html.text,'html.parser')
month=soup.find_all('li',{'class':'month'})
for i in month:
print(i.text,end=' ')
```
一月 二月 三月 四月 五月
```python=
from bs4 import BeautifulSoup
from urllib.request import urlopen
html=urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon').read().decode('utf-8')
bsObj=BeautifulSoup(html,features='lxml')
network=bsObj.find_all('a')
for i in network:
print(i.get_text())
```

```python=
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html=urlopen('https://morvanzhou.github.io/static/scraping/table.html').read().decode('utf-8')
soup=BeautifulSoup(html,'html.parser')
img_links=soup.find_all('img',{'src':re.compile('.*?\.jpg')})
for link in img_links:
print(link['src'])
print('\n')
course_link=soup.find_all('a',{'href':re.compile('https://morvanzhou.*')})
for link in course_link:
print(link['href'])
```

```python=
import requests
import re
from bs4 import BeautifulSoup
def main():
resp=requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/blog/blog.html')
soup=BeautifulSoup(resp.text,'html.parser')
# find every heading tag h1..h6
titles=soup.find_all(['h1','h2','h3','h4','h5','h6'])
for title in titles:
print(title.text.strip())
# the same thing with a regular expression matching h1..h6
for title in soup.find_all(re.compile('h[1-6]')):
print(title.text.strip())
# find every image whose src ends with .png
imgs=soup.find_all('img')
for img in imgs:
if 'src' in img.attrs:
if img['src'].endswith('.png'):
print(img['src'])
# the same selection using a regular expression
for img in soup.find_all('img',{'src':re.compile('\.png$')}):
print(img['src'])
if __name__=='__main__':
main()
```
Python教學文章
開發環境設定
Mac使用者
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
Python教學文章
開發環境設定
Mac使用者
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
static/python-for-beginners.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python-for-beginners.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
```python=
from bs4 import BeautifulSoup
import requests
import re
url='http://www.deepstone.com.tw/'
try:
htmlfile=requests.get(url)
print('下載成功')
except Exception as error: #error是系統自訂的錯誤訊息
print('網頁原始碼下載失敗:%s'%error)
#儲存檢視的內容
fn='output.txt'
with open(fn,'wb') as fileObj: #以二進位儲存
for diskStorage in htmlfile.iter_content(10240):#Response物件處理
size=fileObj.write(diskStorage)#Response物件寫入
print(size)#列出每次寫入大小
print('以%s儲存網頁成功'%fn)
```
下載成功
10240
10240
7961
以output.txt儲存網頁成功
```python=
from bs4 import BeautifulSoup
import requests
url='https://taqm.epa.gov.tw/pm25/tw/PM25A.aspx?area=1'
html=requests.get(url)
sp=BeautifulSoup(html.text,'html.parser')
print(sp.select('title')[0].text.strip()) # the page <title>
print(sp.find('span',{'id':'ctl08_labText1'}).text.strip())
print(sp.find('a',{'href':'HourlyData.aspx'}).get('title').strip())
rs=sp.find_all('tr',{'align':'center','style':'border-width:1px;border-style:Solid;'})
for r in rs:
name=r.find('a').text.strip()
pm25=r.find_all('span')
print(name,end=' ')
for p in pm25:
print(p.text.strip(),end=' ')
```
行政院環保署-細懸浮微粒
發布時間:2020/03/17 21:00
資料下載
富貴角 16 18 萬里 20 14 淡水 20 21 林口 28 24 三重 31 32 菜寮 23 22 汐止 24 26 新莊 30 27 永和 27 26 板橋 32 26 土城 37 26 新店 29 29 陽明 15 18 士林 26 23 大同 29 26 中山 32 32 松山 27 24 萬華 25 21 古亭 27 23 基隆 22 21 大園 25 21 觀音 20 19 桃園 33 28 平鎮 36 30 中壢 36 26 龍潭 43 34
```python=
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint
from time import time
url='https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1'
response=get(url)
print(response.text[:300])
html_soup= BeautifulSoup(response.text,'html.parser')
type(html_soup)
movie_containers=html_soup.find_all('div',class_='lister-item mode-advanced')
movie_containers=html_soup.find_all('div',{'class':'lister-item mode-advanced'})
print(type(movie_containers))
print(len(movie_containers))
first_movie=movie_containers[0]
print('電影名稱:',first_movie.h3.a.text)
first_year=first_movie.h3.find('span',{'class':'lister-item-year text-muted unbold'})
first_year=first_year.text[1:5]
print('發行年分:',first_year)
first_imDb=float(first_movie.strong.text)
print(type(first_imDb))
print('電影評分:',first_imDb)
first_metascore=first_movie.find('span',class_='metascore favorable')
first_metascore=int(first_metascore.text)
print('Metascore評分:',first_metascore)
first_votes=first_movie.find('span',attrs={'name':'nv'})
print(first_votes.text)
print('投票數:'+first_votes['data-value'])#int
```


電影名稱: 羅根
發行年分: 2017
<class 'float'>
電影評分: 8.1
Metascore評分: 77
604,359
投票數:604359
```python=
names=[]
years=[]
imdb_ratings=[]
metascores=[]
votes=[]
for container in movie_containers:
# only extract entries that have a Metascore
if container.find('div',class_='ratings-metascore')is not None:
#The name
name=container.h3.a.text
names.append(name)
#The year
year=container.h3.find('span',class_='lister-item-year').text
years.append(year)
#The IMDB rating
imdb=float(container.strong.text)
imdb_ratings.append(imdb)
#The metascore
m_score=container.find('span',class_='metascore').text
metascores.append(int(m_score))
#The number of votes
vote=container.find('span',attrs={'name':'nv'})['data-value']
votes.append(vote)
test_df=pd.DataFrame({'movie':names,'year':years,'imdb':imdb_ratings,'metascore':metascores,'votes':votes})
print(test_df.info)
test_df['year'].unique()
test_df.loc[:,'year']=test_df['year'].str[-5:-1].astype(int)
test_df.to_csv('movie_ratings.csv',encoding='utf-8-sig')
```
```python=
import requests
from bs4 import BeautifulSoup
def main():
print('蘋果今日焦點')
dom=requests.get('https://tw.appledaily.com/hot/daily').text
soup=BeautifulSoup(dom,'html.parser')
for ele in soup.find('ul','all').find_all('li'):
print(ele.find('div','aht_title_num').text,ele.find('div','aht_title').text)
if __name__=='__main__':
main()
```
蘋果今日焦點
01 最年幼 4歲童確診 再爆26例 累計直逼200「離封城還非常遠」
02 44歲劉真 天上起舞 搶救45天不治 辛龍吻別「撕心裂肺」
03 劉真後事龍巖董座出手幫 辛龍淚覓塔位
04 港媒踢爆 中國確診數造假 逾4萬人無症狀感染竟剔除 治癒者復陽就醫也遭拒收
05 劉真粉紅美裳飛仙 雙雙節留永恆印記 辛龍慟約來世夫妻
06 真辛夫妻6年沒吵過「我回來了」甜嗓絕響
07 肺炎改名去歧視?國際合作才有解(張傳賢)
08 德國才宣布新規 禁2人以上集會 接觸中鏢醫師 梅克爾採檢陰性
09 《女人我最大》結緣 藍心湄大哭中斷錄影
10 【Video Talk】主動脈瓣膜狹窄 存活率2年不到5成
11 劉真共舞留口紅印 華仔遺憾緣滅
12 明起第二輪網購口罩 縮短為1周就可取貨
13 辛龍寵妻給驚喜 飛法國買包當生日禮
14 紐約疫情慘重 確診逾萬 死亡近百
15 Ella同理人母放不下 蕭亞軒打氣辛龍
16 劉真撒手對岸也心碎 10億觸擊熱搜首位
17 劉真留憾 來不及陪4歲愛女長大
18 大S鬥鞋惺惺相惜 難忘搶贏舞后5分鐘
19 20年國標搭檔像兄妹 李志堯「美麗永停留」
20 林熙蕾泣未見奇蹟 關穎同擁4歲女鼻酸
21 動刀置換主動脈瓣膜 葉克膜 開腦 仍救不回
22 中研院群聚擴大 德籍男傳女友
23 南韓驚爆性剝削聊天室 主嫌遭起底為學霸 付費會員達26萬 逾70名女性淪性奴
24 曾馨瑩Pizza廣告交心 哭嘆劉真幸福太短
25 蘋中信:病毒改變了我們的未來生活(王浩威)
26 【疫情最前線】美12歲童染疫靠機器呼吸 家人po警世文
27 人渣文本專欄:你相信中國的病例數?(周偉航)
28 吳宗憲和霓霓通話哽咽 憂辛龍走不出
29 鞋痴想開鞋子博物館 時尚咖不手軟「保值傳家」
30 【45天病程解析】定情日變忌日!北榮證實劉真腦壓過高 22日22時22分離世
```python=
import requests
from bs4 import BeautifulSoup
resp=requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/table/table.html')
soup=BeautifulSoup(resp.text,'html.parser')
prices=[]
rows=soup.find('table','table').tbody.find_all('tr')
print(len(rows))
for row in rows:
price=row.find_all('td')[2].text
prices.append(int(price))
print('均價:',sum(prices)/len(prices))
```
6
均價: 1823.3333333333333
```python=
rows=soup.find('table','table').tbody.find_all('tr')
for row in rows:
all_tds=row.find_all('td')
if 'href' in all_tds[3].a.attrs:
href=all_tds[3].a['href']
else:
href=None
print(all_tds[0].text,all_tds[1].text,all_tds[2].text,href,all_tds[3].a.img['src'])
```
初心者 - Python入門 初學者 1490 http://www.pycone.com img/python-logo.png
Python 網頁爬蟲入門實戰 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 機器學習入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料科學入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料視覺化入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 網站架設入門實戰 (預計) 有程式基礎的初學者 1890 None img/python-logo.png
```python=
#attrs屬性
html='<div data-name="123">foo</div>'
soup=BeautifulSoup(html,'html.parser')
data_tag=soup.find(attrs={"data-name":"123"})
print(data_tag)
```
<div data-name="123">foo</div>
```python=
import requests
import re
from bs4 import BeautifulSoup
Y_MOVIE_URL='https://movies.yahoo.com.tw/movie_thisweek.html'
def get_web_page(url):
resp=requests.get(url)
if resp.status_code!=200:
print('Invalid url',resp.url)
return None
else:
return resp.text
def get_date(date_str):
    # e.g. the release-date text contains 2017-03-23
    pattern='\d+-\d+-\d+'
    match=re.search(pattern,date_str)
    if match is not None:
        return match.group(0)
    else:
        return date_str
def get_movie_id(url):
try:
movie_id=url.split('.html')[0].split('-')[-1]
except:
movie_id=url
return movie_id
def get_movies(dom):
soup=BeautifulSoup(dom,'html.parser')
movies=[]
rows=soup.find_all('div','release_info_text')
for row in rows:
movie=dict()
movie['expectation']=row.find('div','leveltext').span.text.strip()
movie['ch_name']=row.find('div','release_movie_name').a.text.strip()
movie['eng_name']=row.find('div','release_movie_name').find('div','en').a.text.strip()
movie['movie-id']=get_movie_id(row.find('div','release_movie_name').a['href'])
movie['poster_url']=row.parent.find_previous_sibling('div','release_foto').a.img['src']
movie['release_date']=get_date(row.find('div','release_movie_time').text)
movies.append(movie)
return movies
if __name__=='__main__':
page=get_web_page(Y_MOVIE_URL)
movies=get_movies(page)
for movie in movies:
print(movie)
# Homework: how would you pull the individual fields out of these results?
```

```python=
from warnings import warn
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
pages=[str(i) for i in range(1,5)]
years_url=[str(i) for i in range(2010,2019)]
requests=0
for year_url in years_url:
for page in pages:
response = get('http://www.imdb.com/search/title?release_date=' + year_url +
'&sort=num_votes,desc&page=' + page)
sleep(randint(3,5))
requests+=1
if requests>72:
warn('!!!')
break
page_html=BeautifulSoup(response.text,'html.parser')
mv_containers=page_html.find_all('div',class_='lister-item mode-advanced')
for container in mv_containers:
if container.find('div',class_='ratings-metascore') is not None:
name=container.h3.a.text
names.append(name)
year=container.h3.find('span',class_='lister-item-year').text
years.append(year)
imdb=float(container.strong.text)
imdb_ratings.append(imdb)
m_score=container.find('span',class_='metascore').text
metascores.append(m_score)
vote=container.find('span',attrs={'name':'nv'})['data-value']
votes.append(int(vote))
movie_ratings=pd.DataFrame({'movie':names,'year':years,'imdb':imdb_ratings,'metascore':metascores,'votes':votes})
print(movie_ratings.info())
movie_ratings.head(10)
```

```python=
movie_ratings['year'].unique()
```
array(['(2010)', '(I) (2010)', '(2011)', '(I) (2011)', '(2012)',
'(I) (2012)', '(2013)', '(I) (2013)', '(2014)', '(I) (2014)',
'(II) (2014)', '(2015)', '(I) (2015)', '(II) (2015)', '(2016)',
'(II) (2016)', '(I) (2016)', '(IX) (2016)', '(2017)', '(I) (2017)',
'(2018)', '(I) (2018)', '(III) (2018)'], dtype=object)
```python=
movie_ratings.loc[:,'year']=movie_ratings['year'].str[-5:-1].astype(int)
movie_ratings['year'].head(3)
```
0 2010
1 2010
2 2010
Name: year, dtype: int32
```python=
movie_ratings.to_csv('movie_ratings')
movie_ratings.head(10)
```

```python=
import requests
from bs4 import BeautifulSoup
def append_list_pm25():
url = 'https://taqm.epa.gov.tw/pm25/tw/PM25A.aspx?area=1'
html = requests.get(url)
sp = BeautifulSoup(html.text, 'html.parser')
rs = sp.find_all("tr", {"align": "center", "style": "border-width:1px;border-style:Solid;"})
for r in rs:
name = r.find('a')
pm25 = r.find_all('span')
dic = {}
dic.setdefault('name', name.text.strip())
dic.setdefault('pm25', pm25[0].text.strip())
dic.setdefault('pm25_1', pm25[1].text.strip())
list.append(dic)
def get_pm25(name):
for d in list:
if d.get('name') == name:
return d
list=[]
append_list_pm25()
print(list)
name=input('請輸入地區?(例如:林口,桃園):')
d=get_pm25(name)
print(d)
print(d.get('pm25'))
```
[{'name': '富貴角', 'pm25': '15', 'pm25_1': '18'}, {'name': '萬里', 'pm25': '', 'pm25_1': '21'}, {'name': '淡水', 'pm25': '21', 'pm25_1': '23'}, {'name': '林口', 'pm25': '36', 'pm25_1': '30'}, {'name': '三重', 'pm25': '29', 'pm25_1': '29'}, {'name': '菜寮', 'pm25': '26', 'pm25_1': '25'}, {'name': '汐止', 'pm25': '21', 'pm25_1': '22'}, {'name': '新莊', 'pm25': '28', 'pm25_1': '27'}, {'name': '永和', 'pm25': '24', 'pm25_1': '24'}, {'name': '板橋', 'pm25': '18', 'pm25_1': '23'}, {'name': '土城', 'pm25': '34', 'pm25_1': '20'}, {'name': '新店', 'pm25': '21', 'pm25_1': '25'}, {'name': '陽明', 'pm25': '28', 'pm25_1': '26'}, {'name': '士林', 'pm25': '17', 'pm25_1': '24'}, {'name': '大同', 'pm25': '21', 'pm25_1': '22'}, {'name': '中山', 'pm25': '26', 'pm25_1': '29'}, {'name': '松山', 'pm25': '28', 'pm25_1': '25'}, {'name': '萬華', 'pm25': '23', 'pm25_1': '30'}, {'name': '古亭', 'pm25': '29', 'pm25_1': '26'}, {'name': '基隆', 'pm25': '16', 'pm25_1': '13'}, {'name': '大園', 'pm25': '27', 'pm25_1': '31'}, {'name': '觀音', 'pm25': '19', 'pm25_1': '22'}, {'name': '桃園', 'pm25': '37', 'pm25_1': '33'}, {'name': '平鎮', 'pm25': '37', 'pm25_1': '35'}, {'name': '中壢', 'pm25': '36', 'pm25_1': '42'}, {'name': '龍潭', 'pm25': '34', 'pm25_1': '40'}]
請輸入地區?(例如:林口,桃園):三重
{'name': '三重', 'pm25': '29', 'pm25_1': '29'}
29
```python=
import requests
from bs4 import BeautifulSoup
F_FINANCE_URL='https://www.google.com/search?q='
def get_web_page(url,stock_id):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/66.0.3359.181 Safari/537.36'}
resp=requests.get(url+stock_id,headers=headers)
if resp.status_code!=200:
print('找不到網頁:',resp.url)
return None
else:
return resp.text
```
```python=
import requests
from bs4 import BeautifulSoup
G_FINANCE_URL='https://www.google.com/search?q='
def get_web_page(url,stock_id):
#瀏覽器請求,有些網路多家請求會報錯
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/66.0.3359.181 Safari/537.36'}
resp=requests.get(url+stock_id,headers=headers)
if resp.status_code!=200:
print('網頁錯誤')
return None
else:
return resp.text
def get_stock_info(dom):
soup=BeautifulSoup(dom,'html.parser')
stock=dict()
sections=soup.find_all('g-card-section')
stock['name']=sections[1].div.text
spans=sections[1].find_all('div',recursive=False)[1].find_all('span',recursive=False)
stock['current_price']=spans[0].text
stock['current_change']=spans[1].text
for table in sections[3].find_all('table'):
for tr in table.find_all('tr')[:3]:
key=tr.find_all('td')[0].text.lower().strip()
value=tr.find_all('td')[1].text.strip()
stock[key]=value
return stock
if __name__=='__main__':
page=get_web_page(G_FINANCE_URL,'TPE:2330')
if page:
stock=get_stock_info(page)
for k,v in stock.items():
print(k,v)
```
name 台灣積體電路製造TPE: 2330
current_price 已收盤: 3月31日 下午1:30 [GMT+8] ·
current_change 免責聲明
開盤 273.00
最高 274.00
最低 269.50
殖利率 3.47%
上次收盤價 267.50
52 週高點 346.00
Homework: https://acg.gamer.com.tw/billboard.php?p=ANIME&t=3&period=all
```python=
import requests
import re
import random
from bs4 import BeautifulSoup
target_url='https://gas.goodlife.tw/'
rs=requests.session()
res=rs.get(target_url,verify=False)
res.encoding='utf-8'
soup=BeautifulSoup(res.text,'html.parser')
title=soup.select('#main')[0].text.replace('\n','').split('(')[0]
gas_price=soup.select('#gas-price')[0].text.replace('\n\n\n','').replace(' ','')
cpc=soup.select('#cpc')[0].text.replace(' ','')
content='{}\n{}{}'.format(title,gas_price,cpc)
print(content)
```
最後更新時間: 2020-03-31 20:20
柴油預計調整:
-1.1元
下週一2020年04月06日起,預計汽油每公升:
降0.9元
今日中油油價
92:
18.2
95油價:
19.7
98:
21.7
柴油:
15.4
C:\ProgramData\Anaconda3\envs\work\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'gas.goodlife.tw'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
```python=
import requests
import re
import random
from bs4 import BeautifulSoup
target_url='https://disp.cc/b/PttHot'
rs=requests.session()
res=rs.get(target_url,verify=False)
soup=BeautifulSoup(res.text,'html.parser')
content=''
for data in soup.select('#list div.row2 div span.listTitle'):
title=data.text
link='http://disp.cc/b/'+data.find('a')['href']
content+='{}\n{}\n\n'.format(title,link)
print(content)
```
■ [新聞] 博恩「我從國小到高中都被強X!」 重訓健身背後藏瘡疤:
http://disp.cc/b/796-cdVP
■ [政治] 川普兒子發明「責任輪盤」?
http://disp.cc/b/796-cdVS
■ [新聞] 中國真的零確診?「黃安躲在台灣」打臉中共
http://disp.cc/b/796-cdVX
■ Re: [影音] 博恩被強姦的故事
http://disp.cc/b/796-cdWf
■ [問卦] 網咖櫃檯穿這樣我怎麼專心發廢文
http://disp.cc/b/796-cdYj
■ [新聞] 讚大陸防疫得宜!WHO專家:全世界都欠武漢人民一次
http://disp.cc/b/796-c86x
■ [問卦] 同事為了月子中心跟老婆吵架
http://disp.cc/b/796-cdF4
■ [新聞] 無視禁足令!21歲金髮妹炫耀「我才不會感染」 2天後中…
http://disp.cc/b/796-cdGk
■ Re: [新聞] 快訊/美股崩盤!道瓊暴跌近900點
http://disp.cc/b/796-cdKO
■ [新聞] 海邊情侶搭帳篷作愛全都錄 里長證實在漁
http://disp.cc/b/796-cdWU
■ [新聞] 不只博恩!呱吉揭童年慘事「小4被強暴」 地點對象全曝光
http://disp.cc/b/796-cdYd
■ [新聞] 開槍不慎打死女列被告 民眾打爆電話力挺警察
http://disp.cc/b/796-cdYI
■ 日本宣佈「大鎖國」
http://disp.cc/b/796-ce3b
■ [新聞] 王子首相都確診!英國超氣譙中國「蓋牌」:疫情後一定算帳
http://disp.cc/b/796-ce3o
■ [問卦] 志村健在日本地位 相當於台灣哪位藝人?
http://disp.cc/b/796-ce3q
■ [新聞] 啦啦隊女神爬山穿這樣!深U內衣迸出超兇碗公奶
http://disp.cc/b/796-ce3Z
■ [爆卦] 湖南火車出軌最新現場畫面
http://disp.cc/b/796-ce4K
■ [政治] 郝伯村三總逝世
http://disp.cc/b/796-ce55
■ [自動轉寄] [爆卦] 掛記者電話的官員被WHO從網站除名
http://disp.cc/b/796-ce83
□ [公告] PTT 熱門文章板 開板成功
http://disp.cc/b/796-59l9
C:\ProgramData\Anaconda3\envs\work\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'disp.cc'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
```python=
import requests
from bs4 import BeautifulSoup
headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
url='https://www.youtube.com/'
resp=requests.get(url,headers=headers)
soup=BeautifulSoup(resp.text,'html.parser')
target=soup.find_all('a')
txt=open('video-title.txt','w',encoding='utf-8')
for i in target:
f=i.get_text().strip()
txt.write(f)
txt.write('\n')
txt.close()
```

```python=
import requests
from bs4 import BeautifulSoup
url='https://tw.yahoo.com/'
resp=requests.get(url)
#確認下載是否成功
if resp.status_code==requests.codes.ok:
soup=BeautifulSoup(resp.text,'html.parser')
storiers=soup.find_all('a',class_='story-title')
#print(storiers)
for s in storiers:
#新聞標題
print('標題:'+s.text)
#網址
print('網址:'+s.get('href'))
```
標題:境外病例不斷「陸恐打延長賽」
網址:https://tw.news.yahoo.com/%E5%96%AE%E6%97%A5%E7%A2%BA%E8%A8%BA%E7%A0%B4%E7%99%BE-%E9%99%B8%E6%81%90%E6%89%93%E5%BB%B6%E9%95%B7%E8%B3%BD-165116884.html
標題:華航改名?招牌就值10億美元
網址:https://tw.news.yahoo.com/%E8%8F%AF%E8%88%AA%E6%8B%9B%E7%89%8C%E5%80%BC10%E5%84%84%E7%BE%8E%E5%85%83-%E6%94%B9%E5%90%8D%E8%88%AA%E6%AC%8A%E8%AE%8A%E6%95%B8%E5%A4%A7-225614359.html
標題:聚餐返家一夜未醒 上校猝死
網址:https://tw.news.yahoo.com/%E5%BE%8C%E6%8C%87%E9%83%A8%E5%89%AF%E6%97%85%E9%95%B7%E8%81%9A%E9%A4%90%E8%BF%94%E5%AE%B6%E5%BE%8C%E7%8C%9D%E6%AD%BB%E9%81%B2%E6%9C%AA%E8%B5%B7%E5%BA%8A%E5%AE%B6%E5%B1%AC%E7%99%BC%E7%8F%BE%E8%BA%AB%E9%AB%94%E5%B7%B2%E5%86%B0%E5%86%B7-084556825.html
標題:台商拒繳百萬罰金 9土地全查封
網址:https://tw.news.yahoo.com/%E5%8F%B0%E5%95%86%E9%81%95%E5%8F%8D%E6%AA%A2%E7%96%AB%E6%8B%92%E7%B9%B3100%E8%90%AC-%E9%81%AD%E6%9F%A5%E5%B0%819%E7%AD%86%E5%9C%9F%E5%9C%B0%E9%99%90%E5%88%B6%E5%87%BA%E5%A2%83-075701529.html
標題:2道禁令 看教部怎輕賤大學自治
網址:https://tw.news.yahoo.com/%E5%BE%9E%E5%85%A9%E9%81%93%E7%A6%81%E4%BB%A4%E7%9C%8B%E6%95%99%E8%82%B2%E9%83%A8%E5%A6%82%E4%BD%95%E8%BC%95%E8%B3%A4%E5%A4%A7%E5%AD%B8%E8%87%AA%E6%B2%BB-024840204.html
標題:日職史上最強洋將 台灣人上榜
網址:https://tw.news.yahoo.com/%E6%97%A5%E8%81%B7-12%E7%90%83%E5%9C%98%E5%8F%B2%E4%B8%8A%E6%9C%80%E5%BC%B7%E6%B4%8B%E5%B0%87-%E9%83%AD%E6%BA%90%E6%B2%BB%E5%94%AF-%E5%8F%B0%E7%81%A3%E4%BB%A3%E8%A1%A8-070130874.html
標題:難忘對決 柯瑞弟:守哥哥很怪
網址:https://tw.news.yahoo.com/nba-%E8%87%AA%E6%9B%9D%E7%94%9F%E6%B6%AF%E6%9C%80%E9%9B%A3%E5%BF%98%E6%99%82%E5%88%BB-%E6%9F%AF%E7%91%9E%E5%BC%9F-%E9%98%B2%E5%AE%88%E5%93%A5%E5%93%A5%E7%9C%9F%E7%9A%84%E5%BE%88%E6%80%AA-040106396.html
標題:還沒開打 韓職先談美國轉播權
網址:https://tw.sports.yahoo.com/news/%E9%9F%93%E8%81%B7-%E9%82%84%E6%B2%92%E9%96%8B%E6%89%93-%E5%82%B3espn%E6%8E%A5%E6%B4%BD%E6%B5%B7%E5%A4%96%E8%BD%89%E6%92%AD%E6%AC%8A-034855692.html
標題:老爹好暖心 送永久季票給醫護
網址:https://tw.sports.yahoo.com/news/mlb-%E6%AD%90%E6%8F%90%E8%8C%B2%E9%A9%9A%E5%96%9C%E5%AE%A2%E4%B8%B2-%E4%BB%A3%E8%A1%A8%E7%B4%85%E8%A5%AA%E9%80%81%E6%B0%B8%E4%B9%85%E5%AD%A3%E7%A5%A8%E7%B5%A6%E9%86%AB%E8%AD%B7-031551366.html
標題:難啟齒毛病 媽媽靠這3招來救
網址:https://tw.sports.yahoo.com/news/%E9%9B%A3%E4%BB%A5%E5%95%9F%E9%BD%92-%E5%AA%BD%E5%AA%BD%E5%80%91%E6%9C%80%E5%B0%B7%E5%B0%AC%E7%9A%84%E6%AF%9B%E7%97%85-%E9%9D%A0%E9%80%993%E6%8B%9B%E5%BE%92%E6%89%8B%E9%81%8B%E5%8B%95%E6%94%B9%E5%96%84-044242146.html
標題:新戲剛殺青 金秀賢宣布當媽了
網址:https://tw.news.yahoo.com/35%E6%AD%B2%E9%87%91%E7%A7%80%E8%B3%A2-%E5%81%9A%E4%BA%BA%E6%88%90%E5%8A%9F-%E6%87%B7%E5%AD%9515%E9%80%B1%E5%95%A6-062548460.html
標題:曾入圍金馬 高盟傑涉毒被起訴
網址:https://tw.news.yahoo.com/%E9%AB%98%E7%9B%9F%E5%82%91%E8%B2%A9%E6%AF%92%E5%B9%B4%E5%88%9D%E7%8D%B2%E4%B8%8D%E8%B5%B7%E8%A8%B4-%E7%BA%8C%E6%9F%A5%E9%80%86%E8%BD%89%E6%81%90%E9%9D%A2%E5%B0%8D%E7%84%A1%E6%9C%9F%E5%BE%92%E5%88%91%E9%87%8D%E5%88%91-062751731.html
標題:大兒子露臉 林志穎全家顏值高
網址:https://tw.news.yahoo.com/%E6%9E%97%E5%BF%97%E7%A9%8E-%E5%AE%B6%E5%85%A8%E9%83%BD%E9%AB%98%E9%A1%8F%E5%80%BC-%E5%A4%A7%E5%85%92%E5%AD%90kimi%E7%BD%95%E8%A6%8B%E9%9C%B2%E6%AD%A3%E8%87%89-050848260.html
標題:染疫痊癒 湯姆漢克斯平頭復工
網址:https://movies.yahoo.com.tw/article/%E6%B9%AF%E5%A7%86%E6%BC%A2%E5%85%8B%E6%96%AF%E8%82%BA%E7%82%8E%E7%97%8A%E7%99%92-%E5%B9%B3%E9%A0%AD%E5%BE%A9%E5%B7%A5%E5%B9%BD%E9%BB%98%E4%B8%BB%E6%8C%81-130927398.html
標題:防疫宅在家 最夯影集前5曝光
網址:https://movies.yahoo.com.tw/article/%E7%AC%AC%E4%B8%80%E5%90%8D%E4%B8%8D%E6%98%AF%E9%99%B0%E5%B1%8D%E8%B7%AF%E9%98%B2%E7%96%AB%E6%9C%9F%E9%96%93%E6%9C%80%E5%A4%AF%E5%BD%B1%E9%9B%86-top-5-%E6%A6%9C%E5%96%AE%E5%85%AC%E9%96%8B-113057654.html
標題:背50kg一包狂奔 他爽拿3.8萬
網址:https://tw.news.yahoo.com/%E8%8B%B1%E5%9C%8B%E6%80%AA%E6%AF%94%E8%B3%BD-%E6%8F%B9%E7%85%A4%E7%82%AD%E8%B7%AF%E8%B7%91%E7%B4%AF%E5%A3%9E%E5%8F%83%E8%B3%BD%E8%80%85-084241809.html
標題:CP值最低科系 法律系竟然上榜
網址:https://tw.news.yahoo.com/%E5%93%AA%E5%80%8B%E7%A7%91%E7%B3%BBcp%E5%80%BC%E6%9C%80%E4%BD%8E-%E7%B6%B2%E7%8B%82%E6%8E%A8%E9%80%99%E7%B3%BB-%E6%88%90%E6%9C%AC%E9%AB%98%E9%9B%A3%E5%9B%9E%E6%94%B6-083700140.html
標題:空拍見詭異1棟 廢棄30年原因曝
網址:https://tw.travel.yahoo.com/news/%E6%88%91%E6%98%AF%E8%AA%B0%E6%88%91%E5%9C%A8%E5%93%AA%E8%AE%93%E4%BA%BA%E5%BD%B7%E5%BD%BF%E7%BD%AE%E8%BA%AB%E5%B9%B3%E8%A1%8C%E5%AE%87%E5%AE%99%E7%9A%84%E5%9C%B0%E9%BB%9E-040024452.html
標題:排隊人龍沒停過 名氣傳到日本
網址:https://tw.travel.yahoo.com/news/%E6%99%AF%E7%BE%8E%E4%BA%BA%E6%B0%A3%E6%8E%92%E9%9A%8A%E6%97%A9%E9%A4%90%E5%BA%97%E5%A4%A7%E4%BB%BD%E9%87%8F%E7%87%92%E9%A4%85%E6%B2%B9%E6%A2%9D%E8%83%A1%E6%A4%92%E9%A4%85%E8%9B%8B%E9%A4%85%E9%83%BD%E5%A5%BD%E5%90%83-152519874.html
標題:醫護包超美口罩 萬人歪樓暴動
網址:https://tw.news.yahoo.com/%E9%86%AB%E8%AD%B7%E5%8C%85%E8%B6%85%E7%BE%8E-%E4%B8%B9%E5%AF%A7%E8%89%B2%E5%8F%A3%E7%BD%A9-1-5%E8%90%AC%E4%BA%BA%E6%AD%AA%E6%A8%93%E6%9A%B4%E5%8B%95-%E7%AB%8B%E5%88%BB%E5%8E%BB%E6%8E%92-232001283.html
標題:尪破產欠14億!劉濤花4年還清
網址:https://tw.style.yahoo.com/%E8%80%81%E5%85%AC%E7%A0%B4%E7%94%A2%E7%A9%8D%E6%AC%A0%E4%B8%8A%E5%84%84-%E5%8A%89%E6%BF%A44%E5%B9%B425%E9%83%A8%E6%88%B2%E9%82%84%E5%82%B5%E5%8B%99-%E4%B8%88%E5%A4%AB%E5%BF%83%E7%96%BC-%E5%AB%81%E5%85%A5%E8%B1%AA%E9%96%80%E5%8D%BB%E6%B2%92%E4%BA%AB%E7%A6%8F-000000099.html
標題:木村嫂17年前母女照 網大讚了
網址:https://tw.style.yahoo.com/%E6%9C%A8%E6%9D%91%E5%85%89%E5%B8%8C%E7%99%BC-17-%E5%B9%B4%E5%89%8D%E8%88%87%E6%AF%8D%E5%90%8C%E6%A1%86%E8%80%81%E7%85%A7%E7%89%87%E7%B6%B2%E6%8F%AD%E5%B7%A5%E8%97%A4%E9%9D%9C%E9%A6%99%E6%9C%80%E5%A4%A7%E8%AE%8A%E5%8C%96-015020820.html
標題:22歲身家300億 全球最年輕富豪
網址:https://tw.style.yahoo.com/22%E6%AD%B2-%E7%A4%BE%E7%BE%A4%E5%A5%B3%E7%8E%8B-%E5%87%B1%E8%8E%89%E7%8F%8D%E5%A8%9C%E9%80%A3%E5%85%A9%E5%B9%B4%E7%99%BB%E4%B8%8A-%E5%85%A8%E7%90%83%E6%9C%80%E5%B9%B4%E8%BC%95%E5%AF%8C%E8%B1%AA-%E5%AF%B6%E5%BA%A7-010300859.html
標題:瘦15kg減重專家:伸展消贅肉
網址:https://bit.ly/2JZeAVM
標題:15秒賣1瓶驚人紀錄 MIT獲好評
網址:https://tw.style.yahoo.com/%E4%B9%BE%E8%82%8C%E4%BA%BA%E6%AF%9B%E5%AD%94%E7%B2%97%E5%A4%A7%E5%BF%85%E8%B2%B7drwu%E8%B6%85%E5%88%92%E7%AE%97%E7%B5%84%E5%90%88%E5%B9%AB%E4%BD%A0%E7%95%AB%E5%A5%BD%E9%87%8D%E9%BB%9E%E5%95%A6-030522190.html
標題:夫子曝弱點「天空城2」一杯倒
網址:https://wetv.info/jctcc1#0414wtt
標題:法醫系撩妹「小美滿」搞笑片段
網址:https://wetv.info/smm8#0414wt
標題:將皇后打冷宮!宋仁宗愛上她
網址:https://wetv.info/qcb6#0414wt
標題:吳奇隆劇中護地下情 身世揭密
網址:https://www.litv.tv/vod/drama/content.do?content_id=VOD00165496&sponsorName=eWFob28=&autoPlay=1#0414hh
標題:你有危機意識 唐老師心測4選1
網址:https://tw.tv.yahoo.com/ent-jessetang/%E6%9B%B8%E5%81%89%E5%8D%B1%E6%A9%9F%E6%84%8F%E8%AD%98%E8%B6%85%E4%BD%8E-%E5%94%90%E8%80%81%E5%B8%AB%E5%8B%B8%E4%B8%96%E5%BF%83%E5%A4%A7%E7%88%86%E7%99%BC-080212366.html#0414hh
```python=
import requests
import time
from bs4 import BeautifulSoup
import os
import re
import urllib.request
import json
PTT_URL = 'https://www.ptt.cc'
#取得網頁文件的function
def get_web_page(url):
time.sleep(0.5) # 每次爬取前暫停 0.5 秒以免被 PTT 網站判定為大量惡意爬取
resp = requests.get(
url=url,
cookies={'over18': '1'}
)
if resp.status_code != 200:
print('Invalid url:', resp.url)
return None
else:
return resp.text
def get_articles(dom, date):
soup = BeautifulSoup(dom, 'html.parser')
# 取得上一頁的連結
paging_div = soup.find('div', 'btn-group btn-group-paging')
prev_url = paging_div.find_all('a')[1]['href']
articles = [] # 儲存取得的文章資料
divs = soup.find_all('div', 'r-ent')
for d in divs:
if d.find('div', 'date').string.strip() == date: # 發文日期正確
# 取得推文數
push_count = 0
if d.find('div', 'nrec').string:
try:
push_count = int(d.find('div', 'nrec').string) # 轉換字串為數字
except ValueError: # 若轉換失敗,不做任何事,push_count 保持為 0
pass
# 取得文章連結及標題
if d.find('a'): # 有超連結,表示文章存在,未被刪除
href = d.find('a')['href']
title = d.find('a').string
articles.append({
'title': title,
'href': href,
'push_count': push_count
})
return articles, prev_url#回傳這一頁的文章和上一頁
def parse(dom):
soup = BeautifulSoup(dom, 'html.parser')
links = soup.find(id='main-content').find_all('a')
img_urls = []
for link in links:
if re.match(r'^https?://(i.)?(m.)?imgur.com', link['href']):
img_urls.append(link['href'])
return img_urls
def save(img_urls, title):
if img_urls:
try:
dname = title.strip() # 用 strip() 去除字串前後的空白
os.makedirs(dname)
for img_url in img_urls:
if img_url.split('//')[1].startswith('m.'):
img_url = img_url.replace('//m.', '//i.')
if not img_url.split('//')[1].startswith('i.'):
img_url = img_url.split('//')[0] + '//i.' + img_url.split('//')[1]
if not img_url.endswith('.jpg'):
img_url += '.jpg'
fname = img_url.split('/')[-1]
urllib.request.urlretrieve(img_url, os.path.join(dname, fname))
except Exception as e:
print(e)
if __name__ == '__main__':
current_page = get_web_page(PTT_URL + '/bbs/Beauty/index.html')
if current_page:
articles = [] # 全部的今日文章
date = time.strftime("%m/%d").lstrip('0') # 今天日期, 去掉開頭的 '0' 以符合 PTT 網站格式
current_articles, prev_url = get_articles(current_page, date) # 目前頁面的今日文章
while current_articles: # 若目前頁面有今日文章則加入 articles,並回到上一頁繼續尋找是否有今日文章
articles += current_articles
current_page = get_web_page(PTT_URL + prev_url)
current_articles, prev_url = get_articles(current_page, date)
# 已取得文章列表,開始進入各文章讀圖
for article in articles:
print('Processing', article)
page = get_web_page(PTT_URL + article['href'])
if page:
img_urls = parse(page)
save(img_urls, article['title'])
article['num_image'] = len(img_urls)
# 儲存文章資訊
with open('data.json', 'w', encoding='utf-8') as f:
json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False)
```
Processing {'title': '[正妹] 健身正妹 兇', 'href': '/bbs/Beauty/M.1586837739.A.E4B.html', 'push_count': 0}
Processing {'title': '[神人] 那些年的無名網紅現況 ?', 'href': '/bbs/Beauty/M.1586838094.A.8F5.html', 'push_count': 16}
[WinError 123] 檔案名稱、目錄名稱或磁碟區標籤語法錯誤。: '[神人] 那些年的無名網紅現況 ?'
Processing {'title': '[正妹] 兇屁', 'href': '/bbs/Beauty/M.1586844061.A.508.html', 'push_count': 16}
Processing {'title': '[廣告] 絕世美少女 明里紬(明里つむぎ) ', 'href': '/bbs/Beauty/M.1586845504.A.7BD.html', 'push_count': 27}
Processing {'title': '[正妹] 衣服很好看', 'href': '/bbs/Beauty/M.1586848942.A.70C.html', 'push_count': 4}
Processing {'title': '[正妹] 包口罩', 'href': '/bbs/Beauty/M.1586852305.A.69A.html', 'push_count': 4}
Processing {'title': '[正妹] 正', 'href': '/bbs/Beauty/M.1586852589.A.6CA.html', 'push_count': 2}
Processing {'title': '[正妹] 你是不是想……', 'href': '/bbs/Beauty/M.1586853720.A.516.html', 'push_count': 23}
Processing {'title': '[正妹] 道歉時露出胸部常識吧', 'href': '/bbs/Beauty/M.1586858766.A.E7B.html', 'push_count': 31}
Processing {'title': '[神人] 業配妹子', 'href': '/bbs/Beauty/M.1586863419.A.447.html', 'push_count': 0}
Processing {'title': '[新聞]府院齊轟譚德塞!"宜蘭女孩"林薇發聲破百萬', 'href': '/bbs/Beauty/M.1586864350.A.BE9.html', 'push_count': 0}
Processing {'title': '[帥哥] 你猜的沒錯,又是這張帥臉!', 'href': '/bbs/Beauty/M.1586796404.A.25D.html', 'push_count': 0}
Processing {'title': '[正妹] 音樂老師', 'href': '/bbs/Beauty/M.1586817728.A.88B.html', 'push_count': 18}
Processing {'title': '[正妹] 清水あいり', 'href': '/bbs/Beauty/M.1586823997.A.8B2.html', 'push_count': 55}
Processing {'title': '[正妹] 自由潛水', 'href': '/bbs/Beauty/M.1586835882.A.6EE.html', 'push_count': 28}
# Assertion (斷言)
When a program reaches a certain point, you may want to assert that it must be in a particular state. To run such an assertion test at that point, Python provides the assert statement.
```python=
class Account:
    def __init__(self,number,name):
        self.number=number
        self.name=name
        self.balance=0
    def deposit(self,amount):
        assert amount>0,'必須是大於0的正數'
        self.balance+=amount
    def withdraw(self,amount):
        assert amount>0,'必須是大於0的正數'
        if amount<=self.balance:
            self.balance-=amount
        else:
            raise RuntimeError('balance not enough')
a=Account('E122',"Moninca")
a.deposit(0)
```
AssertionError Traceback (most recent call last)
<ipython-input-6-31eeed44e7f0> in <module>
14 raise RuntimeError('balance not enough')
15 a=Account('E122',"Moninca")
---> 16 a.deposit(0)
<ipython-input-6-31eeed44e7f0> in deposit(self, amount)
5 self.balance=0
6 def deposit(self,amount):
----> 7 assert amount>0,'必須是大於0的正數'
8 self.balance+=amount
9 def withdraw(self,amount):
AssertionError: 必須是大於0的正數
# Element Locating Methods (a short sketch using a few of these follows the list)
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
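A minimal sketch of a few of these locators, reusing the python.org search box from the example below (the chromedriver path matches the one used throughout these notes):
```python=
from selenium import webdriver
browser=webdriver.Chrome('/Users/ASUS/chromedriver')
browser.get('http://www.python.org')
# the same search box reached through three different locators
by_name=browser.find_element_by_name('q')
by_css=browser.find_element_by_css_selector('input[name="q"]')
by_xpath=browser.find_element_by_xpath('//input[@name="q"]')
print(by_name==by_css==by_xpath)   # True: all three point at the same element
browser.quit()
```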
```python=
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
#
brower=webdriver.Chrome('/Users/ASUS/chromedriver')
url='http://www.python.org'
#
brower.get(url)
assert 'Python' in brower.title
elem=brower.find_element_by_name('q')
elem.clear()
elem.send_keys('pycon')
elem.send_keys(Keys.RETURN)
print(brower.page_source)
```
Automatically runs the search on the page for you.
```python=
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
brower=webdriver.Chrome('/Users/ASUS/chromedriver')
url='https://www.google.com/'
brower.get(url)
assert 'Google' in brower.title
elem=brower.find_element_by_name('q')
elem.clear()
elem.send_keys('word')
elem.send_keys(Keys.RETURN)
print(brower.page_source)
```
Automatically searches Google for "word".
```python=
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
brower=webdriver.Chrome('/Users/ASUS/chromedriver')
url='https://www.baidu.com/'
brower.get(url)
assert '百度一下,你就知道' in brower.title
elem=brower.find_element_by_id('kw')
elem.clear()
elem.send_keys('python')
elem.send_keys(Keys.RETURN)
print(brower.page_source)
```
Automatically searches Baidu for "python".
```python=
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
# launch the Chrome browser
def openChrome():
    option=webdriver.Chrome('/Users/ASUS/chromedriver')
    return option
# drive the search operation
def operationAuth(driver):
url='https://www.baidu.com/'
driver.get(url)
#
elem=driver.find_element_by_id('kw')
elem.send_keys('selenium')
driver.find_element_by_xpath("//*[@id='su']").click()#click()是幫你按搜尋鍵 等同於ENTER
print('查詢操作完畢')
if __name__=='__main__':
driver=openChrome()
operationAuth(driver)
```
查詢操作完畢
XPath location styles (a short sketch of these follows):
1.absolute path (/html/body/...)
2.relative path (//div[@class='red']/a)
3.by tag attribute (//div[@class='red'])
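A sketch of these styles against the Baidu search box already used above; the absolute path is commented out because it depends on the page's exact structure:
```python=
from selenium import webdriver
driver=webdriver.Chrome('/Users/ASUS/chromedriver')
driver.get('https://www.baidu.com/')
# 1. absolute path (fragile, shown only for the syntax):
# driver.find_element_by_xpath('/html/body/div/div/form/span/input')
# 2. relative path:
elem=driver.find_element_by_xpath('//input[@id="kw"]')
# 3. by tag attribute:
btn=driver.find_element_by_xpath("//*[@id='su']")
elem.send_keys('xpath')
btn.click()
```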
# Mouse Actions
context_click(elem): right-click the element elem
double_click(elem): double-click an element (e.g. to zoom in on a web map)
drag_and_drop(source,target): press the left button on source and drag it onto target
move_to_element(elem): move the mouse over an element
click_and_hold(elem): press and hold the left button on an element (a short ActionChains sketch follows)
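A short sketch that chains a few of these mouse actions on the same demo page used below; the queued actions only run when perform() is called:
```python=
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
driver=webdriver.Chrome('/Users/ASUS/chromedriver')
driver.get('https://www.jianshu.com/')
elem=driver.find_element_by_tag_name('img')   # any visible element works for the demo
actions=ActionChains(driver)
actions.move_to_element(elem)   # hover over the element
actions.double_click(elem)      # double-click it
actions.context_click(elem)     # then right-click it
actions.perform()               # the queued actions run only here
```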
```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import pyautogui
from time import sleep
driver=webdriver.Chrome('/Users/ASUS/chromedriver')
driver.get('https://www.jianshu.com/')
wait=WebDriverWait(driver,10)
img=wait.until(EC.element_to_be_clickable((By.TAG_NAME,'img')))
actions=ActionChains(driver)
actions.context_click(img)
actions.perform()
pyautogui.typewrite(['down','down','down','down','down','down','down','enter','enter'])
sleep(1)
pyautogui.typewrite(['enter'])
```
Downloads an image from the page for you (via the right-click context menu).

```python=
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
Browser = webdriver.Chrome('/Users/ASUS/chromedriver')
LoginUrl= ('https://member.ithome.com.tw/login')
UserName= ('fv744850')
UserPass= ('rjrs791120')
Browser.get(LoginUrl)
Browser.find_element_by_id('account').send_keys(UserName)
Browser.find_element_by_id('password').send_keys(UserPass)
Browser.find_element_by_id('password').send_keys(Keys.ENTER)
Browser.save_screenshot('test.png')
Browser.quit()
```
# Keyboard Actions
Keyboard actions are provided through Keys in selenium.webdriver.common.keys:
send_keys(Keys.ENTER)
send_keys(Keys.TAB)
send_keys(Keys.SPACE)#space bar
send_keys(Keys.ESCAPE)#Esc key
send_keys(Keys.BACK_SPACE)#backspace
send_keys(Keys.SHIFT)
send_keys(Keys.CONTROL)#Ctrl
send_keys(Keys.ARROW_DOWN)#down arrow
send_keys(Keys.CONTROL,'a')#Ctrl+A
send_keys(Keys.CONTROL,'c')#Ctrl+C
send_keys(Keys.CONTROL,'x')#Ctrl+X
send_keys(Keys.CONTROL,'v')#Ctrl+V
```python=
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver=webdriver.Chrome('/Users/ASUS/chromedriver')
driver.get('http://www.baidu.com')
#輸入框輸入內容
elem=driver.find_element_by_id('kw')
elem.send_keys('Eatment CSON')
time.sleep(3)
#刪除一個字元
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
time.sleep(3)
#輸入空格+'部落格'
elem.send_keys(Keys.SPACE)
elem.send_keys(u'部落格')
time.sleep(3)
#ctrl+a select all the text in the input box
elem.send_keys(Keys.CONTROL,'a')
time.sleep(3)
#ctrl+x
elem.send_keys(Keys.CONTROL,'x')
time.sleep(3)
#ctrl+v paste it back into the input box, then search
elem.send_keys(Keys.CONTROL,'v')
time.sleep(3)
driver.find_element_by_id('su').send_keys(Keys.ENTER)
time.sleep(3)
driver.quit()
```
Runs the search on Baidu for you and then closes the window.
```python=
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver=webdriver.Chrome('/Users/ASUS/chromedriver')
driver.get('https://www.google.com/')
elem=driver.find_element_by_name('q')
elem.send_keys(u'聯成電腦')
time.sleep(3)
elem.send_keys(Keys.SPACE)
elem.send_keys(u'Java')
time.sleep(3)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
time.sleep(3)
elem.send_keys(u'Python 聯成電腦')
time.sleep(3)
elem.send_keys(Keys.CONTROL,'a')
time.sleep(1)
elem.send_keys(Keys.CONTROL,'x')
time.sleep(1)
elem.send_keys(Keys.CONTROL,'v')
time.sleep(3)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
elem.send_keys(Keys.BACK_SPACE)
time.sleep(3)
elem.send_keys(Keys.ENTER)
time.sleep(3)
driver.quit()
```
Runs the search through Google.
# The Scrapy Package
Scrapy has a built-in mechanism for extracting data: selectors, which use XPath or CSS
expressions to pick out specific parts of an HTML document.
XPath was designed for selecting nodes in XML and can also be used with HTML.
CSS is the language for applying styles to HTML; CSS selectors reuse its syntax.
Scrapy's selectors are built on the lxml library, so both speed and parsing accuracy are better than BeautifulSoup.
Using selectors:
1.Selector: the selector class itself
2.Response: the class that wraps a downloaded HTTP response; its subclasses include TextResponse, HtmlResponse (which can auto-detect the encoding via the HTTP meta http-equiv header) and XmlResponse
Scrapy locates elements with XPath and CSS; a selector has these basic methods:
1.xpath() selects nodes with an XPath expression
2.css() selects nodes with CSS syntax (see the short sketch after this list)
3.extract() returns the selected elements as unicode strings
4.extract_first() returns the first matched unicode string (SelectorList only)
5.re() extracts unicode strings with a regular expression
6.re_first() returns the first regular-expression match (SelectorList only)
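The examples below all use xpath(); as a quick illustration of css(), here is a minimal sketch with the same kind of inline HTML:
```python=
from scrapy.selector import Selector
body='<html><body><span class="price">good</span></body></html>'
sel=Selector(text=body)
# ::text extracts the text node, ::attr(name) would extract an attribute value
print(sel.css('span.price::text').extract())        # ['good']
print(sel.css('span.price::text').extract_first())  # 'good'
```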
```python=
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body='<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()
```
['good']
```python=
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body='''<html>
<title>123</title>
<body>
<h1>Hello World</h1>
<h1>Hello Python</h1>
<b>Hello Java</b>
<ul>
<li>C++</li>
<li>C#</li>
<li>Python</li>
</ul>
</body>
</html>'''
response=HtmlResponse(url='http://example.com',body=body,encoding='utf-8')
selector=Selector(response=response)
response.selector.xpath('//title/text()')
```
[<Selector xpath='//title/text()' data='123'>]
```python=
a=selector.xpath('//h1/text()').extract()
for i in a:
    print(i)
```
Hello World
Hello Python
```python=
selector_list=selector.xpath('//h1')
for sel in selector_list:
print(sel.xpath('./text()'))
```
[<Selector xpath='./text()' data='Hello World'>]
[<Selector xpath='./text()' data='Hello Python'>]
```python=
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
text='''<ul>
<li>Python 學習手冊 <b>價格:99.00元</b></li>
<li>Python 大數據分析 <b>價格:88.00元</b></li>
<li>Spark 數據分析 <b>價格:97.00元</b></li>
</ul>'''
select=Selector(text=text)
print(select.xpath('//li/b/text()'))
print(select.xpath('//li/b/text()').extract())
print(select.xpath('//li/b/text()').extract_first())
print(select.xpath('//li/b/text()').re('\d+\.\d+'))
print(select.xpath('//li/b/text()').re_first('\d+\.\d+'))
```
[<Selector xpath='//li/b/text()' data='價格:99.00元'>, <Selector xpath='//li/b/text()' data='價格:88.00元'>, <Selector xpath='//li/b/text()' data='價格:97.00元'>]
['價格:99.00元', '價格:88.00元', '價格:97.00元']
價格:99.00元
['99.00', '88.00', '97.00']
99.00
```python=
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'>
<a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'>
<a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'>
<a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'>
<a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'> </div>
</body>
</html>'''
select=Selector(text=body).xpath('//a/text()').extract()
for i in select:
print(i)
```
Name: Image 1
Name: Image 2
Name: Image 3
Name: Image 4
Name: Image 5
```python=
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'>
<a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'>
<a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'>
<a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'>
<a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'> </div>
</body>
</html>'''
response=HtmlResponse(url='http://example.com',body=body,encoding='utf-8')
selector=Selector(response=response)
a=response.selector.xpath('//a/text()').extract()
for i in a:
print(i)
```
Name: Image 1
Name: Image 2
Name: Image 3
Name: Image 4
Name: Image 5
```python=
sel=response.xpath('//img/@src').extract()
for s in sel:
print(s)
```
image1.jpg
image2.jpg
image3.jpg
image4.jpg
image5.jpg
```python=
sel=response.xpath('//a/@href').extract()
for s in sel:
print(s)
```
image1.html
image2.html
image3.html
image4.html
image5.html
```python=
print(response.xpath('//a[3]/@href').extract_first())
```
image3.html
```python=
#使用HtmlResponse
from scrapy.http import HtmlResponse
html1=open('example1.html',encoding='utf-8').read()
html2=open('example2.html',encoding='utf-8').read()
response1=HtmlResponse(url='http://example1.com',body=html1,encoding='utf-8')
response2=HtmlResponse(url='http://example2.com',body=html2,encoding='utf-8')
print(response1)
print(response2)
```
<200 http://example1.com>
<200 http://example2.com>
```python=
#LinkExtractor選擇器
from scrapy.linkextractors import LinkExtractor
le=LinkExtractor()
links=le.extract_links(response1)
[link.url for link in links]
```
['http://example1.com/intro/install.html',
'http://example1.com/intro/tutorial.html',
'http://example1.com/examples.html',
'http://stackoverflow.com/tags/scrapy/info',
'https://github.com/scrapy/scrapy']
```python=
# allow: a regex (or list of regexes) selecting which links to keep; with no arguments all links are extracted
from scrapy.linkextractors import LinkExtractor
pattern='/intro/.+\.html$'
le=LinkExtractor(allow=pattern)
links=le.extract_links(response1)
[link.url for link in links]
```
['http://example1.com/intro/install.html',
'http://example1.com/intro/tutorial.html']
```python=
# deny is the opposite of allow: keep only links that do NOT match the pattern (applied in the short sketch after this block)
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse
pattern='^'+urlparse(response1.url).geturl()
pattern
```
'^http://example1.com'
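Applying the rule: passing the pattern as deny keeps only the links that do not match it, i.e. the external stackoverflow/github links from the earlier output (a sketch continuing from response1 above):
```python=
from scrapy.linkextractors import LinkExtractor
le=LinkExtractor(deny=pattern)   # drop links that match the site's own domain
links=le.extract_links(response1)
[link.url for link in links]
```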
# Scrapy startproject

```python=
import scrapy
class BooksSpider(scrapy.Spider):
    # unique name identifying this spider
    name='books'
    # starting URL(s) for the crawl; there can be more than one
    start_urls=['http://books.toscrape.com/']
    def parse(self,response):
        # extract the data
        # each book's info sits inside an <article class="product_pod">
        # use the css method to find every such article
        for book in response.css('article.product_pod'):
            name=book.xpath('./h3/a/@title').extract_first()
            price=book.css('p.price_color::text').extract_first()
            yield{'name':name,'price':price,}
        # extract the next-page link
        # the next page's url is in ul.pager > li.next > a
        next_url=response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            # build the absolute url and a new Request for the next page
            next_url=response.urljoin(next_url)
            yield scrapy.Request(next_url,callback=self.parse)
```

Written using VS Code.
The Scrapy framework provides two Item Pipelines dedicated to downloading files and images:
FilesPipeline & ImagesPipeline (downloadable files include PDFs and program files); a minimal settings sketch follows.
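A minimal configuration sketch for enabling ImagesPipeline in a project's settings.py; the field and setting names follow Scrapy's documented defaults, and the store path is only an example:
```python=
# settings.py: enable the built-in images pipeline (requires Pillow)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'downloaded_images'   # directory where downloaded images are saved

# In the spider, yield items with the default field names:
#   image_urls : list of image URLs to fetch
#   images     : filled in by the pipeline with the download results
# For FilesPipeline the setting is FILES_STORE and the fields are file_urls / files.
```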
