Web Scraping Tutorial 4: Download Images in Binary Mode

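# Web Scraping Tutorial 4: Download Images in Binary Mode

This tutorial uses two third-party packages and one standard-library module:

> 1. requests
> 2. bs4
> 3. urllib (built-in)

Now that we have scraped text content, let's briefly look at downloading images. Image data has to be fetched and saved as raw binary. This differs from the text we scraped earlier, which is handled as Unicode strings. As a result, both the way we read the response and the way we write the file change. As a quick demonstration, let's scrape the logo from the National Taiwan University of Science and Technology (NTUST) website.

---

As usual, we start by inspecting the page. The logo image is wrapped in a `div` whose class is `logo`, so we use that element as the starting point for extraction.

![](https://i.imgur.com/gM2NU3X.png)

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.ntust.edu.tw/?Lang=zh-tw"
response = requests.get(url)
# Name the parser explicitly so bs4 does not have to guess (and warn)
htmlpage = BeautifulSoup(response.text, "html.parser")

imgpot = htmlpage.find("div", {"class": "logo"})  # the div wrapping the logo
imglink = imgpot.find("img").get("src")           # value of the img tag's src attribute
print(imglink)
```

The value we get back turns out to be a relative path, so we need to put the site's domain in front of it before we can request the image.

```python
import requests
from bs4 import BeautifulSoup
from urllib import parse  # splits a URL into its components

url = "https://www.ntust.edu.tw/?Lang=zh-tw"
urlparse = parse.urlparse(url)

response = requests.get(url)
htmlpage = BeautifulSoup(response.text, "html.parser")

imgpot = htmlpage.find("div", {"class": "logo"})
imglink = imgpot.find("img").get("src")
print(urlparse.netloc + imglink)  # domain + image path
```

---

Of course, `urlparse` is not limited to the domain. It exposes the other parts of the URL as well:

```python
from urllib import parse  # splits a URL into its components

url = "https://covid-19.nchc.org.tw/api.php?limited=FRA&tableID=3001"
urlparse = parse.urlparse(url)
print(urlparse.scheme)  # protocol
print(urlparse.netloc)  # domain name
print(urlparse.path)    # path on the site
print(urlparse.query)   # query string (the variables in the URL)
```

With the full URL in hand, we can download the image. As mentioned above, images must be stored as binary, so both the fetch and the file write use binary mode.

```python
import requests
from bs4 import BeautifulSoup
from urllib import parse  # splits a URL into its components

url = "https://www.ntust.edu.tw/?Lang=zh-tw"
urlparse = parse.urlparse(url)

response = requests.get(url)
htmlpage = BeautifulSoup(response.text, "html.parser")

imgpot = htmlpage.find("div", {"class": "logo"})
imglink = imgpot.find("img").get("src")

# protocol + domain name + image path
url = urlparse.scheme + "://" + urlparse.netloc + imglink
response = requests.get(url)

# You can pick the file name from the URL. The bytes are written unchanged,
# so the extension (jpg, png, ...) should match the image's actual format.
# "wb" opens the file for writing in binary mode.
with open("download.png", "wb") as file:
    file.write(response.content)  # .content is the response body as raw bytes
```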
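
The string concatenation used above works when the `src` value is a root-relative path that starts with `/`. As a hedged sketch of a more general alternative, the standard library's `urllib.parse.urljoin` resolves a `src` against the page URL whether it is relative, protocol-relative, or already absolute. The `src` values below are made up for illustration and are not taken from the NTUST page, and `filename_from_url` is a hypothetical helper that is not part of the original tutorial.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def filename_from_url(url, default="download.png"):
    """Hypothetical helper: use the last path segment of the URL as a file name."""
    name = posixpath.basename(urlparse(url).path)
    return name or default  # fall back when the path ends with "/"


page_url = "https://www.ntust.edu.tw/?Lang=zh-tw"

# Made-up src values, only to show how each form is resolved.
root_relative = urljoin(page_url, "/images/logo.png")
absolute = urljoin(page_url, "https://example.com/pic.jpg")
protocol_relative = urljoin(page_url, "//cdn.example.com/a.gif")

print(root_relative)
print(absolute)
print(protocol_relative)
print(filename_from_url(root_relative))
```

In the download step, the resolved URL would simply replace the manually concatenated one, and the derived file name could replace the hard-coded `download.png`, so the extension follows whatever the site actually serves.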