---
title: Python Web Scraping 2020-11-25
tags: web scraping basics
---
# Python Web Scraping 2020-11-25
## CSV Data Conversion
- Import the CSV library in the program that needs it:
  `import csv`
- Load a CSV file stream:
  `result_list = list(csv.reader(file_stream))`
  Pass in a variable opened with `codecs.open`.
  To parse a string directly, use `io.StringIO(string)` to turn the string into a file stream.
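A minimal sketch of the `io.StringIO` approach described above (the CSV content here is made up for illustration):

```python
import csv
import io

# Wrap a plain string in a file-like stream so csv.reader can parse it
text = "name,price\napple,30\nbanana,20"
rows = list(csv.reader(io.StringIO(text)))
print(rows)  # [['name', 'price'], ['apple', '30'], ['banana', '20']]
```

Every field comes back as a string; convert numbers yourself if you need them.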
----
#### Listing a CSV file with Python
1. Save a CSV file on the desktop and name it `2020-11-25.csv`.
2. Run the following; it prints the file's rows:
```
import csv
import codecs

with codecs.open("2020-11-25.csv", "r", "utf-8") as f:
    data = list(csv.reader(f))
print(data)
```
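If the file from step 1 doesn't exist yet, a self-contained sketch (the sample rows are illustrative) can write it first and then read it back exactly as above:

```python
import codecs
import csv

# Create a small CSV file, then load it with csv.reader
with codecs.open("2020-11-25.csv", "w", "utf-8") as f:
    f.write("city,temp\nTaipei,25\nKaohsiung,28\n")

with codecs.open("2020-11-25.csv", "r", "utf-8") as f:
    data = list(csv.reader(f))
print(data)  # [['city', 'temp'], ['Taipei', '25'], ['Kaohsiung', '28']]
```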
---
### Exercise: scraping a CSV file from an open-data site
1. Open source: [Dataset search: Leisure & Travel + CSV | Government Open Data Platform](https://data.gov.tw/datasets/search?qs=tid:6402+dtid:253&order=downloadcount)
2. Pick a dataset to scrape and copy its CSV link into Sublime.

```
import csv
import io
import requests

r1 = requests.get("https://gis.taiwan.net.tw/od/01_PRD/%E6%AD%B7%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E5%9C%8B%E6%B0%91%E5%87%BA%E5%9C%8B%E7%9B%AE%E7%9A%84%E5%9C%B0%E4%BA%BA%E6%95%B8%E7%B5%B1%E8%A8%88.csv")
f = io.StringIO(r1.text)  # treat the downloaded text as a file stream
data = list(csv.reader(f))
for d in data:
    print(d[0], d[1], d[5])
```

CMD then prints columns 0, 1 and 5 of every row.
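The same column-selection step can be tried offline with a stand-in for the downloaded text (the column names and numbers below are made up, not the real dataset):

```python
import csv
import io

# Stand-in for r1.text; the real file has the same shape but more rows/columns
sample = (
    "Year,Region,A,B,C,Total\n"
    "2018,Japan,1,2,3,4825948\n"
    "2019,Japan,1,2,3,4911681\n"
)
data = list(csv.reader(io.StringIO(sample)))
for d in data:
    print(d[0], d[1], d[5])  # pick out columns by index
```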

### Exercise: scraping a file from the web with JSON data conversion

1. From open source: [Dataset search: Leisure & Travel + CSV | Government Open Data Platform](https://data.gov.tw/datasets/search?qs=tid:6402+dtid:253&order=downloadcount)
2. Pick the JSON dataset to scrape and open it with Firefox.
```
import json
import requests

r1 = requests.get("https://lod2.apc.gov.tw/API/v1/dump/datastore/A53000000A-000041-001")
data = json.loads(r1.text)  # parse the JSON text into Python objects
print(data[0])
```
The printed result is the first record of the dataset.
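`json.loads` can be exercised without the network using a stand-in for `r1.text` (the field names below are illustrative, not the real API schema):

```python
import json

# A JSON array of records, as the API would return
payload = '[{"name": "Alishan", "county": "Chiayi"}, {"name": "Sun Moon Lake", "county": "Nantou"}]'
data = json.loads(payload)
print(data[0])  # {'name': 'Alishan', 'county': 'Chiayi'}
```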

```
import codecs
import json
import requests

i = 0
for p in range(1, 5):
    r1 = requests.get(
        "https://udn.com/api/more",
        params={
            "page": str(p),  # page number from the loop (the original hard-coded "2")
            "id": "",
            "channelID": "1",
            "cate_id": "0",
            "type": "breaknews",
            "totalRecNo": "7795"
        }
    )
    data = json.loads(r1.text)
    for d in data["lists"]:
        with codecs.open("udn.txt", "a", "utf-8") as f:
            f.write(d["title"] + "\r\n")
        r2 = requests.get(d["url"])
        i += 1
        # the udn/ folder must already exist
        with codecs.open("udn/" + str(i) + ".jpg", "wb") as f:
            f.write(r2.content)
```
The `params` in the code can be found in the browser's dev tools: Inspect Element -> Network -> XHR -> Headers. Check the Response tab to confirm it actually contains the news data you want.
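The `params` dict is encoded into the URL's query string; this stdlib sketch shows how the key/value pairs from the block above become the final request URL:

```python
from urllib.parse import urlencode

# The same parameters passed to requests.get via params=
params = {
    "page": "2",
    "id": "",
    "channelID": "1",
    "cate_id": "0",
    "type": "breaknews",
    "totalRecNo": "7795",
}
url = "https://udn.com/api/more?" + urlencode(params)
print(url)
```

This is what requests sends under the hood when you pass `params=` instead of building the URL by hand.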

----
### The BeautifulSoup library
A quick HTML recap:
- HTML is made of tags: `<tag> ... </tag>`
- Tags can carry attributes, written as `name="value"` pairs that work like parameters
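To see tags and attributes concretely, this dependency-free sketch uses the stdlib `html.parser` (instead of BeautifulSoup) to dump each opening tag with its attribute pairs; the sample HTML is made up:

```python
from html.parser import HTMLParser

# Collect every opening tag together with its attribute name/value pairs
class TagDumper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

p = TagDumper()
p.feed('<a href="https://udn.com" class="title">News</a>')
print(p.tags)  # [('a', {'href': 'https://udn.com', 'class': 'title'})]
```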
---
BeautifulSoup is used for HTML parsing:
```
import requests
from bs4 import BeautifulSoup

r1 = requests.get(
    "https://money.udn.com/rank/newest/1001/0/1",
    headers={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0",
        "Cookie":"_ga=GA1.2.101087151.1606287468; _gid=GA1.2.984804416.1606287468; __eruid=10234217-b3ce-4d18-da91-8bc80a4d9ea4; _fbp=fb.1.1606287467888.1458149677; __gads=ID=087af169e3b8c774-2214cf0be5c400b3:T=1606287468:S=ALNI_MbYkdHAELrY-TSJ_3jzK54MzgigYw; truvid_protected={\"val\":\"c\",\"level\":2,\"geo\":\"TW\",\"timestamp\":1606288151}; gliaplayer_ssid=011bf0e1-2eec-11eb-900f-5bb872501f73; _gliaplayer_user_info={%22city%22:%22shibuya%20city%22%2C%22uid%22:%2234233680-2eea-11eb-9d29-11dd3fc99c0b%22%2C%22country%22:%22TW%22%2C%22region%22:%2213%22%2C%22source%22:%22CF%22%2C%22latlong%22:%2235.661971%2C139.703795%22%2C%22ip%22:%2260.250.79.113%22}; _gat_UA-19660006-1=1; _gat_UA-19210365-3=1; last_click_URL=https://money.udn.com/money/story/7307/5042789"
    }
)
#print(r1.text)
b = BeautifulSoup(r1.text, "html.parser")
data = b.find_all("td")  # collect every <td> element
for d in data:
    if "class" not in d.attrs:  # keep only the <td> tags without a class attribute
        print(d.text)
```
### Grabbing the title + URL
```
import requests
from bs4 import BeautifulSoup

r1 = requests.get(
    "https://money.udn.com/rank/newest/1001/0/1",
    headers={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0",
        "Cookie":"_ga=GA1.2.101087151.1606287468; _gid=GA1.2.984804416.1606287468; __eruid=10234217-b3ce-4d18-da91-8bc80a4d9ea4; _fbp=fb.1.1606287467888.1458149677; __gads=ID=087af169e3b8c774-2214cf0be5c400b3:T=1606287468:S=ALNI_MbYkdHAELrY-TSJ_3jzK54MzgigYw; truvid_protected={\"val\":\"c\",\"level\":2,\"geo\":\"TW\",\"timestamp\":1606288151}; gliaplayer_ssid=011bf0e1-2eec-11eb-900f-5bb872501f73; _gliaplayer_user_info={%22city%22:%22shibuya%20city%22%2C%22uid%22:%2234233680-2eea-11eb-9d29-11dd3fc99c0b%22%2C%22country%22:%22TW%22%2C%22region%22:%2213%22%2C%22source%22:%22CF%22%2C%22latlong%22:%2235.661971%2C139.703795%22%2C%22ip%22:%2260.250.79.113%22}; _gat_UA-19660006-1=1; _gat_UA-19210365-3=1; last_click_URL=https://money.udn.com/money/story/7307/5042789"
    }
)
#print(r1.text)
b = BeautifulSoup(r1.text, "html.parser")
data = b.find_all("td")
for d in data:
    if "class" not in d.attrs:
        data2 = d.find("a")
        if data2 is None:  # skip <td> tags that contain no link
            continue
        print(data2.attrs["href"])
        print(data2.text)
        print("")
        #print(d.text)  # print the title only
```

