# Week 3: Web Crawler
###### tags: `守恩專案`
---
## Installing MariaDB
1. Ubuntu:
https://linuxize.com/post/how-to-install-mariadb-on-ubuntu-18-04/
2. Windows:
https://downloads.mariadb.org/
3. Start and stop the service (on Windows, run Command Prompt as administrator):
`net start mariadb`
`net stop mariadb`
`mysql -uroot -p`
Then log in and set up the DB :arrow_right:
```sql=
/* Create the database */
CREATE DATABASE iot;
/* Create the user */
CREATE USER `iotuser`@`localhost` IDENTIFIED BY 'iotuser1234';
/* Grant privileges */
GRANT ALL PRIVILEGES ON iot.* TO `iotuser`@`localhost`;
/* Reload the privilege tables */
FLUSH PRIVILEGES;
```
## DB software (DB visualization)
1. Install
https://dbeaver.io/
2. Connect Django to MariaDB (activate the virtual environment first)
`pip install pymysql` :arrow_right: connect to the database from Python (a quick connection test follows the note below)
:::info
:bookmark: **Django**
1. A web framework
2. Python back end
:::
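To sanity-check the new account before touching Django, here is a minimal PyMySQL connection test (a sketch; run it inside the virtualenv, using the credentials created above):
```python=
import pymysql

# Smoke test: connect with the account created earlier
conn = pymysql.connect(user="iotuser", password="iotuser1234",
                       host="127.0.0.1", database="iot")
with conn.cursor() as cursor:
    cursor.execute("SELECT VERSION()")
    print(cursor.fetchone())  # e.g. ('10.4.x-MariaDB',)
conn.close()
```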
## Connecting
1. Add the following to `weather_web/weather_web/__init__.py`:
```python=
import pymysql
# The built-in MySQL support is buggy in our setup; this line makes
# PyMySQL stand in as MySQLdb so Django can talk to MySQL/MariaDB
pymysql.install_as_MySQLdb()
```
:::info
:bookmark: `__init__`
Code in `__init__.py` runs when the package is imported, so setup like this takes effect early. (The similarly named `__init__` method on a class defines the initialization done right after an instance is created, and receives the data required at construction time.)
:::
2. List Django's install location
`pip show django`
3. In /user/megan/.virtualenvs/project-y_bGZKJJ/Lib/site-packages/django/db/backends/mysql, modify:
* base.py (on Windows, comment out lines 36 and 37):
```python=
# if version < (1, 3, 13):
#     raise ImproperlyConfigured('mysqlclient 1.3.13 or newer is required; you have %s.' % Database.__version__)
```
:::warning
Recent Django releases require a newer mysqlclient than our PyMySQL shim reports, so this version check has to be disabled.
:::
* settings.py
```python=
DATABASES = {
    'default': {
        # 'ENGINE': 'django.db.backends.sqlite3',
        # 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'iot',
        'USER': 'iotuser',
        'PASSWORD': 'iotuser1234',
        'HOST': '127.0.0.1',
        # 'PORT': '3306',  # default
    }
}
```
:::warning
The default sqlite3 entry is Django's built-in lightweight database; comment it out and add the MariaDB settings instead.
:::
## Nginx
:::info
:bookmark: Nginx
1. Similar to Apache
2. Acts like a mail clerk (think of a post office): it forwards incoming packets (the letters) to the destination IP (the street address)
:::
1. Install
Windows:
http://nginx.org
2. From a terminal in the nginx folder:
`start nginx`
Check that it is running:
`tasklist /fi "imagename eq nginx.exe"`
Graceful shutdown:
`nginx -s quit`
Force kill :arrow_right: `taskkill /f /t /im nginx.exe`
3. conf/nginx.conf
```nginx=
location / {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8000;
}
```
:::warning
1. The `proxy_pass` line forwards requests arriving at localhost to 127.0.0.1:8000
2. You can add multiple `location` blocks to split one domain across several back ends
EX:
```nginx=
location /web1/ {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8000;
}
location /web2/ {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8080;
}
```
:::
4. Edit settings.py
`ALLOWED_HOSTS = ['*']`
:::warning
No host filtering: requests from any host are accepted (handy for development, unsafe in production)
:::
:::info
:pencil: **To reach someone else's site**
1. The other machine must be on the same network as you
2. Run `ipconfig` and note the IPv4 address
3. In your own address bar, enter: <their IPv4>:<port>
4. Their page should pop up
:::
5. opendata
https://data.gov.tw/
https://opendata.epa.gov.tw/api/v1/AQI?format=json
Chrome extension: JSONView (pretty-prints JSON in the browser)
## Web Crawler
### Pulling open data into the DB
1. Create the `environment` table in DBeaver (a creation sketch follows the column list)
2. Column names:
```
SiteName
AQI
PM10
PM2p5
WindSpeed
WindDirec
PublishTime
```
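If you would rather create the table from code than click through DBeaver, here is a minimal sketch; the column types are assumptions (the EPA API returns every field as a string), so adjust them to taste:
```python=
import pymysql

# Assumed schema: VARCHAR everywhere because the API delivers strings;
# PublishTime could equally be a DATETIME column.
ddl = """
CREATE TABLE IF NOT EXISTS environment (
    SiteName    VARCHAR(50),
    AQI         VARCHAR(10),
    PM10        VARCHAR(10),
    PM2p5       VARCHAR(10),
    WindSpeed   VARCHAR(10),
    WindDirec   VARCHAR(10),
    PublishTime VARCHAR(30)
)
"""
conn = pymysql.connect(user="iotuser", password="iotuser1234",
                       host="127.0.0.1", database="iot")
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.close()
```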
3. Install requests
`pip install requests`
:::info
:bookmark: **requests (a handy crawling tool)**
1. Download web data with Python
2. Sends HTTP requests to fetch the specified resource from a web server
3. Reference: https://blog.gtwang.org/programming/python-requests-module-tutorial/
:::
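Before writing the full importer, a quick check that the API responds (a sketch):
```python=
import requests

url = "https://opendata.epa.gov.tw/api/v1/AQI?format=json"
response = requests.get(url)
print(response.status_code)  # expect 200
print(response.json()[0])    # the first monitoring-site record
```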
4. Add testAPI.py
```python=
import requests, json, pymysql

url = "https://opendata.epa.gov.tw/api/v1/AQI?format=json"
# Download the page with a plain GET request
response = requests.get(url)
# Parse the response body as JSON and keep it in objs
objs = json.loads(response.text)
# Connect to the database
conn = pymysql.connect(user="iotuser", password="iotuser1234", host="127.0.0.1", database="iot")
# cursor() returns a cursor object for executing SQL
cursor = conn.cursor()
sql = "INSERT INTO environment (SiteName,AQI,PM10,PM2p5,WindSpeed,WindDirec,PublishTime) VALUES (%s,%s,%s,%s,%s,%s,%s)"
for obj in objs:
    try:
        data = [
            obj["SiteName"],
            obj["AQI"],
            obj["PM10"],
            obj["PM2.5"],
            obj["WindSpeed"],
            obj["WindDirec"],
            obj["PublishTime"]
        ]
        # Stage the row (think: git commit)
        cursor.execute(sql, data)
        # Write it out to the database (think: git push)
        conn.commit()
    except Exception:
        # Skip malformed records
        pass
conn.close()
```
Result: the `environment` table fills with one row per monitoring site (screenshot omitted).
### Yahoo weather forecast
https://tw.news.yahoo.com/weather/台灣/埔里鎮/埔里鎮-2306205
1. Add to models.py:
```python=
class Forecast(models.Model):
    class Meta:
        db_table = 'forecast'
        verbose_name = '天氣預測'
        verbose_name_plural = '天氣預測'
    id = models.AutoField(primary_key=True)
    DateTime = models.DateTimeField()
    Temperature = models.FloatField()
    Precipitation = models.FloatField()
    WindSpeed = models.FloatField()
    WindDirect = models.FloatField()
    Location = models.CharField(max_length=200, null=False)
    RecordTime = models.DateTimeField()
```
:::info
1. models.py defines the data models (everything database-related)
2. admin.py handles the admin back end
3. Reference:
https://ivanjo39191.pixnet.net/blog/post/144744012-python-django-%E5%AD%B8%E7%BF%92%E7%B4%80%E9%8C%84%28%E5%9B%9B%29-%E8%B3%87%E6%96%99%E5%BA%AB%E5%8F%8A%E5%BE%8C%E5%8F%B0%E7%AE%A1%E7%90%86
:::
2. Run
```
python manage.py makemigrations
```
:::info
1. Run this command after you change models.py
2. It creates a migrations directory under the app and records all of your models.py changes, but nothing has been applied to the database yet
3. Generates the migration files
4. Reference: https://blog.csdn.net/liuweiyuxiang/article/details/71150965
:::
```
python manage.py migrate
```
:::info
Applies the recorded changes to the database
:::
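To confirm the migration worked, you can insert a test row from the Django shell (a sketch; the field values are made up):
```python=
# Run inside `python manage.py shell`
from datetime import datetime
from weather.models import Forecast

row = Forecast(DateTime=datetime.now(), Temperature=25.0,
               Precipitation=0.0, WindSpeed=1.5, WindDirect=90.0,
               Location="埔里", RecordTime=datetime.now())
row.save()
print(Forecast.objects.count())  # should now be at least 1
```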
3. Add to admin.py:
```python=
from django.contrib import admin
from .models import Forecast

class ForecastAdmin(admin.ModelAdmin):
    list_display = ['id','DateTime','Temperature','Precipitation','WindSpeed','WindDirect','Location','RecordTime']

# admin.site.register lets the admin UI add, view, edit, and delete the model's data
admin.site.register(Forecast, ForecastAdmin)
```
4. Install
`pip install beautifulsoup4 selenium`
:::info
1. Python library modules
2. With only a tiny amount of code you can quickly parse a page's HTML and extract the data you are interested in
3. How it works:
It reads the HTML source, parses it automatically, and builds a BeautifulSoup object containing the whole HTML document tree; with that tree in hand, any data of interest is easy to find
:::
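A tiny illustration of that flow, on a made-up snippet shaped like the Yahoo page (not the real markup):
```python=
from bs4 import BeautifulSoup

html = '<ul class="hourly"><li><span class="D(ib)">1 PM</span></li></ul>'
soup = BeautifulSoup(html, 'html.parser')        # parse into a document tree
ul = soup.find('ul', attrs={'class': 'hourly'})  # locate the hourly list
for li in ul.find_all('li'):
    print(li.find('span', attrs={'class': 'D(ib)'}).text)  # -> 1 PM
```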
5. Download ChromeDriver
https://chromedriver.chromium.org/
:::info
1. Drives the browser for automated testing of web applications; the version you install must match your Chrome version
2. Selenium needs ChromeDriver to control the Chrome browser
3. Reference:
https://medium.com/@bob800530/selenium-1-%E9%96%8B%E5%95%9Fchrome%E7%80%8F%E8%A6%BD%E5%99%A8-21448980dff9
:::
6. Add testCr.py to crawl the data you want
:::info
:key: The .py file must sit in the same folder as the driver
:::
```python=
# Selenium drives the browser the way a human would
import requests, time
from datetime import datetime, timedelta
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

url = "https://tw.news.yahoo.com/weather/台灣/埔里鎮/埔里鎮-2306205"
# Open a Chrome browser
driver = webdriver.Chrome()
driver.get(url)
# Wait for the page to finish loading
time.sleep(5)
# Find the unit <select> element (inspect the page with F12) by matching its classes
select = Select(driver.find_element_by_xpath("//select[contains(@class, 'Fz(14px)')][contains(@class, 'Bdrs(2px)')][contains(@class, 'Cur(p)')]"))
# temperature (the default view)
html_source1 = driver.page_source
# precipitation
select.select_by_value("precipitation")
time.sleep(3)
html_source2 = driver.page_source
# wind
select.select_by_value("wind")
time.sleep(3)
html_source3 = driver.page_source
driver.close()

time_list = []
temp_list = []
precip_list = []
WD_list = []
WS_list = []
date_ = str(datetime.now()).split(" ")[0]

# html.parser is BeautifulSoup's built-in HTML parser
result1 = BeautifulSoup(html_source1, 'html.parser')
# Grab the list whose class is "hourly" and walk its items
hourly_ul = result1.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    tmp = li.find_all('span', attrs={'class': 'D(ib)'})[0].text.split(" ")
    clock = int(tmp[0])
    # Convert "h AM/PM" to 24-hour "HH:00:00"; 12 AM rolls the date to the next day
    if tmp[1] == "AM":
        if clock == 12:
            t = "00:00:00"
            date_ = str(datetime.now()+timedelta(days=1)).split(" ")[0]
        else:
            if clock < 10:
                t = "0"+str(clock)+":00:00"
            else:
                t = str(clock)+":00:00"
    else:
        if clock == 12:
            t = "12:00:00"
        else:
            t = str(clock+12)+":00:00"
    time_list.append(date_+" "+t)
    temperature = li.find_all('span', attrs={'class': 'D(ib)'})[1].text[:-1]
    temp_list.append(temperature)

result2 = BeautifulSoup(html_source2, 'html.parser')
hourly_ul = result2.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    precipitation = li.find_all('span', attrs={'class': 'D(ib)'})[1].text[:-1]
    precip_list.append(precipitation)

# Wind direction is the angle the wind blows from
result3 = BeautifulSoup(html_source3, 'html.parser')
hourly_ul = result3.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    WS_span = li.find_all('span', attrs={'class': 'D(ib)'})[1]
    WS_obj = WS_span.find('span', attrs={'class': 'D(ib)'})
    WS_list.append(WS_span["title"].split(" ")[0])
    # Pull the wind-direction angle out of the CSS rotate(...) value
    start_index = WS_obj["style"].find("rotate(")+7
    end_index = WS_obj["style"].find("deg)")
    WD_list.append(WS_obj["style"][start_index:end_index])

RecordTime = str(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
url = "http://127.0.0.1:8000/weather/insert_data"
for i in range(len(time_list)):
    data = {
        "DateTime" : time_list[i],
        "Temperature" : temp_list[i],
        "Precipitation" : precip_list[i],
        "WindDirect" : WD_list[i],
        "WindSpeed" : WS_list[i],
        "Location" : "埔里",
        "RecordTime" : RecordTime
    }
    response = requests.post(url, data=data)
    print(response.status_code, response.text)
```
7. Add to weather/views.py:
:::warning
CSRF is a security check
Adding @csrf_exempt above a function turns that check off for it, so the crawler can POST without a CSRF token
:::
```python=
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from .models import Forecast

@csrf_exempt
def insert_data(request):
    DateTime = request.POST.get("DateTime")
    Temperature = request.POST.get("Temperature")
    Precipitation = request.POST.get("Precipitation")
    WindDirect = request.POST.get("WindDirect")
    WindSpeed = request.POST.get("WindSpeed")
    Location = request.POST.get("Location")
    RecordTime = request.POST.get("RecordTime")
    tmp = Forecast()
    tmp.DateTime = DateTime
    tmp.Temperature = Temperature
    tmp.Precipitation = Precipitation
    tmp.WindDirect = WindDirect
    tmp.WindSpeed = WindSpeed
    tmp.Location = Location
    tmp.RecordTime = RecordTime
    tmp.save()
    response = HttpResponse("GET!")
    return response
```
8. Add to weather/urls.py:
```
path('insert_data',views.insert_data),
```
Django queryset field lookups (example below):
* `__gt` greater than
* `__gte` greater than or equal
* `__lt` less than
* `__lte` less than or equal
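For example, a sketch of how these lookups filter the `forecast` table (assuming some rows already exist):
```python=
from datetime import datetime
from weather.models import Forecast

# Forecasts later than right now
future = Forecast.objects.filter(DateTime__gt=datetime.now())
# Forecasts of 30 degrees or hotter
hot = Forecast.objects.filter(Temperature__gte=30)
print(future.count(), hot.count())
```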
weather/views.py
:::info
Build your own API
:::
```python=
from django.http import JsonResponse
from datetime import datetime
from .models import Forecast

def get_forecast_value(request):
    objs = Forecast.objects.filter(DateTime__gte=datetime.now())
    data = []
    for obj in objs:
        data.append({
            "DateTime" : obj.DateTime,
            "Temperature" : obj.Temperature,
            "Precipitation" : obj.Precipitation,
            "WindDirect" : obj.WindDirect,
            "WindSpeed" : obj.WindSpeed,
            "Location" : obj.Location,
            "RecordTime" : obj.RecordTime
        })
    # safe=False lets JsonResponse serialize a list instead of a dict
    response = JsonResponse(data, safe=False)
    return response
```
weather/urls.py
```
urlpatterns = [
    ...
    path('get_forecast_value',views.get_forecast_value),
    ...
]
```
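A quick smoke test of the new endpoint (a sketch; assumes the dev server is running and, as in testCr.py, the app is mounted under /weather/):
```python=
import requests

resp = requests.get("http://127.0.0.1:8000/weather/get_forecast_value")
print(resp.status_code)  # expect 200
for row in resp.json()[:3]:
    print(row["DateTime"], row["Temperature"])
```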
## Displaying Air Quality data on a web page
https://jquery.com/
https://datatables.net/
templates/AirQ_list.html
```html=
<!DOCTYPE html>
<html>
<head>
    <title>Air Quality</title>
    <link href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.min.css" rel="stylesheet">
    <style>
        main {
            margin-top: 30px;
        }
    </style>
</head>
<body>
    <main>
        <table id="example" class="display"></table>
    </main>
    <script src="https://code.jquery.com/jquery-3.4.1.min.js"></script>
    <script src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js"></script>
    <script>
    $.ajax({
        url: "https://opendata.epa.gov.tw/api/v1/AQI?format=json",
        success: function(res){
            var dataSet = new Array();
            for(var i = 0; i < res.length; i++){
                dataSet.push([
                    res[i]["SiteId"],
                    res[i]["County"],
                    res[i]["SiteName"],
                    res[i]["AQI"],
                    res[i]["SO2"],
                    res[i]["CO"],
                    res[i]["O3"],
                    res[i]["PM10"],
                    res[i]["PM2.5"],
                    res[i]["NO2"],
                    res[i]["NOx"],
                    res[i]["WindSpeed"],
                    res[i]["WindDirec"],
                    res[i]["PublishTime"],
                ]);
            }
            $('#example').DataTable({
                data: dataSet,
                columns: [
                    { title: "SiteId" },
                    { title: "County" },
                    { title: "SiteName" },
                    { title: "AQI" },
                    { title: "SO2" },
                    { title: "CO" },
                    { title: "O3" },
                    { title: "PM10" },
                    { title: "PM2.5" },
                    { title: "NO2" },
                    { title: "NOx" },
                    { title: "WindSpeed" },
                    { title: "WindDirec" },
                    { title: "PublishTime" },
                ]
            });
        }
    });
    </script>
</body>
</html>
```
weather/urls.py
```
path('AirQ_list',views.AirQ_list),
```
weather/views.py
```python=
from django.shortcuts import render

def AirQ_list(request):
    response = render(request, 'AirQ_list.html', {})
    return response
```
{"metaMigratedAt":"2023-06-15T05:35:06.163Z","metaMigratedFrom":"Content","title":"Week3-爬蟲","breaks":true,"contributors":"[{\"id\":\"c37d62b7-e70a-4621-b8e5-339e2cb07dad\",\"add\":14016,\"del\":725}]"}