# Week 3: Web Crawler
###### tags: `守恩專案`
---
## Installing MariaDB
1. Ubuntu:
https://linuxize.com/post/how-to-install-mariadb-on-ubuntu-18-04/
2. Windows:
https://downloads.mariadb.org/
3. Start and stop the service (on Windows, run Command Prompt as administrator):
`net start mariadb`
`net stop mariadb`
`mysql -uroot -p`
Then log in and set up the DB :arrow_right:
```sql=
/* Create the database */
CREATE DATABASE iot;
/* Create the user */
CREATE USER `iotuser`@`localhost` IDENTIFIED BY 'iotuser1234';
/* Grant privileges */
GRANT ALL PRIVILEGES ON iot.* TO `iotuser`@`localhost`;
/* Reload the privilege tables */
FLUSH PRIVILEGES;
```
## DB software (DB visualization)
1. Install
https://dbeaver.io/
2. Connect Django to MariaDB (activate the virtual environment first)
`pip install pymysql` :arrow_right: connect to the database from Python (a quick connection test follows the note below)
:::info
:bookmark: **Django**
1. A web framework
2. Python back end
:::
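To sanity-check the new account before touching Django, here is a minimal PyMySQL connection test (a sketch; run it inside the virtualenv, using the credentials created above):
```python=
import pymysql

# Smoke test: connect with the account created earlier
conn = pymysql.connect(user="iotuser", password="iotuser1234",
                       host="127.0.0.1", database="iot")
with conn.cursor() as cursor:
    cursor.execute("SELECT VERSION()")
    print(cursor.fetchone())  # e.g. ('10.4.x-MariaDB',)
conn.close()
```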
## Connecting
1. Add the following to `weather_web/weather_web/__init__.py`:
```python=
import pymysql
# The built-in MySQL support is buggy in our setup; this line makes
# PyMySQL stand in as MySQLdb so Django can talk to MySQL/MariaDB
pymysql.install_as_MySQLdb()
```
:::info
:bookmark: `__init__`
Code in `__init__.py` runs when the package is imported, so setup like this takes effect early. (The similarly named `__init__` method on a class defines the initialization done right after an instance is created, and receives the data required at construction time.)
:::
2. List Django's install location
`pip show django`
3. In /user/megan/.virtualenvs/project-y_bGZKJJ/Lib/site-packages/django/db/backends/mysql, modify:
* base.py (on Windows, comment out lines 36 and 37):
```python=
# if version < (1, 3, 13):
#     raise ImproperlyConfigured('mysqlclient 1.3.13 or newer is required; you have %s.' % Database.__version__)
```
:::warning
Recent Django releases require a newer mysqlclient than our PyMySQL shim reports, so this version check has to be disabled.
:::
* settings.py
```python=
DATABASES = {
    'default': {
        # 'ENGINE': 'django.db.backends.sqlite3',
        # 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'iot',
        'USER': 'iotuser',
        'PASSWORD': 'iotuser1234',
        'HOST': '127.0.0.1',
        # 'PORT': '3306',  # default
    }
}
```
:::warning
The default sqlite3 entry is Django's built-in lightweight database; comment it out and add the MariaDB settings instead.
:::
## Nginx
:::info
:bookmark: Nginx
1. Similar to Apache
2. Acts like a mail clerk (think of a post office): it forwards incoming packets (the letters) to the destination IP (the street address)
:::
1. Install
Windows:
http://nginx.org
2. From a terminal in the nginx folder:
`start nginx`
Check that it is running:
`tasklist /fi "imagename eq nginx.exe"`
Graceful shutdown:
`nginx -s quit`
Force kill :arrow_right: `taskkill /f /t /im nginx.exe`
3. conf/nginx.conf
```nginx=
location / {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8000;
}
```
:::warning
1. The `proxy_pass` line forwards requests arriving at localhost to 127.0.0.1:8000
2. You can add multiple `location` blocks to split one domain across several back ends
EX:
```nginx=
location /web1/ {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8000;
}
location /web2/ {
    # root   html;
    # index  index.html index.htm;
    proxy_pass http://127.0.0.1:8080;
}
```
:::
4. Edit settings.py
`ALLOWED_HOSTS = ['*']`
:::warning
No host filtering: requests from any host are accepted (handy for development, unsafe in production)
:::
:::info
:pencil: **To reach someone else's site**
1. The other machine must be on the same network as you
2. Run `ipconfig` and note the IPv4 address
3. In your own address bar, enter: <their IPv4>:<port>
4. Their page should pop up
:::
5. opendata
https://data.gov.tw/
https://opendata.epa.gov.tw/api/v1/AQI?format=json
Chrome extension: JSONView (pretty-prints JSON in the browser)
## Web Crawler
### Pulling open data into the DB
1. Create the `environment` table in DBeaver (a creation sketch follows the column list)
2. Column names:
```
SiteName
AQI
PM10
PM2p5
WindSpeed
WindDirec
PublishTime
```
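If you would rather create the table from code than click through DBeaver, here is a minimal sketch; the column types are assumptions (the EPA API returns every field as a string), so adjust them to taste:
```python=
import pymysql

# Assumed schema: VARCHAR everywhere because the API delivers strings;
# PublishTime could equally be a DATETIME column.
ddl = """
CREATE TABLE IF NOT EXISTS environment (
    SiteName    VARCHAR(50),
    AQI         VARCHAR(10),
    PM10        VARCHAR(10),
    PM2p5       VARCHAR(10),
    WindSpeed   VARCHAR(10),
    WindDirec   VARCHAR(10),
    PublishTime VARCHAR(30)
)
"""
conn = pymysql.connect(user="iotuser", password="iotuser1234",
                       host="127.0.0.1", database="iot")
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.close()
```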
3. Install requests
`pip install requests`
:::info
:bookmark: **requests (a handy crawling tool)**
1. Download web data with Python
2. Sends HTTP requests to fetch the specified resource from a web server
3. Reference: https://blog.gtwang.org/programming/python-requests-module-tutorial/
:::
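Before writing the full importer, a quick check that the API responds (a sketch):
```python=
import requests

url = "https://opendata.epa.gov.tw/api/v1/AQI?format=json"
response = requests.get(url)
print(response.status_code)  # expect 200
print(response.json()[0])    # the first monitoring-site record
```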
4. Add testAPI.py
```python=
import requests, json, pymysql

url = "https://opendata.epa.gov.tw/api/v1/AQI?format=json"
# Download the page with a plain GET request
response = requests.get(url)
# Parse the response body as JSON and keep it in objs
objs = json.loads(response.text)
# Connect to the database
conn = pymysql.connect(user="iotuser", password="iotuser1234", host="127.0.0.1", database="iot")
# cursor() returns a cursor object for executing SQL
cursor = conn.cursor()
sql = "INSERT INTO environment (SiteName,AQI,PM10,PM2p5,WindSpeed,WindDirec,PublishTime) VALUES (%s,%s,%s,%s,%s,%s,%s)"
for obj in objs:
    try:
        data = [
            obj["SiteName"],
            obj["AQI"],
            obj["PM10"],
            obj["PM2.5"],
            obj["WindSpeed"],
            obj["WindDirec"],
            obj["PublishTime"]
        ]
        # Stage the row (think: git commit)
        cursor.execute(sql, data)
        # Write it out to the database (think: git push)
        conn.commit()
    except Exception:
        # Skip malformed records
        pass
conn.close()
```
Result: the `environment` table fills with one row per monitoring site (screenshot omitted).
### Yahoo weather forecast
https://tw.news.yahoo.com/weather/台灣/埔里鎮/埔里鎮-2306205
1. Add to models.py:
```python=
class Forecast(models.Model):
    class Meta:
        db_table = 'forecast'
        verbose_name = '天氣預測'
        verbose_name_plural = '天氣預測'
    id = models.AutoField(primary_key=True)
    DateTime = models.DateTimeField()
    Temperature = models.FloatField()
    Precipitation = models.FloatField()
    WindSpeed = models.FloatField()
    WindDirect = models.FloatField()
    Location = models.CharField(max_length=200, null=False)
    RecordTime = models.DateTimeField()
```
:::info
1. models.py defines the data models (everything database-related)
2. admin.py handles the admin back end
3. Reference:
https://ivanjo39191.pixnet.net/blog/post/144744012-python-django-%E5%AD%B8%E7%BF%92%E7%B4%80%E9%8C%84%28%E5%9B%9B%29-%E8%B3%87%E6%96%99%E5%BA%AB%E5%8F%8A%E5%BE%8C%E5%8F%B0%E7%AE%A1%E7%90%86
:::
2. Run
```
python manage.py makemigrations
```
:::info
1. Run this command after you change models.py
2. It creates a migrations directory under the app and records all of your models.py changes, but nothing has been applied to the database yet
3. Generates the migration files
4. Reference: https://blog.csdn.net/liuweiyuxiang/article/details/71150965
:::
```
python manage.py migrate
```
:::info
Applies the recorded changes to the database
:::
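To confirm the migration worked, you can insert a test row from the Django shell (a sketch; the field values are made up):
```python=
# Run inside `python manage.py shell`
from datetime import datetime
from weather.models import Forecast

row = Forecast(DateTime=datetime.now(), Temperature=25.0,
               Precipitation=0.0, WindSpeed=1.5, WindDirect=90.0,
               Location="埔里", RecordTime=datetime.now())
row.save()
print(Forecast.objects.count())  # should now be at least 1
```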
3. Add to admin.py:
```python=
from django.contrib import admin
from .models import Forecast

class ForecastAdmin(admin.ModelAdmin):
    list_display = ['id','DateTime','Temperature','Precipitation','WindSpeed','WindDirect','Location','RecordTime']

# admin.site.register lets the admin UI add, view, edit, and delete the model's data
admin.site.register(Forecast, ForecastAdmin)
```
4. Install
`pip install beautifulsoup4 selenium`
:::info
1. Python library modules
2. With only a tiny amount of code you can quickly parse a page's HTML and extract the data you are interested in
3. How it works:
It reads the HTML source, parses it automatically, and builds a BeautifulSoup object containing the whole HTML document tree; with that tree in hand, any data of interest is easy to find
:::
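A tiny illustration of that flow, on a made-up snippet shaped like the Yahoo page (not the real markup):
```python=
from bs4 import BeautifulSoup

html = '<ul class="hourly"><li><span class="D(ib)">1 PM</span></li></ul>'
soup = BeautifulSoup(html, 'html.parser')        # parse into a document tree
ul = soup.find('ul', attrs={'class': 'hourly'})  # locate the hourly list
for li in ul.find_all('li'):
    print(li.find('span', attrs={'class': 'D(ib)'}).text)  # -> 1 PM
```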
5. Download ChromeDriver
https://chromedriver.chromium.org/
:::info
1. Drives the browser for automated testing of web applications; the version you install must match your Chrome version
2. Selenium needs ChromeDriver to control the Chrome browser
3. Reference:
https://medium.com/@bob800530/selenium-1-%E9%96%8B%E5%95%9Fchrome%E7%80%8F%E8%A6%BD%E5%99%A8-21448980dff9
:::
6. Add testCr.py to crawl the data you want
:::info
:key: The .py file must sit in the same folder as the driver
:::
```python=
# Selenium drives the browser the way a human would
import requests, time
from datetime import datetime, timedelta
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

url = "https://tw.news.yahoo.com/weather/台灣/埔里鎮/埔里鎮-2306205"
# Open a Chrome browser
driver = webdriver.Chrome()
driver.get(url)
# Wait for the page to finish loading
time.sleep(5)
# Find the unit <select> element (inspect the page with F12) by matching its classes
select = Select(driver.find_element_by_xpath("//select[contains(@class, 'Fz(14px)')][contains(@class, 'Bdrs(2px)')][contains(@class, 'Cur(p)')]"))
# temperature (the default view)
html_source1 = driver.page_source
# precipitation
select.select_by_value("precipitation")
time.sleep(3)
html_source2 = driver.page_source
# wind
select.select_by_value("wind")
time.sleep(3)
html_source3 = driver.page_source
driver.close()

time_list = []
temp_list = []
precip_list = []
WD_list = []
WS_list = []
date_ = str(datetime.now()).split(" ")[0]

# html.parser is BeautifulSoup's built-in HTML parser
result1 = BeautifulSoup(html_source1, 'html.parser')
# Grab the list whose class is "hourly" and walk its items
hourly_ul = result1.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    tmp = li.find_all('span', attrs={'class': 'D(ib)'})[0].text.split(" ")
    clock = int(tmp[0])
    # Convert "h AM/PM" to 24-hour "HH:00:00"; 12 AM rolls the date to the next day
    if tmp[1] == "AM":
        if clock == 12:
            t = "00:00:00"
            date_ = str(datetime.now()+timedelta(days=1)).split(" ")[0]
        else:
            if clock < 10:
                t = "0"+str(clock)+":00:00"
            else:
                t = str(clock)+":00:00"
    else:
        if clock == 12:
            t = "12:00:00"
        else:
            t = str(clock+12)+":00:00"
    time_list.append(date_+" "+t)
    temperature = li.find_all('span', attrs={'class': 'D(ib)'})[1].text[:-1]
    temp_list.append(temperature)

result2 = BeautifulSoup(html_source2, 'html.parser')
hourly_ul = result2.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    precipitation = li.find_all('span', attrs={'class': 'D(ib)'})[1].text[:-1]
    precip_list.append(precipitation)

# Wind direction is the angle the wind blows from
result3 = BeautifulSoup(html_source3, 'html.parser')
hourly_ul = result3.find('ul', attrs={'class': 'hourly'})
for li in hourly_ul.find_all("li"):
    WS_span = li.find_all('span', attrs={'class': 'D(ib)'})[1]
    WS_obj = WS_span.find('span', attrs={'class': 'D(ib)'})
    WS_list.append(WS_span["title"].split(" ")[0])
    # Pull the wind-direction angle out of the CSS rotate(...) value
    start_index = WS_obj["style"].find("rotate(")+7
    end_index = WS_obj["style"].find("deg)")
    WD_list.append(WS_obj["style"][start_index:end_index])

RecordTime = str(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
url = "http://127.0.0.1:8000/weather/insert_data"
for i in range(len(time_list)):
    data = {
        "DateTime" : time_list[i],
        "Temperature" : temp_list[i],
        "Precipitation" : precip_list[i],
        "WindDirect" : WD_list[i],
        "WindSpeed" : WS_list[i],
        "Location" : "埔里",
        "RecordTime" : RecordTime
    }
    response = requests.post(url, data=data)
    print(response.status_code, response.text)
```
7. Add to weather/views.py:
:::warning
CSRF is a security check
Adding @csrf_exempt above a function turns that check off for it, so the crawler can POST without a CSRF token
:::
```python=
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from .models import Forecast

@csrf_exempt
def insert_data(request):
    DateTime = request.POST.get("DateTime")
    Temperature = request.POST.get("Temperature")
    Precipitation = request.POST.get("Precipitation")
    WindDirect = request.POST.get("WindDirect")
    WindSpeed = request.POST.get("WindSpeed")
    Location = request.POST.get("Location")
    RecordTime = request.POST.get("RecordTime")
    tmp = Forecast()
    tmp.DateTime = DateTime
    tmp.Temperature = Temperature
    tmp.Precipitation = Precipitation
    tmp.WindDirect = WindDirect
    tmp.WindSpeed = WindSpeed
    tmp.Location = Location
    tmp.RecordTime = RecordTime
    tmp.save()
    response = HttpResponse("GET!")
    return response
```
8. Add to weather/urls.py:
```
path('insert_data',views.insert_data),
```
Django queryset field lookups (example below):
* `__gt` greater than
* `__gte` greater than or equal
* `__lt` less than
* `__lte` less than or equal
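For example, a sketch of how these lookups filter the `forecast` table (assuming some rows already exist):
```python=
from datetime import datetime
from weather.models import Forecast

# Forecasts later than right now
future = Forecast.objects.filter(DateTime__gt=datetime.now())
# Forecasts of 30 degrees or hotter
hot = Forecast.objects.filter(Temperature__gte=30)
print(future.count(), hot.count())
```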
weather/views.py
:::info
Build your own API
:::
```python=
from django.http import JsonResponse
from datetime import datetime
from .models import Forecast

def get_forecast_value(request):
    objs = Forecast.objects.filter(DateTime__gte=datetime.now())
    data = []
    for obj in objs:
        data.append({
            "DateTime" : obj.DateTime,
            "Temperature" : obj.Temperature,
            "Precipitation" : obj.Precipitation,
            "WindDirect" : obj.WindDirect,
            "WindSpeed" : obj.WindSpeed,
            "Location" : obj.Location,
            "RecordTime" : obj.RecordTime
        })
    # safe=False lets JsonResponse serialize a list instead of a dict
    response = JsonResponse(data, safe=False)
    return response
```
weather/urls.py
```
urlpatterns = [
    ...
    path('get_forecast_value',views.get_forecast_value),
    ...
]
```
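A quick smoke test of the new endpoint (a sketch; assumes the dev server is running and, as in testCr.py, the app is mounted under /weather/):
```python=
import requests

resp = requests.get("http://127.0.0.1:8000/weather/get_forecast_value")
print(resp.status_code)  # expect 200
for row in resp.json()[:3]:
    print(row["DateTime"], row["Temperature"])
```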
## Displaying Air Quality data on a web page
https://jquery.com/
https://datatables.net/
templates/AirQ_list.html
```html=
<!DOCTYPE html>
<html>
<head>
    <title>Air Quality</title>
    <link href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.min.css" rel="stylesheet">
    <style>
        main {
            margin-top: 30px;
        }
    </style>
</head>
<body>
    <main>
        <table id="example" class="display"></table>
    </main>
    <script src="https://code.jquery.com/jquery-3.4.1.min.js"></script>
    <script src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js"></script>
    <script>
    $.ajax({
        url: "https://opendata.epa.gov.tw/api/v1/AQI?format=json",
        success: function(res){
            var dataSet = new Array();
            for(var i = 0; i < res.length; i++){
                dataSet.push([
                    res[i]["SiteId"],
                    res[i]["County"],
                    res[i]["SiteName"],
                    res[i]["AQI"],
                    res[i]["SO2"],
                    res[i]["CO"],
                    res[i]["O3"],
                    res[i]["PM10"],
                    res[i]["PM2.5"],
                    res[i]["NO2"],
                    res[i]["NOx"],
                    res[i]["WindSpeed"],
                    res[i]["WindDirec"],
                    res[i]["PublishTime"],
                ]);
            }
            $('#example').DataTable({
                data: dataSet,
                columns: [
                    { title: "SiteId" },
                    { title: "County" },
                    { title: "SiteName" },
                    { title: "AQI" },
                    { title: "SO2" },
                    { title: "CO" },
                    { title: "O3" },
                    { title: "PM10" },
                    { title: "PM2.5" },
                    { title: "NO2" },
                    { title: "NOx" },
                    { title: "WindSpeed" },
                    { title: "WindDirec" },
                    { title: "PublishTime" },
                ]
            });
        }
    });
    </script>
</body>
</html>
```
weather/urls.py
```
path('AirQ_list',views.AirQ_list),
```
weather/views.py
```python=
from django.shortcuts import render

def AirQ_list(request):
    response = render(request, 'AirQ_list.html', {})
    return response
```
{"metaMigratedAt":"2023-06-15T05:35:06.163Z","metaMigratedFrom":"Content","title":"Week3-爬蟲","breaks":true,"contributors":"[{\"id\":\"c37d62b7-e70a-4621-b8e5-339e2cb07dad\",\"add\":14016,\"del\":725}]"}