# robots.txt

A robots.txt file tells search engine crawlers which URLs on a site they may access. It is mainly used to keep the site from being overloaded with requests. Note that robots.txt is only a convention: if you want to keep private content off limits, do not rely on robots.txt.

#### Deployment

Place the robots.txt file directly in the root directory of the website.

#### Useful rules

1. Disallow crawling of the entire site
```
User-agent: *
Disallow: /
```
- User-agent: the client making the request, e.g. a crawler or a browser
- Disallow: the paths that must not be crawled

2. Disallow crawling of specific directories and their contents
```
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
```

3. Allow only a single crawler to access the site
```
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
```
Only Googlebot-news may crawl the entire site.

4. Allow all crawlers to access the site, except one
```
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /
```
Unnecessarybot may not crawl the site, but every other crawler may.

5. Disallow crawling of a single page
```
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html
```
This blocks crawling of the useless_file.html page at https://example.com/useless_file.html, as well as other_useless_file.html in the junk directory.

6. Block Google Images from crawling any image on the site
```
User-agent: Googlebot-Image
Disallow: /
```

7. Disallow crawling of a specific file type
```
User-agent: Googlebot
Disallow: /*.gif$
```
For example, this blocks crawling of all .gif files.

# Python & MySQL

1. Install the MySQL server
```
$ sudo apt update && sudo apt install mysql-server
```

2. Install pymysql
```
$ sudo apt-get install python3-pip
```
```
$ pip3 install pymysql
```

3. Test the installation
```
$ sudo service mysql status
```
Output:
```
● mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-09-08 07:16:49 UTC; 18s ago
   Main PID: 32713 (mysqld)
     Status: "Server is operational"
      Tasks: 38 (limit: 4705)
     Memory: 356.6M
     CGroup: /system.slice/mysql.service
             └─32713 /usr/sbin/mysqld

Sep 08 07:16:48 ubuntu-20 systemd[1]: Starting MySQL Community Server...
Sep 08 07:16:49 ubuntu-20 systemd[1]: Started MySQL Community Server.
```
- Confirm that the server is listening in the background
```
$ sudo netstat -ntlp | grep mysql
```

### Set the MySQL root password
```
$ sudo mysql -u root
```
```
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'PASSWORD';
```
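With the server running and pymysql installed, a connection can be opened from Python. Below is a minimal sketch, assuming a local server and the root account with the password set above ('PASSWORD' is a placeholder); it simply prints the server version to confirm the connection works.

```python!
import pymysql

# Connect to the local MySQL server.
# 'PASSWORD' is a placeholder for the root password set above.
conn = pymysql.connect(host="localhost", user="root", password="PASSWORD")

try:
    with conn.cursor() as cursor:
        # Run a trivial query to confirm the connection works
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())
finally:
    conn.close()
```

The same connection object can later be reused for CREATE, INSERT, or SELECT statements via cursor.execute().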
## BeautifulSoup

Install the package
```
$ pip3 install beautifulsoup4
```

### 1. get

Example:
- Given the following file (test.html):
```html!
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```
Use the find_all and get methods to print exactly the content shown in the Output below.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# Collect every <a> tag, then use get() to read the href attribute
# and keep only the link that points to Elsie's page
result = soup.find_all("a")
for obj in result:
    if obj.get("href") == "http://example.com/elsie":
        print(obj.text)

fileInput.close()
```
- Output:
```
Elsie
```
![image](https://hackmd.io/_uploads/HywWxbyRT.png)

### 2. find_all

Example: using test.html as input, find the a tags whose text is Elsie or Lacie.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# Passing a list of strings to find_all() returns the <a> tags
# whose text is either "Elsie" or "Lacie"
result = soup.find_all("a", string=["Elsie", "Lacie"])
print(result)

fileInput.close()
```
- Output:
```
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
![image](https://hackmd.io/_uploads/SyNLgW10p.png)

### 3. find

Example: test2.html
```html!
<!DOCTYPE html>
<html>
<head>
    <title>時間顯示</title>
</head>
<body>
    <div>
        <p>Paragraph 1</p>
        <div>
            <p>Paragraph 2</p>
            <div>
                <p>21:00</p>
            </div>
            <p>Paragraph 3</p>
        </div>
        <div>
            <p>台灣時間:</p>
        </div>
    </div>
</body>
</html>
```
- Based on this HTML, print the result shown in the Output.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test2.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# The outermost <div> wraps everything; its first nested <div>
# (found with findChild) contains the clock value, and the <div>
# that follows it (found with findNextSibling) contains the label
outerDiv = soup.find("div")
middleDiv = outerDiv.findChild("div")

# "台灣時間:" is the <p> inside the sibling <div>
label = middleDiv.findNextSibling("div").findChild("p")
print(label.text)

# "21:00" is the <p> inside the <div> nested in middleDiv
clock = middleDiv.findChild("div").findChild("p")
print(clock.text)

fileInput.close()
```
Output:
```
台灣時間:
21:00
```
![image](https://hackmd.io/_uploads/SkaS-bk0T.png)
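The two tools covered above can also be combined: BeautifulSoup extracts the data and pymysql stores it. The sketch below is only an illustration; it assumes the test.html file from the examples, the local root account from the MySQL section (the password is a placeholder), and a hypothetical database `test` in which it creates a `links` table.

```python!
import pymysql
from bs4 import BeautifulSoup

# Parse the same test.html used in the examples above
with open("test.html", "r") as fileInput:
    soup = BeautifulSoup(fileInput, "html.parser")

# Placeholder credentials and database name; adjust to your setup
conn = pymysql.connect(host="localhost", user="root",
                       password="PASSWORD", database="test")

try:
    with conn.cursor() as cursor:
        # A hypothetical table for the scraped links
        cursor.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(name VARCHAR(50), href VARCHAR(255))"
        )
        # Store each a tag's text and href attribute
        for tag in soup.find_all("a"):
            cursor.execute(
                "INSERT INTO links (name, href) VALUES (%s, %s)",
                (tag.text, tag.get("href")),
            )
    conn.commit()
finally:
    conn.close()
```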