# robots.txt

A robots.txt file tells search engine crawlers which URLs on a site they may access. It is mainly used to keep the site from being overloaded with requests. Note that robots.txt is only a convention: if you want to keep private content off limits, do not rely on robots.txt.

#### Deployment

Place the robots.txt file directly in the root directory of the website.

#### Useful rules

1. Disallow crawling of the entire site
```
User-agent: *
Disallow: /
```
- User-agent: the client making the request, e.g. a crawler or a browser
- Disallow: the paths that must not be crawled

2. Disallow crawling of specific directories and their contents
```
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
```

3. Allow only a single crawler to access the site
```
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
```
Only Googlebot-news may crawl the entire site.

4. Allow all crawlers to access the site, except one
```
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /
```
Unnecessarybot may not crawl the site, but every other crawler may.

5. Disallow crawling of a single page
```
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html
```
This blocks crawling of the useless_file.html page at https://example.com/useless_file.html, as well as other_useless_file.html in the junk directory.

6. Block Google Images from crawling any image on the site
```
User-agent: Googlebot-Image
Disallow: /
```

7. Disallow crawling of a specific file type
```
User-agent: Googlebot
Disallow: /*.gif$
```
For example, this blocks crawling of all .gif files.

# Python & MySQL

1. Install the MySQL server
```
$ sudo apt update && sudo apt install mysql-server
```

2. Install pymysql
```
$ sudo apt-get install python3-pip
```
```
$ pip3 install pymysql
```

3. Test the installation
```
$ sudo service mysql status
```
Output:
```
● mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-09-08 07:16:49 UTC; 18s ago
   Main PID: 32713 (mysqld)
     Status: "Server is operational"
      Tasks: 38 (limit: 4705)
     Memory: 356.6M
     CGroup: /system.slice/mysql.service
             └─32713 /usr/sbin/mysqld

Sep 08 07:16:48 ubuntu-20 systemd[1]: Starting MySQL Community Server...
Sep 08 07:16:49 ubuntu-20 systemd[1]: Started MySQL Community Server.
```
- Confirm that the server is listening in the background
```
$ sudo netstat -ntlp | grep mysql
```

### Set the MySQL root password
```
$ sudo mysql -u root
```
```
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'PASSWORD';
```
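With the server running and pymysql installed, a connection can be opened from Python. Below is a minimal sketch, assuming a local server and the root account with the password set above ('PASSWORD' is a placeholder); it simply prints the server version to confirm the connection works.

```python!
import pymysql

# Connect to the local MySQL server.
# 'PASSWORD' is a placeholder for the root password set above.
conn = pymysql.connect(host="localhost", user="root", password="PASSWORD")

try:
    with conn.cursor() as cursor:
        # Run a trivial query to confirm the connection works
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())
finally:
    conn.close()
```

The same connection object can later be reused for CREATE, INSERT, or SELECT statements via cursor.execute().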
## BeautifulSoup

Install the package
```
$ pip3 install beautifulsoup4
```

### 1. get

Example:
- Given the following file (test.html):
```html!
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```
Use the find_all and get methods to print exactly the content shown in the Output below.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# Collect every <a> tag, then use get() to read the href attribute
# and keep only the link that points to Elsie's page
result = soup.find_all("a")
for obj in result:
    if obj.get("href") == "http://example.com/elsie":
        print(obj.text)

fileInput.close()
```
- Output:
```
Elsie
```
![image](https://hackmd.io/_uploads/HywWxbyRT.png)

### 2. find_all

Example: using test.html as input, find the a tags whose text is Elsie or Lacie.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# Passing a list of strings to find_all() returns the <a> tags
# whose text is either "Elsie" or "Lacie"
result = soup.find_all("a", string=["Elsie", "Lacie"])
print(result)

fileInput.close()
```
- Output:
```
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
![image](https://hackmd.io/_uploads/SyNLgW10p.png)

### 3. find

Example: test2.html
```html!
<!DOCTYPE html>
<html>
<head>
    <title>時間顯示</title>
</head>
<body>
    <div>
        <p>Paragraph 1</p>
        <div>
            <p>Paragraph 2</p>
            <div>
                <p>21:00</p>
            </div>
            <p>Paragraph 3</p>
        </div>
        <div>
            <p>台灣時間:</p>
        </div>
    </div>
</body>
</html>
```
- Based on this HTML, print the result shown in the Output.
```python!
from bs4 import BeautifulSoup

# Parse the local HTML file
fileInput = open("test2.html", "r")
soup = BeautifulSoup(fileInput, "html.parser")

# The outermost <div> wraps everything; its first nested <div>
# (found with findChild) contains the clock value, and the <div>
# that follows it (found with findNextSibling) contains the label
outerDiv = soup.find("div")
middleDiv = outerDiv.findChild("div")

# "台灣時間:" is the <p> inside the sibling <div>
label = middleDiv.findNextSibling("div").findChild("p")
print(label.text)

# "21:00" is the <p> inside the <div> nested in middleDiv
clock = middleDiv.findChild("div").findChild("p")
print(clock.text)

fileInput.close()
```
Output:
```
台灣時間:
21:00
```
![image](https://hackmd.io/_uploads/SkaS-bk0T.png)
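The two tools covered above can also be combined: BeautifulSoup extracts the data and pymysql stores it. The sketch below is only an illustration; it assumes the test.html file from the examples, the local root account from the MySQL section (the password is a placeholder), and a hypothetical database `test` in which it creates a `links` table.

```python!
import pymysql
from bs4 import BeautifulSoup

# Parse the same test.html used in the examples above
with open("test.html", "r") as fileInput:
    soup = BeautifulSoup(fileInput, "html.parser")

# Placeholder credentials and database name; adjust to your setup
conn = pymysql.connect(host="localhost", user="root",
                       password="PASSWORD", database="test")

try:
    with conn.cursor() as cursor:
        # A hypothetical table for the scraped links
        cursor.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(name VARCHAR(50), href VARCHAR(255))"
        )
        # Store each a tag's text and href attribute
        for tag in soup.find_all("a"):
            cursor.execute(
                "INSERT INTO links (name, href) VALUES (%s, %s)",
                (tag.text, tag.get("href")),
            )
    conn.commit()
finally:
    conn.close()
```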