<style> .new { color: red; font-weight: bold; } </style>

# Python requests

###### tags: `presentation` `sprout`

資訊之芽 熊育霆 2022/05/22

---

Please clone this repo first — we'll use it later:
https://github.com/bearomorphism/sprout-request

```bash=
git clone https://github.com/bearomorphism/sprout-request.git
```

If you want to commit or push to it as your own project, remember to fork it first.

---

## Crawlers

What is a crawler?

* [Web crawler (Wikipedia)](https://zh.wikipedia.org/wiki/%E7%B6%B2%E8%B7%AF%E7%88%AC%E8%9F%B2)

----

### What can crawlers do?

* The web is full of pages containing very valuable knowledge
* https://userinyerface.com/
* For example, this one: https://longdogechallenge.com/
* And this one: http://www.partridgegetslucky.com/
* And this: https://thatsthefinger.com/
* We could extract the contents of these pages by hand

----

* But there is far too much valuable knowledge to copy page by page by hand
* For example: http://eelslap.com/
* And: https://matias.ma/nsfw/
* There are plenty more here: https://theuselessweb.com/
* Most importantly, we must not forget our friend who works so hard serving hot pot: https://www.youtube.com/watch?v=dQw4w9WgXcQ

Could we write a program to collect all of this extremely valuable information for us?

----

(The two slides above were copied from last year's deck.)

----

### Things crawlers are used for

* Serious: big-data analysis, statistics
* Not so serious:
    * Auto-downloading memes
    * Ticket-grabbing bots
    * Downloading comics
    * Downloading all sorts of magical things
    * We are not interested in your personal tastes

Feel free to share the crawler bots you write on GitHub and Discord.

----

You can also find bots other people have already written online and learn from them.

![](https://i.imgur.com/904KocZ.png)

----

Crawlers are much easier to write once we combine this with BS4 and Selenium, which we'll cover next week.
Think of today's class as an introduction(?

---

## Set up

[Install python requests](https://docs.python-requests.org/en/latest/user/install/#install)

Just like installing any other package — paste the line below into your terminal (the little black window):

```bash
pip install requests
```

----

![](https://i.smalljoys.me/2018/08/screen-shot-2018-08-27-at-4-48-11-pm.png)

---

## Recall: what are requests?

According to the [IBM Documentation](https://www.ibm.com/docs/en/cics-ts/5.3?topic=protocol-http-requests)

> An HTTP request is made by a client, to a named host, which is located on a server. The aim of the request is to access a resource on the server.

----

In plainer words (loosely paraphrased — forgive my rough translation):

> An **HTTP request** is made by a **client** and sent to a **host** that lives on a **server**. The goal of the request is to access a **resource** on that server.

----

Terminology mini-classroom (English ↔ Chinese)

* HTTP request: HTTP 請求
* client: 用戶端/客戶
* server: 伺服器
* host: 主機
* resource: 資源

----

While you browse the web, you are constantly asking servers for data, such as HTML. A web page is just your browser turning that HTML into something humans can read.

If you have way too much free time, take a look at [this article](https://github.com/alex/what-happens-when).

----

Recall: [HTTP request methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)

We covered this last week as well. In short:

* GET fetches data from the server
* POST sends data to the server

----

That said, you can also use GET to send data to the server and POST to fetch data from it — strange, right?

If you clicked through to the documentation on the previous slide, you'll have noticed there are other HTTP request methods, but in this class we will only use GET and POST.

----

Questions?

---

## Python Requests Module

If you prefer reading the official documentation, it's all here:

[English version](https://docs.python-requests.org/en/latest/)
[Chinese version](https://docs.python-requests.org/zh_CN/latest/)

----

Requests is an elegant and simple HTTP library for Python, built for human beings.

(The Chinese docs put it this way: Requests is the only Non-GMO Python HTTP library, safe for human consumption.)

---

## How to use `requests`

[xkcd](https://xkcd.com/) <- we'll use this in the practice

----

I recommend opening an ipython notebook and following along,
or open the repo you cloned earlier.

----

Set up

```python=
import requests

url = 'https://xkcd.com/'
res = requests.get(url)  # use GET to fetch the site's data
```

----

The variable is named `res`, short for response — simply put, it's what the server sends back to you after you make a request.

![](https://i.imgur.com/Y4hsjk7.png)

----

requests

![](https://i.imgur.com/rV8HgGz.png)

----

Print `res` and take a look:

```python=
import requests

url = 'https://xkcd.com/'
res = requests.get(url)
print(res)  # <Response [200]>
```

----

See which attributes `res` has:

```python=
import requests

url = 'https://xkcd.com/353/'
res = requests.get(url)
print(dir(res))
```

```python
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
```

----

Read the built-in help:

```python
help(res)
```

---

### Peeking at the attributes of res

Run these yourself:

```python=
print(res.status_code)
print(res.text)
print(res.content)
print(res.ok)
print(res.url)
```

----

```python=
print(res.status_code)
print(res.text)     # content as text
print(res.content)  # binary content
print(res.ok)       # OK or not
print(res.url)
```
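----

Not every request succeeds. Here's a minimal sketch (not from the original slides) of how you might check a response before using it. `status_code`, `ok`, and `raise_for_status` all showed up in `dir(res)` above; `timeout` is a standard keyword argument of `requests.get`. The comic number below is made up on purpose so the request fails.

```python=
import requests

# A comic number that (presumably) doesn't exist, so we get an error page
url = 'https://xkcd.com/99999999/info.0.json'

# timeout=5: give up if the server doesn't answer within 5 seconds
res = requests.get(url, timeout=5)

print(res.status_code)  # e.g. 404 for a page that doesn't exist
print(res.ok)           # False when status_code is 400 or above

if res.ok:
    print(res.text)
else:
    print("Something went wrong:", res.status_code)

# Or raise an exception on failure instead of checking by hand
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTP error:", e)
```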
---

### Downloading images

On xkcd, the comic site we used above, every page lists a direct link to the image right below the comic:

https://imgs.xkcd.com/comics/python.png

Paste it straight into your browser's address bar and the image shows up.

----

This is what you'll see after pasting it into the address bar:

![](https://imgs.xkcd.com/comics/python.png)

----

Before downloading the image, let's meet a handy library: **[pathlib](https://docs.python.org/3/library/pathlib.html)**

```python=
# Create an images folder
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)
# Reference: https://stackoverflow.com/questions/273192/how-can-i-safely-create-a-nested-directory
```

----

Download the image with requests:

```python=
import requests

# Create an images folder
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)
# Reference: https://stackoverflow.com/questions/273192/how-can-i-safely-create-a-nested-directory

url = 'https://imgs.xkcd.com/comics/python.png'
res = requests.get(url)

with open(Path("./images/python.png"), "wb") as f:
    f.write(res.content)
```

---

## [API](https://zh.wikipedia.org/wiki/%E5%BA%94%E7%94%A8%E7%A8%8B%E5%BA%8F%E6%8E%A5%E5%8F%A3)

First, let's understand what an API is.

----

Explained with a single picture:

![](https://img-comment-fun.9cache.com/media/aAYBA2o/aqxMnkn4_700w_0.jpg)

----

* Frontend: the part you can see — code that runs on your machine (the client), e.g. the rendered web page
    * HTML, JS, CSS
    * Side note: [Is HTML a programming language?](https://www.google.com/search?q=Is+html+a+programming+language&oq=Is+html+a+programming+language&aqs=chrome..69i57.10829j0j7&sourceid=chrome&ie=UTF-8)
* Backend: the part you can't see — code that runs on the server, such as the database
    * It can be written in many other languages: Python, NodeJs, C/C++, PHP, Go ...
* API: the interface through which the frontend and backend talk to each other — roughly, a waiter/little elf that carries data back and forth between your computer and the server

----

xkcd's API documentation: https://xkcd.com/json.html

----

https://xkcd.com/614/info.0.json

```json=
{
    "month": "7",
    "num": 614,
    "link": "",
    "year": "2009",
    "news": "",
    "safe_title": "Woodpecker",
    "transcript": "[[A man with a beret and a woman are standing on a boardwalk, leaning on a handrail.]]\nMan: A woodpecker!\n<<Pop pop pop>>\nWoman: Yup.\n\n[[The woodpecker is banging its head against a tree.]]\nWoman: He hatched about this time last year.\n<<Pop pop pop pop>>\n\n[[The woman walks away. The man is still standing at the handrail.]]\n\nMan: ... woodpecker?\nMan: It's your birthday!\n\nMan: Did you know?\n\nMan: Did... did nobody tell you?\n\n[[The man stands, looking.]]\n\n[[The man walks away.]]\n\n[[There is a tree.]]\n\n[[The man approaches the tree with a present in a box, tied up with ribbon.]]\n\n[[The man sets the present down at the base of the tree and looks up.]]\n\n[[The man walks away.]]\n\n[[The present is sitting at the bottom of the tree.]]\n\n[[The woodpecker looks down at the present.]]\n\n[[The woodpecker sits on the present.]]\n\n[[The woodpecker pulls on the ribbon tying the present closed.]]\n\n((full width panel))\n[[The woodpecker is flying, with an electric drill dangling from its feet, held by the cord.]]\n\n{{Title text: If you don't have an extension cord I can get that too. Because we're friends! Right?}}",
    "alt": "If you don't have an extension cord I can get that too. Because we're friends! Right?",
    "img": "https://imgs.xkcd.com/comics/woodpecker.png",
    "title": "Woodpecker",
    "day": "24"
}
```

Look at the JSON above — one of the fields gives us the URL of the image.

----

Putting it together with requests:

```python=
import requests
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)

url = 'https://xkcd.com/614/info.0.json'
res = requests.get(url)
res_json = res.json()  # returns a dictionary
print(res_json)

img_url = res_json['img']
print(img_url)

img_name = img_url.split('/')[-1]
with open(Path(f"./images/{img_name}"), "wb") as f:
    img_res = requests.get(img_url)
    f.write(img_res.content)
```

----

You can play around with these two sites:

* [JSONPlaceholder - Free Fake REST API](https://jsonplaceholder.typicode.com/)
* [Lorem Picsum](https://picsum.photos/)

----

Most sites won't let you hit their API freely — you'll need an API key.
Since the setup is more involved (you have to apply for a token or an account), we'll come back to this at the end of class if there's time.
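----

Just so you've seen the shape of it, here is a rough sketch of how an API key is typically sent. The `/headers` endpoint on httpbin simply echoes back the headers you send; `YOUR_TOKEN` is a placeholder, and the exact header name differs between APIs (many use `Authorization: Bearer <token>`, others use a custom header or a query parameter — always check the docs of the API you're calling).

```python=
import requests

# Placeholder — a real API gives you a token when you register
token = "YOUR_TOKEN"

# Many APIs expect the key in the Authorization header
headers = {"Authorization": f"Bearer {token}"}

# httpbin.org/headers echoes the headers it received,
# so you can see exactly what the server would get
res = requests.get("https://httpbin.org/headers", headers=headers, timeout=5)
print(res.json())
```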
---

## Practice (5 min)

Open xkcd-api-practice.py and start practicing.

---

## Advanced usage

The rest is a bit of a grab bag; next week's crawler lesson may need it.

First, open this site: https://httpbin.org/

----

### GET query

[Query string - Wikipedia](https://en.wikipedia.org/wiki/Query_string)

----

Recall: the query string is the part after the question mark.

![](https://i.imgur.com/Ep2vvNN.png)

----

https://httpbin.org/get
https://httpbin.org/get?a=1&rick=roll

----

```python=
import requests

url = 'https://httpbin.org/get?a=1&rick=roll'
res = requests.get(url)
print(res.text)
```

----

A safer way to do it:

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}
url = 'https://httpbin.org/get'
res = requests.get(url, params=payload)
print(res.text)
print(res.url)  # https://httpbin.org/get?a=1&rick=roll
```

---

### POST

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}
res = requests.post('https://httpbin.org/post', data=payload)
print(res.text)
```

----

```json
{
  ...
  "form": {
    "a": "1",
    "rick": "roll"
  },
  ...
}
```
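----

One more POST variant you'll run into (not on the original slides): many modern APIs want a JSON body instead of form data. The `json=` parameter is a standard `requests` feature — it serializes the dict to JSON and sets the `Content-Type` header for you, and httpbin echoes it back under the `"json"` key.

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}

# data=payload -> sent as form data   (appears under "form" on httpbin)
# json=payload -> sent as a JSON body (appears under "json" on httpbin)
res = requests.post('https://httpbin.org/post', json=payload)
print(res.json()['json'])  # {'a': 1, 'rick': 'roll'}
```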
---

Any Questions?

![](https://i.imgur.com/w2xd7oj.png)

---

## Odds and ends

* [A LINE bot for 長輩圖 (elder-style greeting images), written by a friend from my department](https://github.com/superr0ng/LINEtools)
* [API List: A public list of free APIs for programmers](https://apilist.fun/)

---

## References

* [Python Requests Tutorial: Request Web Pages, Download Images, POST Data, Read JSON, and More - YouTube](https://www.youtube.com/watch?v=tb8gHvYlCFs&t=1266s&ab_channel=CoreySchafer)
* [Last year's slides](https://drive.google.com/file/d/1j__HsD32Tn987a5w2NwkS8KjObVziemh/view)

---

End
{"metaMigratedAt":"2023-06-17T00:57:09.276Z","metaMigratedFrom":"YAML","title":"Python requests","breaks":true,"slideOptions":"{\"transition\":\"fade\"}","contributors":"[{\"id\":\"f93c8d2e-91fa-44cf-b9d2-ea6d875fcb79\",\"add\":10512,\"del\":671}]"}
    788 views