<style> .new { color: red; font-weight: bold; } </style>

# Python requests

###### tags: `presentation` `sprout`

資訊之芽 熊育霆 2022/05/22

---

Please clone this repo first — we'll use it later:
https://github.com/bearomorphism/sprout-request

```bash=
git clone https://github.com/bearomorphism/sprout-request.git
```

If you want to commit or push to it as your own project, remember to fork it first.

---

## Crawlers

What is a crawler?

* [Web crawler (Wikipedia)](https://zh.wikipedia.org/wiki/%E7%B6%B2%E8%B7%AF%E7%88%AC%E8%9F%B2)

----

### What can crawlers do?

* The web is full of pages containing very valuable knowledge
* https://userinyerface.com/
* For example, this one: https://longdogechallenge.com/
* And this one: http://www.partridgegetslucky.com/
* And this: https://thatsthefinger.com/
* We could extract the contents of these pages by hand

----

* But there is far too much valuable knowledge to copy page by page by hand
* For example: http://eelslap.com/
* And: https://matias.ma/nsfw/
* There are plenty more here: https://theuselessweb.com/
* Most importantly, we must not forget our friend who works so hard serving hot pot: https://www.youtube.com/watch?v=dQw4w9WgXcQ

Could we write a program to collect all of this extremely valuable information for us?

----

(The two slides above were copied from last year's deck.)

----

### Things crawlers are used for

* Serious: big-data analysis, statistics
* Not so serious:
    * Auto-downloading memes
    * Ticket-grabbing bots
    * Downloading comics
    * Downloading all sorts of magical things
    * We are not interested in your personal tastes

Feel free to share the crawler bots you write on GitHub and Discord.

----

You can also find bots other people have already written online and learn from them.

![](https://i.imgur.com/904KocZ.png)

----

Crawlers are much easier to write once we combine this with BS4 and Selenium, which we'll cover next week.
Think of today's class as an introduction(?

---

## Set up

[Install python requests](https://docs.python-requests.org/en/latest/user/install/#install)

Just like installing any other package — paste the line below into your terminal (the little black window):

```bash
pip install requests
```

----

![](https://i.smalljoys.me/2018/08/screen-shot-2018-08-27-at-4-48-11-pm.png)

---

## Recall: what are requests?

According to the [IBM Documentation](https://www.ibm.com/docs/en/cics-ts/5.3?topic=protocol-http-requests)

> An HTTP request is made by a client, to a named host, which is located on a server. The aim of the request is to access a resource on the server.

----

In plainer words (loosely paraphrased — forgive my rough translation):

> An **HTTP request** is made by a **client** and sent to a **host** that lives on a **server**. The goal of the request is to access a **resource** on that server.

----

Terminology mini-classroom (English ↔ Chinese)

* HTTP request: HTTP 請求
* client: 用戶端/客戶
* server: 伺服器
* host: 主機
* resource: 資源

----

While you browse the web, you are constantly asking servers for data, such as HTML. A web page is just your browser turning that HTML into something humans can read.

If you have way too much free time, take a look at [this article](https://github.com/alex/what-happens-when).

----

Recall: [HTTP request methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)

We covered this last week as well. In short:

* GET fetches data from the server
* POST sends data to the server

----

That said, you can also use GET to send data to the server and POST to fetch data from it — strange, right?

If you clicked through to the documentation on the previous slide, you'll have noticed there are other HTTP request methods, but in this class we will only use GET and POST.

----

Questions?

---

## Python Requests Module

If you prefer reading the official documentation, it's all here:

[English version](https://docs.python-requests.org/en/latest/)
[Chinese version](https://docs.python-requests.org/zh_CN/latest/)

----

Requests is an elegant and simple HTTP library for Python, built for human beings.

(The Chinese docs put it this way: Requests is the only Non-GMO Python HTTP library, safe for human consumption.)

---

## How to use `requests`

[xkcd](https://xkcd.com/) <- we'll use this in the practice

----

I recommend opening an ipython notebook and following along,
or open the repo you cloned earlier.

----

Set up

```python=
import requests

url = 'https://xkcd.com/'
res = requests.get(url)  # use GET to fetch the site's data
```

----

The variable is named `res`, short for response — simply put, it's what the server sends back to you after you make a request.

![](https://i.imgur.com/Y4hsjk7.png)

----

requests

![](https://i.imgur.com/rV8HgGz.png)

----

Print `res` and take a look:

```python=
import requests

url = 'https://xkcd.com/'
res = requests.get(url)
print(res)  # <Response [200]>
```

----

See which attributes `res` has:

```python=
import requests

url = 'https://xkcd.com/353/'
res = requests.get(url)
print(dir(res))
```

```python
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
```

----

Read the built-in help:

```python
help(res)
```

---

### Peeking at the attributes of res

Run these yourself:

```python=
print(res.status_code)
print(res.text)
print(res.content)
print(res.ok)
print(res.url)
```

----

```python=
print(res.status_code)
print(res.text)     # content as text
print(res.content)  # binary content
print(res.ok)       # OK or not
print(res.url)
```
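----

Not every request succeeds. Here's a minimal sketch (not from the original slides) of how you might check a response before using it. `status_code`, `ok`, and `raise_for_status` all showed up in `dir(res)` above; `timeout` is a standard keyword argument of `requests.get`. The comic number below is made up on purpose so the request fails.

```python=
import requests

# A comic number that (presumably) doesn't exist, so we get an error page
url = 'https://xkcd.com/99999999/info.0.json'

# timeout=5: give up if the server doesn't answer within 5 seconds
res = requests.get(url, timeout=5)

print(res.status_code)  # e.g. 404 for a page that doesn't exist
print(res.ok)           # False when status_code is 400 or above

if res.ok:
    print(res.text)
else:
    print("Something went wrong:", res.status_code)

# Or raise an exception on failure instead of checking by hand
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTP error:", e)
```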
---

### Downloading images

On xkcd, the comic site we used above, every page lists a direct link to the image right below the comic:

https://imgs.xkcd.com/comics/python.png

Paste it straight into your browser's address bar and the image shows up.

----

This is what you'll see after pasting it into the address bar:

![](https://imgs.xkcd.com/comics/python.png)

----

Before downloading the image, let's meet a handy library: **[pathlib](https://docs.python.org/3/library/pathlib.html)**

```python=
# Create an images folder
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)
# Reference: https://stackoverflow.com/questions/273192/how-can-i-safely-create-a-nested-directory
```

----

Download the image with requests:

```python=
import requests

# Create an images folder
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)
# Reference: https://stackoverflow.com/questions/273192/how-can-i-safely-create-a-nested-directory

url = 'https://imgs.xkcd.com/comics/python.png'
res = requests.get(url)

with open(Path("./images/python.png"), "wb") as f:
    f.write(res.content)
```

---

## [API](https://zh.wikipedia.org/wiki/%E5%BA%94%E7%94%A8%E7%A8%8B%E5%BA%8F%E6%8E%A5%E5%8F%A3)

First, let's understand what an API is.

----

Explained with a single picture:

![](https://img-comment-fun.9cache.com/media/aAYBA2o/aqxMnkn4_700w_0.jpg)

----

* Frontend: the part you can see — code that runs on your machine (the client), e.g. the rendered web page
    * HTML, JS, CSS
    * Side note: [Is HTML a programming language?](https://www.google.com/search?q=Is+html+a+programming+language&oq=Is+html+a+programming+language&aqs=chrome..69i57.10829j0j7&sourceid=chrome&ie=UTF-8)
* Backend: the part you can't see — code that runs on the server, such as the database
    * It can be written in many other languages: Python, NodeJs, C/C++, PHP, Go ...
* API: the interface through which the frontend and backend talk to each other — roughly, a waiter/little elf that carries data back and forth between your computer and the server

----

xkcd's API documentation: https://xkcd.com/json.html

----

https://xkcd.com/614/info.0.json

```json=
{
    "month": "7",
    "num": 614,
    "link": "",
    "year": "2009",
    "news": "",
    "safe_title": "Woodpecker",
    "transcript": "[[A man with a beret and a woman are standing on a boardwalk, leaning on a handrail.]]\nMan: A woodpecker!\n<<Pop pop pop>>\nWoman: Yup.\n\n[[The woodpecker is banging its head against a tree.]]\nWoman: He hatched about this time last year.\n<<Pop pop pop pop>>\n\n[[The woman walks away. The man is still standing at the handrail.]]\n\nMan: ... woodpecker?\nMan: It's your birthday!\n\nMan: Did you know?\n\nMan: Did... did nobody tell you?\n\n[[The man stands, looking.]]\n\n[[The man walks away.]]\n\n[[There is a tree.]]\n\n[[The man approaches the tree with a present in a box, tied up with ribbon.]]\n\n[[The man sets the present down at the base of the tree and looks up.]]\n\n[[The man walks away.]]\n\n[[The present is sitting at the bottom of the tree.]]\n\n[[The woodpecker looks down at the present.]]\n\n[[The woodpecker sits on the present.]]\n\n[[The woodpecker pulls on the ribbon tying the present closed.]]\n\n((full width panel))\n[[The woodpecker is flying, with an electric drill dangling from its feet, held by the cord.]]\n\n{{Title text: If you don't have an extension cord I can get that too. Because we're friends! Right?}}",
    "alt": "If you don't have an extension cord I can get that too. Because we're friends! Right?",
    "img": "https://imgs.xkcd.com/comics/woodpecker.png",
    "title": "Woodpecker",
    "day": "24"
}
```

Look at the JSON above — one of the fields gives us the URL of the image.

----

Putting it together with requests:

```python=
import requests
from pathlib import Path

Path("./images").mkdir(parents=True, exist_ok=True)

url = 'https://xkcd.com/614/info.0.json'
res = requests.get(url)
res_json = res.json()  # returns a dictionary
print(res_json)

img_url = res_json['img']
print(img_url)

img_name = img_url.split('/')[-1]
with open(Path(f"./images/{img_name}"), "wb") as f:
    img_res = requests.get(img_url)
    f.write(img_res.content)
```

----

You can play around with these two sites:

* [JSONPlaceholder - Free Fake REST API](https://jsonplaceholder.typicode.com/)
* [Lorem Picsum](https://picsum.photos/)

----

Most sites won't let you hit their API freely — you'll need an API key.
Since the setup is more involved (you have to apply for a token or an account), we'll come back to this at the end of class if there's time.
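----

Just so you've seen the shape of it, here is a rough sketch of how an API key is typically sent. The `/headers` endpoint on httpbin simply echoes back the headers you send; `YOUR_TOKEN` is a placeholder, and the exact header name differs between APIs (many use `Authorization: Bearer <token>`, others use a custom header or a query parameter — always check the docs of the API you're calling).

```python=
import requests

# Placeholder — a real API gives you a token when you register
token = "YOUR_TOKEN"

# Many APIs expect the key in the Authorization header
headers = {"Authorization": f"Bearer {token}"}

# httpbin.org/headers echoes the headers it received,
# so you can see exactly what the server would get
res = requests.get("https://httpbin.org/headers", headers=headers, timeout=5)
print(res.json())
```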
---

## Practice (5 min)

Open xkcd-api-practice.py and start practicing.

---

## Advanced usage

The rest is a bit of a grab bag; next week's crawler lesson may need it.

First, open this site: https://httpbin.org/

----

### GET query

[Query string - Wikipedia](https://en.wikipedia.org/wiki/Query_string)

----

Recall: the query string is the part after the question mark.

![](https://i.imgur.com/Ep2vvNN.png)

----

https://httpbin.org/get
https://httpbin.org/get?a=1&rick=roll

----

```python=
import requests

url = 'https://httpbin.org/get?a=1&rick=roll'
res = requests.get(url)
print(res.text)
```

----

A safer way to do it:

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}
url = 'https://httpbin.org/get'
res = requests.get(url, params=payload)
print(res.text)
print(res.url)  # https://httpbin.org/get?a=1&rick=roll
```

---

### POST

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}
res = requests.post('https://httpbin.org/post', data=payload)
print(res.text)
```

----

```json
{
  ...
  "form": {
    "a": "1",
    "rick": "roll"
  },
  ...
}
```
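----

One more POST variant you'll run into (not on the original slides): many modern APIs want a JSON body instead of form data. The `json=` parameter is a standard `requests` feature — it serializes the dict to JSON and sets the `Content-Type` header for you, and httpbin echoes it back under the `"json"` key.

```python=
import requests

payload = {'a': 1, 'rick': 'roll'}

# data=payload -> sent as form data   (appears under "form" on httpbin)
# json=payload -> sent as a JSON body (appears under "json" on httpbin)
res = requests.post('https://httpbin.org/post', json=payload)
print(res.json()['json'])  # {'a': 1, 'rick': 'roll'}
```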
---

Any Questions?

![](https://i.imgur.com/w2xd7oj.png)

---

## Odds and ends

* [A LINE bot for 長輩圖 (elder-style greeting images), written by a friend from my department](https://github.com/superr0ng/LINEtools)
* [API List: A public list of free APIs for programmers](https://apilist.fun/)

---

## References

* [Python Requests Tutorial: Request Web Pages, Download Images, POST Data, Read JSON, and More - YouTube](https://www.youtube.com/watch?v=tb8gHvYlCFs&t=1266s&ab_channel=CoreySchafer)
* [Last year's slides](https://drive.google.com/file/d/1j__HsD32Tn987a5w2NwkS8KjObVziemh/view)

---

End
{"metaMigratedAt":"2023-06-17T00:57:09.276Z","metaMigratedFrom":"YAML","title":"Python requests","breaks":true,"slideOptions":"{\"transition\":\"fade\"}","contributors":"[{\"id\":\"f93c8d2e-91fa-44cf-b9d2-ea6d875fcb79\",\"add\":10512,\"del\":671}]"}
    788 views