# Note for Web Crawler (Web Scraping)
###### tags: Web Crawler, Web Scraping

## :memo: Introduction of Web Crawler

### Section 1 : Prior Knowledge of HTML:
- <**a**> ➜ **Hyperlink**, used with **href**.
- <**img**> ➜ **Image file**, used with **src**.
- <**video**> ➜ **Video file**, also used with **src**.
- <**div**> ➜ a generic area; the content it wraps is grouped into one block.

### Section 2 : Characteristics of HTML:
- General feature : an element can have multiple **class** values, but only one **id**.
- General note : like **XML**, **HTML** delimits its content with paired **tags**; for example, an element starts with **<head>** and ends with **</head>** (just as an example).

:::info
:bulb: **Extended data:** Before trying this yourself, you can take a look at the **HTML** article on Wikipedia.
:pushpin: [Introduction to HyperText Markup Language (HTML)](https://zh.wikipedia.org/wiki/HTML)
:::

### Section 3 : Source code of a web page such as "tabelog"
> **F12** : view the source code of the page

- **Hyperlink** : source code starting with <**a**>, used with **href=**.
![](https://i.imgur.com/vxUovGy.png)
- **Image** : source code starting with <**img**>, used with **src=**.
![](https://i.imgur.com/sLZXkzi.png)
- **Video** : source code starting with <**video**>, also used with **src=**.
![](https://i.imgur.com/PmpQ1Ag.jpg)
- **Note** : the content wrapped between an opening tag and its matching closing tag belongs to that element.
![](https://i.imgur.com/2aZkDDX.jpg)

---

## Complete Web Crawler in Python

- Step 1 : Import **BeautifulSoup** and **urllib** :
![](https://i.imgur.com/ODvKbKz.png)
![](https://i.imgur.com/sRTQ3Wb.png)
- Step 2 : Implement the web crawler in Python :

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # importing bs4 runs its __init__, which exposes the BeautifulSoup class; other modules (e.g. bs4.dammit) must be imported explicitly
# BeautifulSoup mainly decodes the page and parses it into an HTML tree
import pandas as pd  # conventionally aliased as pd

df = pd.DataFrame(columns=["名稱", "網址連結", "圖片連結"])  # columns: name, page link, image link

url = "https://tabelog.com/tw/osaka/rstLst/viking/1"
print(url)
response = urlopen(url)
html = BeautifulSoup(response, "html.parser", from_encoding="utf-8")

restaurants = html.find_all("li", class_="list-rst")
print(restaurants)

for r in restaurants:
    en = r.find("a", class_="list-rst__name-main")
    ja = r.find("small", class_="list-rst__name-ja")
    img = r.find("img", class_="c-img")
    print(ja.text, en["href"], img["src"])
    s = pd.Series([ja.text, en["href"], img["src"]], index=["名稱", "網址連結", "圖片連結"])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the current equivalent
    df = pd.concat([df, s.to_frame().T], ignore_index=True)

df.to_csv("tabelog.csv", encoding="utf-8", index=False)
```
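
To see how the `find` / `find_all` calls above relate to the HTML basics from Sections 1–3 without hitting the network, here is a minimal sketch that parses a hand-written fragment; the tag contents, class names, and URLs in the fragment are invented purely for illustration.

```python
from bs4 import BeautifulSoup

# A hand-written fragment using the tags from Section 1 (class/id values are made up).
snippet = """
<div id="menu" class="list promo">
  <a class="shop-name" href="https://example.com/shop/1">Example Shop</a>
  <img class="shop-photo" src="https://example.com/shop/1.jpg">
</div>
"""

html = BeautifulSoup(snippet, "html.parser")

link = html.find("a", class_="shop-name")     # first <a> with that class
photo = html.find("img", class_="shop-photo") # first <img> with that class
area = html.find("div", id="menu")            # an id is unique; class may hold several values

print(link.text, link["href"])  # Example Shop https://example.com/shop/1
print(photo["src"])             # https://example.com/shop/1.jpg
print(area["class"])            # ['list', 'promo'] – multiple classes, one id
```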
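
The listing URL in Step 2 ends with a page number (`.../viking/1`), so the same parsing logic can be repeated across several pages. The sketch below assumes that swapping this trailing number is how tabelog paginates (an assumption, not verified here); it also collects rows in a plain list and builds the DataFrame once at the end, which avoids repeated concatenation inside the loop.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

rows = []
for page in range(1, 4):  # pages 1–3; adjust as needed
    # Assumption: the trailing number in the listing URL is the page index.
    url = f"https://tabelog.com/tw/osaka/rstLst/viking/{page}"
    html = BeautifulSoup(urlopen(url), "html.parser", from_encoding="utf-8")
    for r in html.find_all("li", class_="list-rst"):
        en = r.find("a", class_="list-rst__name-main")
        ja = r.find("small", class_="list-rst__name-ja")
        img = r.find("img", class_="c-img")
        if en and ja and img:  # skip entries missing any of the three fields
            rows.append({"名稱": ja.text, "網址連結": en["href"], "圖片連結": img["src"]})

df = pd.DataFrame(rows, columns=["名稱", "網址連結", "圖片連結"])
df.to_csv("tabelog_pages.csv", encoding="utf-8", index=False)
```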