# Note for Web Crawler (Web Scraping)
###### tags: Web Crawler , Web Scraping
## :memo: Introduction of Web Crawler
### Section 1 : Prior Knowledge of HTML:
- <**a**> ➜ **hyperlink**, used with **href**.
- <**img**> ➜ **image file**, used with **src**.
- <**video**> ➜ **video file**, also used with **src**.
- <**div**> ➜ a generic container; the content it wraps sits between its opening and closing tags.
### Section 2 : Characteristics of HTML:
- General feature: an element may carry **multiple classes**, but only **one id**.
- General note: like **XML**, **HTML** delimits its content with paired **tags**, starting with an opening tag such as **`<head>`** and ending with the matching **`</head>`** (just an example).
:::info
:bulb: **Extended data:**
Before doing this yourself, you can take a look at the **HTML** article on Wikipedia.
:pushpin: [Introduction to HyperText Markup Language (HTML)](https://zh.wikipedia.org/wiki/HTML)
:::
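The class/id rule above can be checked with a small sketch using Python's standard-library `html.parser` (the sample HTML string here is made up for illustration; the note itself uses BeautifulSoup later):

```python
from html.parser import HTMLParser

# Made-up snippet: the <div> carries two classes but a single id.
SAMPLE = '<div id="menu" class="list-rst highlight">Osaka</div>'

class AttrCollector(HTMLParser):
    """Collects the class values and id of every start tag."""
    def __init__(self):
        super().__init__()
        self.classes = []
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class":
                # class is one attribute holding space-separated values
                self.classes.extend(value.split())
            elif name == "id":
                self.ids.append(value)

parser = AttrCollector()
parser.feed(SAMPLE)
print(parser.classes)  # ['list-rst', 'highlight']
print(parser.ids)      # ['menu']
```

Note how the single `class` attribute expands to two values, while `id` stays unique.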
### Section 3 : Source code on a web page like "tabelog"
> **F12** : press F12 in the browser to view the page's source code
- **Hyperlink** : the source code starts with <**a**>, used with **href=**.

- **Image** : the source code starts with <**img**>, used with **src=**.

- **Video** : the source code starts with <**video**>, also used with **src=**.

- **Content** : what is wrapped between the opening tag and the closing tag is the element's content.

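As a sketch of how those attributes look in practice, this standard-library example pulls the `href` and `src` out of a tiny made-up fragment shaped like a tabelog result card (the real crawler below does the same job with BeautifulSoup):

```python
from html.parser import HTMLParser

# Made-up fragment in the shape of a tabelog result card.
SAMPLE = """
<li class="list-rst">
  <a class="list-rst__name-main" href="https://example.com/restaurant/1">Sushi</a>
  <img class="c-img" src="https://example.com/photo/1.jpg">
</li>
"""

class LinkCollector(HTMLParser):
    """Records every href (from <a>) and src (from <img>/<video>)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "a" and "href" in d:
            self.hrefs.append(d["href"])
        elif tag in ("img", "video") and "src" in d:
            self.srcs.append(d["src"])

parser = LinkCollector()
parser.feed(SAMPLE)
print(parser.hrefs)  # ['https://example.com/restaurant/1']
print(parser.srcs)   # ['https://example.com/photo/1.jpg']
```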
---
## Complete Web Crawler in python
- Step 1 : Install and import **BeautifulSoup** and **urllib** :


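Assuming the packages are not installed yet, they can be fetched with pip (these are the usual PyPI package names; `urllib` ships with Python itself):

```shell
pip install beautifulsoup4 pandas
```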
- Step 2 : Implement the web crawler in Python :
```python
from urllib.request import urlopen
# Importing bs4 runs its __init__ (the package's entry point), which exposes
# the BeautifulSoup class; other submodules need their own import,
# e.g. import bs4.dammit.
from bs4 import BeautifulSoup  # BeautifulSoup decodes the page into an HTML tree
import pandas as pd  # conventionally aliased as pd, since many snippets reference it that way

url = "https://tabelog.com/tw/osaka/rstLst/viking/1"
print(url)
response = urlopen(url)
html = BeautifulSoup(response, "html.parser", from_encoding="utf-8")

# Each restaurant card on tabelog is an <li class="list-rst">.
restaurants = html.find_all("li", class_="list-rst")
print(restaurants)

rows = []
for r in restaurants:
    en = r.find("a", class_="list-rst__name-main")    # English name + link
    ja = r.find("small", class_="list-rst__name-ja")  # Japanese name
    img = r.find("img", class_="c-img")               # thumbnail image
    print(ja.text, en["href"], img["src"])
    rows.append({"名稱": ja.text, "網址連結": en["href"], "圖片連結": img["src"]})

# DataFrame.append was removed in pandas 2.0; build the frame from the rows instead.
df = pd.DataFrame(rows, columns=["名稱", "網址連結", "圖片連結"])
df.to_csv("tabelog.csv", encoding="utf-8", index=False)
```