###### tags: `python`
# Scraping Website Data
## The webbrowser module: open a given URL in the browser
### Open a web page at a given URL
```python=
import webbrowser
webbrowser.open('http://www.yahoo.com.tw')
```
### Open multiple pages from a list
```python=
import webbrowser
locations = [
    '台北車站',
    '台中車站',
    '台北市中華路一段174-1號'
]
for i in locations:
    webbrowser.open('https://www.google.com/maps/search/' + i)
```
### Find a location on Google Maps by keyword
```python=
import webbrowser
add = input('請輸入要找尋的地點: ')
webbrowser.open('http://www.google.com/maps/search/' + add)
```
---
### Exercise 1
:::success
Starting from the examples above, let the user enter a stock symbol and
open the Yahoo Stock page for that symbol (showing the opening price, closing price, and other information).
One possible solution sketch is shown below the box.
:::
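A minimal sketch of one possible solution, assuming the Yahoo Stock quote URL format that appears in the stock-scraping example later in this note; the exact URL format may change on Yahoo's side:
```python=
import webbrowser

# Sketch only: the quote URL format is taken from the Yahoo stock
# example later in this note and is an assumption here.
stock_id = input('請輸入股票代號: ')
webbrowser.open('https://tw.stock.yahoo.com/q/q?s=' + stock_id)
```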
### Show a list of YouTube videos for a keyword
```python=
import webbrowser
keyword = input('請輸入要找尋的影片: ')
webbrowser.open('https://www.youtube.com/results?search_query=' + keyword)
```
----
## The requests module: download page data from a given URL
```bash=
pip3 install requests
```
### Download page content from a URL
```python=
import requests
url1 = 'http://www.grandtech.info'
htmlfile1 = requests.get(url1)
print(htmlfile1)
print(htmlfile1.status_code)
print(htmlfile1.text)
url2 = 'http://www.grandtech.info/ddd'
htmlfile2 = requests.get(url2)
print(htmlfile2)
print(htmlfile2.status_code)
print(htmlfile2.text)
```
### Handling different kinds of URLs
```python=
import requests
url = 'http://www.grandtech.info'
# url = 'http://www.grandtech.info/dddd'
# url = 'http://abc.com.kk'
try:
    htmlfile = requests.get(url)
    print(htmlfile)
    print(htmlfile.status_code)
    print(htmlfile.text)
    if htmlfile.status_code == requests.codes.ok:
        print('取回網頁資料', len(htmlfile.text))
    else:
        print('網頁資料取得失敗')
except Exception as e:
    print('網頁取得失敗', e)
```
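As a variant (not in the original example), requests also lets you set a timeout and catch its own exception types instead of a bare `Exception`; a minimal sketch:
```python=
import requests

url = 'http://www.grandtech.info'
try:
    # timeout avoids hanging forever on an unreachable host
    htmlfile = requests.get(url, timeout=5)
    htmlfile.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print('取回網頁資料', len(htmlfile.text))
except requests.exceptions.RequestException as e:
    print('網頁取得失敗', e)
```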
### Fetch while pretending to be a regular browser
```python=
import requests
url = 'http://aaa.24ht.com.tw'
htmlfile = requests.get(url)
# headers = {
# 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) \
# AppleWebKit/537.36 (KHTML, Gecko) Chrome/45.0.2454.101 \
# Safari/537.36'
# }
# htmlfile = requests.get(url, headers=headers)
htmlfile.encoding = 'utf8'
htmlfile.raise_for_status()
print(htmlfile.text)
```
### Fetch page data and save it to a file
```python=
import requests
url = 'https://www.yahoo.co.jp'
htmlfile = requests.get(url)
file_name = 'webpage.html'
with open(file_name, 'w', encoding='utf-8') as file_obj:
    file_obj.write(htmlfile.text)
```
```python=
import requests
url = 'https://www.google.com.tw/search?q={0}+food&oq={0}'
keyword = input('請輸入要找的關鍵字: ')
htmlfile = requests.get(url.format(keyword))
file_name = 'webpage.html'
with open(file_name, 'w', encoding='utf-8') as file_obj:
    file_obj.write(htmlfile.text)
```
### Download a PDF file
```python=
import requests
url = 'https://www.taiwandns.com/wp-content/plugins/post2pdf-converter/post2pdf-converter-pdf-maker.php?id=4720&file=id&font=droidsansfallback&monospaced=droidsansfallback&fontsize=13&subsetting=0&ratio=1.35&header=1&title=1&wrap_title=0&logo=1&logo_file=logo.png&logo_width=60&footer=1&filters=1&shortcode=parse&ffamily=0'
htmlfile = requests.get(url)
file_name = 'webpage.pdf'
with open(file_name, 'wb') as file_obj:
    for content in htmlfile.iter_content(1024):
        size = file_obj.write(content)
        print(size)
```
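For larger files, a variant (an assumption, not in the original) is to pass `stream=True` so the body is fetched chunk by chunk instead of all at once; the URL below is a placeholder:
```python=
import requests

# Streaming download sketch: stream=True keeps requests from loading
# the whole response into memory before we write it to disk.
url = 'https://example.com/sample.pdf'  # placeholder URL
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open('download.pdf', 'wb') as file_obj:
        for chunk in resp.iter_content(chunk_size=1024):
            file_obj.write(chunk)
```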
---
## pytube: download YouTube videos
```bash=
pip install pytube
```
```python=
from pytube import YouTube
YouTube('http://youtube.com/watch?v=9bZkp7q19f0').streams.first().download()
```
### Download in a specific video format (webm, mp4, 3gpp)
```python=
from pytube import YouTube
yt = YouTube('https://www.youtube.com/watch?v=axHKi38q8ag')
yt.streams.filter(progressive=True, file_extension='mp4').first().download()
```
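To see which streams are available before downloading, a short sketch (the stream attributes shown here follow the pytube API as far as I know; double-check them against your installed version):
```python=
from pytube import YouTube

yt = YouTube('https://www.youtube.com/watch?v=axHKi38q8ag')
# List the progressive (audio + video) streams so you can pick a format.
for stream in yt.streams.filter(progressive=True):
    print(stream.itag, stream.mime_type, stream.resolution)
```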
---
## The BeautifulSoup module: parse HTML
```bash=
pip install beautifulsoup4
pip install lxml
```
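Before the scraping examples below, a minimal sketch of the BeautifulSoup calls they rely on (`find`, `find_all`, and reading tag attributes); the HTML fragment is made up purely for illustration:
```python=
from bs4 import BeautifulSoup

# Made-up HTML fragment, just to show the basic API.
html_doc = '<div id="content"><a href="/a">first</a><a href="/b">second</a></div>'
soup = BeautifulSoup(html_doc, 'lxml')
div = soup.find('div', id='content')  # first tag that matches
for a in div.find_all('a'):           # every <a> inside that div
    print(a['href'], a.text)
```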
## Get YouTube video URLs for a keyword #1
```python=
import requests
from bs4 import BeautifulSoup
youtube_url = 'https://www.youtube.com'
url = 'https://www.youtube.com/results?search_query='
keyword = input('請輸入要找的音樂關鍵字')
response = requests.get(url + keyword)
html_doc = response.text
soup = BeautifulSoup(html_doc, "lxml")
# print(soup.prettify())  # print the prettified HTML
div_result = soup.find('div', id='content')
a_result = div_result.find_all('a')
for a in a_result:
    print(youtube_url + a['href'], a.text)
```
## Get YouTube video URLs for a keyword #2 (improved results)
```python=
import re
import requests
from bs4 import BeautifulSoup
youtube_url = 'https://www.youtube.com'
keyword = input('請輸入要找的歌曲關鍵字: ')
resp = requests.get('https://www.youtube.com/results?search_query=' + keyword)
soup = BeautifulSoup(resp.text, 'lxml')
# print(soup.prettify())
# print(soup.title)
# print(soup.title.name)
# print(soup.title.string)
# print(soup.a)
# print(soup.a['id'])
div_result = soup.find_all('a', 'yt-uix-tile-link')
for div in div_result:
    print(youtube_url + div['href'], div.string)
```
## Get article titles and URLs from a PTT board
```python=
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.ptt.cc'
template_url = base_url + '/bbs/{0}/index.html'
groups = {
    '八卦板': 'Gossiping',
    '電影板': 'Movie'
}

def get_dom(group):
    resp = requests.get(
        url=template_url.format(groups[group]),
        cookies={'over18': '1'}
    )
    if resp.status_code != 200:
        print('網址不正確:', resp.url)
        return None
    else:
        return resp.text

def get_title(dom):
    titles = []
    soup = BeautifulSoup(dom, "lxml")
    div_result = soup.find_all('div', 'r-ent')
    for div in div_result:
        res = div.find('div', 'title').find('a')
        if res:
            titles.append({
                'title': res.string,
                'url': base_url + res['href']
            })
    return titles

if __name__ == '__main__':
    dom = get_dom('電影板')
    if dom:
        titles = get_title(dom)
        for t in titles:
            print(t['url'], t['title'])
```
## Scrape Yahoo stock data
```python=
import re
import requests
from bs4 import BeautifulSoup
yahoo_url = 'https://tw.stock.yahoo.com/q/q?s='
def convert_string(x):
    return x.string
keyword = input('請輸入要找的股票代號: ')
resp = requests.get(yahoo_url + keyword)
soup = BeautifulSoup(resp.text, 'lxml')
table_result = soup.find("table", width="750", border="2")
trs = table_result.find_all("tr")
ths = trs[0].find_all("th")
ths = map(convert_string, ths)
tds = trs[1].find_all("td")
tds = map(convert_string, tds)
stock_info = dict(zip(ths, tds))
print(stock_info)
```
----
# Build a simple website with a login feature (extra material, feel free to skip)
Reference: [Create a simple login](http://pythonforengineers.com/create-simple-login/)
[Download the sample files](https://drive.google.com/file/d/1ZwjLY--NPMOpLD1TWLl51nIMVBFLvO2I/view?usp=sharing)
## 1. Install the package and set up the environment
```shell=
pip3 install flask
```
## 2. Login page (login.html)
Put this file in the templates folder.
```htmlmixed=
<!DOCTYPE html>
<form action="" method="post">
    帳號: <input type="text" name="username" value="">
    密碼: <input type="text" name="password" value="">
    <p><input type="submit" value="登入"></p>
    {% if error %}
        <b> 錯誤: </b> {{error}}
    {% endif %}
</form>
```
## 3. Post-login page (secret.html)
Put this file in the templates folder.
```htmlmixed=
<!DOCTYPE html>
<html>
<p>歡迎登入本系統</p>
<a href="logout">登出</a>
</html>
```
## 4. Main program (blog.py)
```python=
from flask import *

SECRET_KEY = 'this is a secret'
app = Flask(__name__)
app.config.from_object(__name__)

@app.route("/login", methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        # both the username and the password must match
        if request.form['username'] == "amos" and request.form['password'] == 'python':
            session['logged_in'] = True
            return redirect(url_for('secret'))
        else:
            error = "使用者帳號或密碼錯誤"
    return render_template("login.html", error=error)

@app.route("/secret")
def secret():
    return render_template("secret.html")

@app.route('/logout')
def logout():
    session.pop('logged_in', None)
    return redirect(url_for('login'))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)
```
## 5. Start the site
Run the following command in a terminal:
```shell=
python3 blog.py
```
## 6. Open a browser and log in to the site
- localhost:9000/login
## 7. Test fetching a page that requires a username and password
```python=
import requests
payload = {
'username': 'amos',
'password': 'python',
}
r = requests.post('http://localhost:9000/login', data=payload)
print(r.status_code)              # status code
print(r.headers['content-type'])  # content type
print(r.encoding)                 # text encoding
print(r.text)                     # page content
```
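The POST above only returns the login response itself. To fetch a page behind the login (such as `/secret` in the app above), one approach is `requests.Session()`, which keeps the cookie set at login for later requests; note that the sample app does not actually guard `/secret` with the session flag, so this sketch only illustrates the pattern:
```python=
import requests

payload = {'username': 'amos', 'password': 'python'}

# A Session keeps the cookies from the login response,
# so follow-up requests are sent as the logged-in user.
with requests.Session() as s:
    s.post('http://localhost:9000/login', data=payload)
    r = s.get('http://localhost:9000/secret')
    print(r.status_code)
    print(r.text)
```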
## Notes
- To let other computers reach this site, first check this computer's IP address
    - ipconfig (Windows)
    - ifconfig (Linux)
- Open the chosen port in the firewall (9000 here)
- To make the site reachable from outside the local network, set up forwarding from the external fixed IP to this computer
----
## Further reading
[網頁開發人員應了解的 HTTP 狀態碼](https://blog.miniasp.com/post/2009/01/16/Web-developer-should-know-about-HTTP-Status-Code.aspx)
[[Python] Requests 教學](http://zwindr.blogspot.com/2016/08/python-requests.html)
[給初學者的 Python 網頁爬蟲與資料分析 (1) 前言](http://blog.castman.net/%E6%95%99%E5%AD%B8/2016/12/19/python-data-science-tutorial-1.html)
[Python 使用 Beautiful Soup 抓取與解析網頁資料,開發網路爬蟲教學](https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/)
[[Python] Beautifulsoup4 教學](http://zwindr.blogspot.com/2016/08/python-beautifulsoup4.html)
[Beautiful Soup 4 Cheatsheet](http://akul.me/blog/2016/beautifulsoup-cheatsheet/)