###### tags: `python`
# Scraping Website Data
## The webbrowser module: open a given URL in the browser
### Open a web page at a given URL
```python=
import webbrowser
webbrowser.open('http://www.yahoo.com.tw')
```
### Open multiple pages from a list
```python=
import webbrowser
locations = [
    '台北車站',
    '台中車站',
    '台北市中華路一段174-1號'
]
for i in locations:
    webbrowser.open('https://www.google.com/maps/search/' + i)
```
### Find a location on Google Maps by keyword
```python=
import webbrowser
add = input('請輸入要找尋的地點: ')
webbrowser.open('http://www.google.com/maps/search/' + add)
```
---
### Exercise 1
:::success
Starting from the examples above, let the user enter a stock symbol and
open the Yahoo Stock page for that symbol (showing the opening price, closing price, and other information).
One possible solution sketch is shown below the box.
:::
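A minimal sketch of one possible solution, assuming the Yahoo Stock quote URL format that appears in the stock-scraping example later in this note; the exact URL format may change on Yahoo's side:
```python=
import webbrowser

# Sketch only: the quote URL format is taken from the Yahoo stock
# example later in this note and is an assumption here.
stock_id = input('請輸入股票代號: ')
webbrowser.open('https://tw.stock.yahoo.com/q/q?s=' + stock_id)
```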
### Show a list of YouTube videos for a keyword
```python=
import webbrowser
keyword = input('請輸入要找尋的影片: ')
webbrowser.open('https://www.youtube.com/results?search_query=' + keyword)
```
----
## The requests module: download page data from a given URL
```bash=
pip3 install requests
```
### Download page content from a URL
```python=
import requests
url1 = 'http://www.grandtech.info'
htmlfile1 = requests.get(url1)
print(htmlfile1)
print(htmlfile1.status_code)
print(htmlfile1.text)
url2 = 'http://www.grandtech.info/ddd'
htmlfile2 = requests.get(url2)
print(htmlfile2)
print(htmlfile2.status_code)
print(htmlfile2.text)
```
### Handling different kinds of URLs
```python=
import requests
url = 'http://www.grandtech.info'
# url = 'http://www.grandtech.info/dddd'
# url = 'http://abc.com.kk'
try:
    htmlfile = requests.get(url)
    print(htmlfile)
    print(htmlfile.status_code)
    print(htmlfile.text)
    if htmlfile.status_code == requests.codes.ok:
        print('取回網頁資料', len(htmlfile.text))
    else:
        print('網頁資料取得失敗')
except Exception as e:
    print('網頁取得失敗', e)
```
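As a variant (not in the original example), requests also lets you set a timeout and catch its own exception types instead of a bare `Exception`; a minimal sketch:
```python=
import requests

url = 'http://www.grandtech.info'
try:
    # timeout avoids hanging forever on an unreachable host
    htmlfile = requests.get(url, timeout=5)
    htmlfile.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print('取回網頁資料', len(htmlfile.text))
except requests.exceptions.RequestException as e:
    print('網頁取得失敗', e)
```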
### Fetch while pretending to be a regular browser
```python=
import requests
url = 'http://aaa.24ht.com.tw'
htmlfile = requests.get(url)
# headers = {
# 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) \
# AppleWebKit/537.36 (KHTML, Gecko) Chrome/45.0.2454.101 \
# Safari/537.36'
# }
# htmlfile = requests.get(url, headers=headers)
htmlfile.encoding = 'utf8'
htmlfile.raise_for_status()
print(htmlfile.text)
```
### Fetch page data and save it to a file
```python=
import requests
url = 'https://www.yahoo.co.jp'
htmlfile = requests.get(url)
file_name = 'webpage.html'
with open(file_name, 'w', encoding='utf-8') as file_obj:
    file_obj.write(htmlfile.text)
```
```python=
import requests
url = 'https://www.google.com.tw/search?q={0}+food&oq={0}'
keyword = input('請輸入要找的關鍵字: ')
htmlfile = requests.get(url.format(keyword))
file_name = 'webpage.html'
with open(file_name, 'w', encoding='utf-8') as file_obj:
    file_obj.write(htmlfile.text)
```
### Download a PDF file
```python=
import requests
url = 'https://www.taiwandns.com/wp-content/plugins/post2pdf-converter/post2pdf-converter-pdf-maker.php?id=4720&file=id&font=droidsansfallback&monospaced=droidsansfallback&fontsize=13&subsetting=0&ratio=1.35&header=1&title=1&wrap_title=0&logo=1&logo_file=logo.png&logo_width=60&footer=1&filters=1&shortcode=parse&ffamily=0'
htmlfile = requests.get(url)
file_name = 'webpage.pdf'
with open(file_name, 'wb') as file_obj:
    for content in htmlfile.iter_content(1024):
        size = file_obj.write(content)
        print(size)
```
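For larger files, a variant (an assumption, not in the original) is to pass `stream=True` so the body is fetched chunk by chunk instead of all at once; the URL below is a placeholder:
```python=
import requests

# Streaming download sketch: stream=True keeps requests from loading
# the whole response into memory before we write it to disk.
url = 'https://example.com/sample.pdf'  # placeholder URL
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open('download.pdf', 'wb') as file_obj:
        for chunk in resp.iter_content(chunk_size=1024):
            file_obj.write(chunk)
```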
---
## pytube: download YouTube videos
```bash=
pip install pytube
```
```python=
from pytube import YouTube
YouTube('http://youtube.com/watch?v=9bZkp7q19f0').streams.first().download()
```
### Download in a specific video format (webm, mp4, 3gpp)
```python=
from pytube import YouTube
yt = YouTube('https://www.youtube.com/watch?v=axHKi38q8ag')
yt.streams.filter(progressive=True, file_extension='mp4').first().download()
```
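To see which streams are available before downloading, a short sketch (the stream attributes shown here follow the pytube API as far as I know; double-check them against your installed version):
```python=
from pytube import YouTube

yt = YouTube('https://www.youtube.com/watch?v=axHKi38q8ag')
# List the progressive (audio + video) streams so you can pick a format.
for stream in yt.streams.filter(progressive=True):
    print(stream.itag, stream.mime_type, stream.resolution)
```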
---
## The BeautifulSoup module: parse HTML
```bash=
pip install beautifulsoup4
pip install lxml
```
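Before the scraping examples below, a minimal sketch of the BeautifulSoup calls they rely on (`find`, `find_all`, and reading tag attributes); the HTML fragment is made up purely for illustration:
```python=
from bs4 import BeautifulSoup

# Made-up HTML fragment, just to show the basic API.
html_doc = '<div id="content"><a href="/a">first</a><a href="/b">second</a></div>'
soup = BeautifulSoup(html_doc, 'lxml')
div = soup.find('div', id='content')  # first tag that matches
for a in div.find_all('a'):           # every <a> inside that div
    print(a['href'], a.text)
```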
## Get YouTube video URLs for a keyword #1
```python=
import requests
from bs4 import BeautifulSoup
youtube_url = 'https://www.youtube.com'
url = 'https://www.youtube.com/results?search_query='
keyword = input('請輸入要找的音樂關鍵字')
response = requests.get(url + keyword)
html_doc = response.text
soup = BeautifulSoup(html_doc, "lxml")
# print(soup.prettify())  # print the prettified HTML
div_result = soup.find('div', id='content')
a_result = div_result.find_all('a')
for a in a_result:
    print(youtube_url + a['href'], a.text)
```
## Get YouTube video URLs for a keyword #2 (improved results)
```python=
import re
import requests
from bs4 import BeautifulSoup
youtube_url = 'https://www.youtube.com'
keyword = input('請輸入要找的歌曲關鍵字: ')
resp = requests.get('https://www.youtube.com/results?search_query=' + keyword)
soup = BeautifulSoup(resp.text, 'lxml')
# print(soup.prettify())
# print(soup.title)
# print(soup.title.name)
# print(soup.title.string)
# print(soup.a)
# print(soup.a['id'])
div_result = soup.find_all('a', 'yt-uix-tile-link')
for div in div_result:
    print(youtube_url + div['href'], div.string)
```
## Get article titles and URLs from a PTT board
```python=
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.ptt.cc'
template_url = base_url + '/bbs/{0}/index.html'
groups = {
    '八卦板': 'Gossiping',
    '電影板': 'Movie'
}

def get_dom(group):
    resp = requests.get(
        url=template_url.format(groups[group]),
        cookies={'over18': '1'}
    )
    if resp.status_code != 200:
        print('網址不正確:', resp.url)
        return None
    else:
        return resp.text

def get_title(dom):
    titles = []
    soup = BeautifulSoup(dom, "lxml")
    div_result = soup.find_all('div', 'r-ent')
    for div in div_result:
        res = div.find('div', 'title').find('a')
        if res:
            titles.append({
                'title': res.string,
                'url': base_url + res['href']
            })
    return titles

if __name__ == '__main__':
    dom = get_dom('電影板')
    if dom:
        titles = get_title(dom)
        for t in titles:
            print(t['url'], t['title'])
```
## Scrape Yahoo stock data
```python=
import re
import requests
from bs4 import BeautifulSoup
yahoo_url = 'https://tw.stock.yahoo.com/q/q?s='
def convert_string(x):
    return x.string
keyword = input('請輸入要找的股票代號: ')
resp = requests.get(yahoo_url + keyword)
soup = BeautifulSoup(resp.text, 'lxml')
table_result = soup.find("table", width="750", border="2")
trs = table_result.find_all("tr")
ths = trs[0].find_all("th")
ths = map(convert_string, ths)
tds = trs[1].find_all("td")
tds = map(convert_string, tds)
stock_info = dict(zip(ths, tds))
print(stock_info)
```
----
# Build a simple website with a login feature (extra material, feel free to skip)
Reference: [Create a simple login](http://pythonforengineers.com/create-simple-login/)
[Download the sample files](https://drive.google.com/file/d/1ZwjLY--NPMOpLD1TWLl51nIMVBFLvO2I/view?usp=sharing)
## 1. Install the package and set up the environment
```shell=
pip3 install flask
```
## 2. Login page (login.html)
Put this file in the templates folder.
```htmlmixed=
<!DOCTYPE html>
<form action="" method="post">
    帳號: <input type="text" name="username" value="">
    密碼: <input type="text" name="password" value="">
    <p><input type="submit" value="登入"></p>
    {% if error %}
        <b> 錯誤: </b> {{error}}
    {% endif %}
</form>
```
## 3. Post-login page (secret.html)
Put this file in the templates folder.
```htmlmixed=
<!DOCTYPE html>
<html>
<p>歡迎登入本系統</p>
<a href="logout">登出</a>
</html>
```
## 4. Main program (blog.py)
```python=
from flask import *

SECRET_KEY = 'this is a secret'
app = Flask(__name__)
app.config.from_object(__name__)

@app.route("/login", methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        # both the username and the password must match
        if request.form['username'] == "amos" and request.form['password'] == 'python':
            session['logged_in'] = True
            return redirect(url_for('secret'))
        else:
            error = "使用者帳號或密碼錯誤"
    return render_template("login.html", error=error)

@app.route("/secret")
def secret():
    return render_template("secret.html")

@app.route('/logout')
def logout():
    session.pop('logged_in', None)
    return redirect(url_for('login'))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)
```
## 5. Start the site
Run the following command in a terminal:
```shell=
python3 blog.py
```
## 6. Open a browser and log in to the site
- localhost:9000/login
## 7. Test fetching a page that requires a username and password
```python=
import requests
payload = {
'username': 'amos',
'password': 'python',
}
r = requests.post('http://localhost:9000/login', data=payload)
print(r.status_code)              # status code
print(r.headers['content-type'])  # content type
print(r.encoding)                 # text encoding
print(r.text)                     # page content
```
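The POST above only returns the login response itself. To fetch a page behind the login (such as `/secret` in the app above), one approach is `requests.Session()`, which keeps the cookie set at login for later requests; note that the sample app does not actually guard `/secret` with the session flag, so this sketch only illustrates the pattern:
```python=
import requests

payload = {'username': 'amos', 'password': 'python'}

# A Session keeps the cookies from the login response,
# so follow-up requests are sent as the logged-in user.
with requests.Session() as s:
    s.post('http://localhost:9000/login', data=payload)
    r = s.get('http://localhost:9000/secret')
    print(r.status_code)
    print(r.text)
```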
## Notes
- To let other computers reach this site, first check this computer's IP address
    - ipconfig (Windows)
    - ifconfig (Linux)
- Open the chosen port in the firewall (9000 here)
- To make the site reachable from outside the local network, set up forwarding from the external fixed IP to this computer
----
## Further reading
[網頁開發人員應了解的 HTTP 狀態碼](https://blog.miniasp.com/post/2009/01/16/Web-developer-should-know-about-HTTP-Status-Code.aspx)
[[Python] Requests 教學](http://zwindr.blogspot.com/2016/08/python-requests.html)
[給初學者的 Python 網頁爬蟲與資料分析 (1) 前言](http://blog.castman.net/%E6%95%99%E5%AD%B8/2016/12/19/python-data-science-tutorial-1.html)
[Python 使用 Beautiful Soup 抓取與解析網頁資料,開發網路爬蟲教學](https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/)
[[Python] Beautifulsoup4 教學](http://zwindr.blogspot.com/2016/08/python-beautifulsoup4.html)
[Beautiful Soup 4 Cheatsheet](http://akul.me/blog/2016/beautifulsoup-cheatsheet/)