---
title: NLP - Crawling the 『博碩士論文加值系統』 with Selenium
tags: self-learning, NLP
---

{%hackmd BkVfcTxlQ %}

# **_NLP - Crawling the 『博碩士論文加值系統』 with Selenium_**
> [name=BessyHuang] [time=Tues, Apr 21, 2020]

# **Course Outline**
[TOC]

:::warning
**_Reference:_**
* [Python - Getting Started With Selenium WebDriver on Ubuntu/Debian](https://dzone.com/articles/python-getting-started)
* [Ubuntu下配置Selenium运行环境](https://www.itfanr.cc/2016/10/19/configuration-the-selenium-running-environment-in-ubuntu/)
* [使用 Selenium IDE 進行網頁自動化測試](https://tpu.thinkpower.com.tw/tpu/articleDetails/1846)
* [Day13-網路爬蟲實作II selenium 模擬瀏覽器](https://ithelp.ithome.com.tw/articles/10222029)
* [selenium之 搞定checkbox、radiobox](https://blog.csdn.net/huilan_same/article/details/52287955)
* [Python 學習筆記 : Selenium 模組瀏覽器自動化測試 (二)](http://yhhuang1966.blogspot.com/2018/05/python-selenium_27.html)
* [Selenium-Python中文文檔](https://selenium-python-zh.readthedocs.io/en/latest/locating-elements.html)
* [Selenium如何處理CheckBox (Python篇)](https://www.qa-knowhow.com/?p=4258)
:::

---

## **(Ubuntu) Installing and Configuring Selenium**
* Install selenium
    ```
    $ sudo apt-get update
    $ pip3 install selenium
    ```
* Install browser driver
    * Firefox
        * [Download geckodriver: geckodriver-v0.26.0-linux64.tar.gz](https://github.com/mozilla/geckodriver/releases)
        * Extract the archive, move geckodriver into a directory on the PATH (/usr/local/bin/), and make it executable
            ```
            $ sudo mv ./geckodriver /usr/local/bin/
            $ sudo chmod a+x /usr/local/bin/geckodriver
            ```
    * Chrome
        * [Download ChromeDriver](https://chromedriver.chromium.org/downloads) (pick the release that matches your Chrome version)
            > [How do I check my Chrome version?](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)
            > ![](https://i.imgur.com/AEkiwH3.png)
        * Move ChromeDriver into a directory on the PATH (/usr/local/bin/) and make it executable
            ```
            $ sudo mv ./chromedriver /usr/local/bin/
            $ sudo chmod a+x /usr/local/bin/chromedriver
            ```
* Testing if Browser Is Working
    > If the setup works, running either script below opens a browser window.
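    * Firefox
        ```python=
        from selenium import webdriver

        browser = webdriver.Firefox()
        browser.get('http://www.ubuntu.com/')
        ```
    * Chrome
        ```python=
        from selenium import webdriver

        browser = webdriver.Chrome()
        browser.get('http://www.ubuntu.com/')
        ```

---

## **Crawling Paper Titles**
### Crawler Code - 丰嘉
* Install BeautifulSoup4
    ```
    $ pip3 install beautifulsoup4
    ```
* Crawler code
    ```python=
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup as b4

    browser = webdriver.Firefox()
    browser.get('https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge')

    # Type "陳舜德" into the search box
    browser.find_element(By.ID, "ysearchinput0").send_keys("陳舜德")

    # Tick the "指導教授" (advisor) checkbox. Use only ONE of the two methods:
    # if both locate the same checkbox, running both clicks it twice and unticks it.
    ## Method 1: absolute XPath (breaks as soon as the page layout changes)
    # browser.find_element(By.XPATH, '/html/body/div[2]/table/tbody/tr[1]/td[2]/table/tbody/tr[4]/td/div[1]/form/table/tbody/tr[2]/td/table/tbody/tr[1]/td/table/tbody/tr[2]/td/input[3]').click()
    ## Method 2: locate the checkbox by its value attribute
    browser.find_element(By.XPATH, '//input[@value="ad"]').click()

    # Click the search button
    browser.find_element(By.ID, "gs32search").click()

    # Page source of the results page
    html = browser.page_source

    # Parse it with BeautifulSoup4 and print every title
    soup = b4(html, 'html.parser')
    paper_title = soup.select('#tablefmt1 span')
    for i in paper_title:
        print(i.text)
    ```
    ```
    Output:
    實體教具與虛擬教具對幼童學習影響之研究
    應用混成式學習於課後輔導之研究
    網路分類資源自動化擴展系統之研究
    基於翻轉學習概念之互動式教學平台架構研究
    建構共享共讀的圖書書庫自動化流通平台:以「愛的書庫」為例
    基於自動分類為基礎的圖書題名特徵擷取之研究-以輔助圖書分類系統為例
    數位典藏庫中資料分群產生之研究-以數位學習詮釋資料為例
    個人化知識表徵瀏覽模型
    ```

---

### Crawler Code - 譽錚
```python=
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the ChromeDriver executable.
# Selenium 4 style; Selenium 3 accepted webdriver.Chrome(driverPath) directly.
driverPath = r"D:\chromedriver.exe"
driver = webdriver.Chrome(service=Service(executable_path=driverPath))
driver.get("https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge")
print(driver.title)

# Search for "AI"
driver.find_element(By.ID, "ysearchinput0").send_keys("AI")
driver.find_element(By.ID, "gs32search").click()
print()

# Print every result on the page, then move to the next page until there is none
while True:
    try:
        elements = driver.find_elements(By.CLASS_NAME, "etd_d")
        for i in elements:
            print(i.text)
        driver.find_element(By.NAME, "gonext").click()
    except WebDriverException:
        # No usable "next page" button: treat this as the last page
        break
```

### Crawler Code, Smart Version - 譽錚
```python=
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Tick the checkboxes for the given field codes
def checkbox_check(dcf):
    # split() without arguments ignores repeated spaces and empty input
    for code in dcf.split():
        box = driver.find_element(By.CSS_SELECTOR, 'input[value="' + code + '"]')
        # ti is ticked by default, so only click boxes that are not selected yet
        if not box.is_selected():
            box.click()

kw = input("Enter the search term: ")
# Field codes of the checkboxes to tick
dcf = input("Enter the codes of the fields to tick, separated by spaces (ti is ticked by default):\n"
            " 論文名稱 (title): ti\n 研究生 (author): au\n 指導教授 (advisor): ad\n 口試委員 (committee member): say\n"
            " 關鍵詞 (keywords): kw\n 摘要 (abstract): ab\n 參考文獻 (references): rf\n 不限欄位 (all fields): ALLFIELD\n")

# Selenium 4 style; Selenium 3 accepted webdriver.Chrome(driverPath) directly.
driverPath = r"D:\chromedriver.exe"
driver = webdriver.Chrome(service=Service(executable_path=driverPath))
driver.get("https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge")

# Print the page title
print(driver.title)

driver.find_element(By.ID, "ysearchinput0").send_keys(kw)
checkbox_check(dcf)
driver.find_element(By.ID, "gs32search").click()

# Print every result on the page, then move to the next page until there is none
while True:
    try:
        elements = driver.find_elements(By.CLASS_NAME, "etd_d")
        for i in elements:
            print(i.text)
        driver.find_element(By.NAME, "gonext").click()
    except WebDriverException:
        # No usable "next page" button: treat this as the last page
        break
```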
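
The last crawler passes whatever the user types straight into a CSS selector, so a typo such as `xx` only surfaces as a Selenium error once the browser is already open. Below is a minimal sketch of validating the field codes first. It needs no browser, and the names `FIELD_CODES`, `parse_field_codes`, and `to_css_selector` are hypothetical helpers introduced here for illustration; the set of codes is copied from the input prompt above and is assumed to be complete.

```python
# Field codes as listed in the crawler's input prompt (assumed to be complete).
FIELD_CODES = {"ti", "au", "ad", "say", "kw", "ab", "rf", "ALLFIELD"}


def parse_field_codes(raw):
    """Split the user's input on whitespace into (valid, unknown) code lists."""
    valid, unknown = [], []
    for code in raw.split():
        if code in FIELD_CODES:
            if code not in valid:  # drop duplicates, keep first-seen order
                valid.append(code)
        else:
            unknown.append(code)
    return valid, unknown


def to_css_selector(code):
    """Build the CSS selector that locates the checkbox for one field code."""
    return 'input[value="{}"]'.format(code)


if __name__ == "__main__":
    valid, unknown = parse_field_codes("ad  kw xx ad")
    print(valid)    # ['ad', 'kw']
    print(unknown)  # ['xx']
    print([to_css_selector(c) for c in valid])
```

One possible use is to call `parse_field_codes(dcf)` right after the `input()` prompt, warn about anything in `unknown`, and pass only the `valid` codes to `checkbox_check`.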