---
title: NLP - Scraping the 『博碩士論文加值系統』 (National Digital Library of Theses and Dissertations in Taiwan) with Selenium
tags: self-learning, NLP
---
{%hackmd BkVfcTxlQ %}
# **_NLP - Scraping the 『博碩士論文加值系統』 (National Digital Library of Theses and Dissertations in Taiwan) with Selenium_**
> [name=BessyHuang] [time=Tues, Apr 21, 2020]
# **Course Outline**
[TOC]
:::warning
**_Reference:_**
* [Python - Getting Started With Selenium WebDriver on Ubuntu/Debian](https://dzone.com/articles/python-getting-started)
* [Ubuntu下配置Selenium运行环境](https://www.itfanr.cc/2016/10/19/configuration-the-selenium-running-environment-in-ubuntu/)
* [使用 Selenium IDE 進行網頁自動化測試](https://tpu.thinkpower.com.tw/tpu/articleDetails/1846)
* [Day13-網路爬蟲實作II selenium 模擬瀏覽器](https://ithelp.ithome.com.tw/articles/10222029)
* [selenium之 搞定checkbox、radiobox](https://blog.csdn.net/huilan_same/article/details/52287955)
* [Python 學習筆記 : Selenium 模組瀏覽器自動化測試 (二)](http://yhhuang1966.blogspot.com/2018/05/python-selenium_27.html)
* [Selenium-Python中文文檔](https://selenium-python-zh.readthedocs.io/en/latest/locating-elements.html)
* [Selenium如何處理CheckBox (Python篇)](https://www.qa-knowhow.com/?p=4258)
:::
---
## **(Ubuntu) Installing and Configuring Selenium**
* Install selenium
```
$ sudo apt-get update
$ pip3 install selenium
```
* Install the browser driver
    * Firefox
        * [Download geckodriver: geckodriver-v0.26.0-linux64.tar.gz](https://github.com/mozilla/geckodriver/releases)
        * After extracting the archive, move geckodriver into a directory on the PATH (/usr/local/bin/) and make it executable
            ```
            $ sudo mv ./geckodriver /usr/local/bin/
            $ sudo chmod a+x /usr/local/bin/geckodriver
            ```
    * Chrome
        * [Download ChromeDriver](https://chromedriver.chromium.org/downloads), choosing the release that matches your installed Chrome version
            > [How do I check my Chrome version?](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)
        * After extracting the download, move ChromeDriver into a directory on the PATH (/usr/local/bin/) and make it executable
            ```
            $ sudo mv ./chromedriver /usr/local/bin/
            $ sudo chmod a+x /usr/local/bin/chromedriver
            ```
* Test that the browser works
    > If this succeeds, a browser window opens and loads the page.

    * Firefox
        ```python=
        from selenium import webdriver
        browser = webdriver.Firefox()
        browser.get('http://www.ubuntu.com/')
        ```
    * Chrome
        ```python=
        from selenium import webdriver
        browser = webdriver.Chrome()
        browser.get('http://www.ubuntu.com/')
        ```
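If the browser does not open, a common cause is that the driver executable is not on the PATH. The installation steps above can be checked with a minimal sketch like the following. It uses only the standard library; the helper names `find_driver` and `report_drivers` are made up for this note and are not part of Selenium.

```python
import shutil


def find_driver(name):
    """Return the full path of an executable found on the PATH, or None."""
    return shutil.which(name)


def report_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its location, or None when it is not on the PATH."""
    return {name: find_driver(name) for name in names}


if __name__ == "__main__":
    # Print one line per driver so a missing one is easy to spot.
    for name, path in report_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```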
---
## **Scraping Thesis Titles**
### Scraper Code - 丰嘉
* Install BeautifulSoup4
```
$ pip3 install beautifulsoup4
```
* Scraper code
```python=
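from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as b4

browser = webdriver.Firefox()
browser.get('https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge')
# Type "陳舜德" into the search box
# (find_element(By.…) replaces the find_element_by_* helpers removed in Selenium 4)
browser.find_element(By.ID, "ysearchinput0").send_keys("陳舜德")
# Tick the "advisor" (指導教授) checkbox. Use only ONE of the two methods:
# clicking the same checkbox twice would untick it again.
## Method 1: absolute XPath (brittle, breaks as soon as the page layout changes)
# browser.find_element(By.XPATH, '/html/body/div[2]/table/tbody/tr[1]/td[2]/table/tbody/tr[4]/td/div[1]/form/table/tbody/tr[2]/td/table/tbody/tr[1]/td/table/tbody/tr[2]/td/input[3]').click()
## Method 2: match the checkbox by its value attribute
browser.find_element(By.XPATH, '//input[@value="ad"]').click()
# Click the search button
browser.find_element(By.ID, "gs32search").click()
# Page source of the result page
html = browser.page_source
# Parse with BeautifulSoup4
soup = b4(html, 'html.parser')
paper_title = soup.select('#tablefmt1 span')
for i in paper_title:
    print(i.text)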
```
```
Output:
實體教具與虛擬教具對幼童學習影響之研究
應用混成式學習於課後輔導之研究
網路分類資源自動化擴展系統之研究
基於翻轉學習概念之互動式教學平台架構研究
建構共享共讀的圖書書庫自動化流通平台:以「愛的書庫」為例
基於自動分類為基礎的圖書題名特徵擷取之研究-以輔助圖書分類系統為例
數位典藏庫中資料分群產生之研究-以數位學習詮釋資料為例
個人化知識表徵瀏覽模型
```
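The scraper above shows two ways of locating the advisor checkbox: a long positional XPath and a short attribute-based one. The difference can be illustrated without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The `FORM` markup is a hypothetical, simplified stand-in, not the real NDLTD page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search form (not the real NDLTD markup).
FORM = """
<form>
  <table>
    <tr>
      <td>
        <input type="checkbox" value="ti" />
        <input type="checkbox" value="au" />
        <input type="checkbox" value="ad" />
      </td>
    </tr>
  </table>
</form>
"""

root = ET.fromstring(FORM)

# Method 1 style: a positional path that spells out every level of nesting.
by_position = root.find("./table/tr/td/input[3]")
# Method 2 style: search at any depth for the input whose value is "ad".
by_attribute = root.find(".//input[@value='ad']")
print(by_position.get("value"), by_attribute.get("value"))  # ad ad

# Wrap the table in one extra <div>, as a page redesign might do.
WRAPPED = FORM.replace("<table>", "<div><table>").replace("</table>", "</table></div>")
root2 = ET.fromstring(WRAPPED)

# The positional path no longer matches anything, the attribute search still does.
broken = root2.find("./table/tr/td/input[3]")
still_found = root2.find(".//input[@value='ad']")
print(broken)                    # None
print(still_found.get("value"))  # ad
```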
---
### Scraper Code - 譽錚
```python=
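from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the ChromeDriver executable (Windows example)
driverPath = r"D:\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driverPath))
driver.get("https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge")
print(driver.title)
driver.find_element(By.ID, "ysearchinput0").send_keys("AI")
driver.find_element(By.ID, "gs32search").click()
print()
# Print the entries on each result page, then move on to the next page
while True:
    try:
        elements = driver.find_elements(By.CLASS_NAME, "etd_d")
        for i in elements:
            print(i.text)
        driver.find_element(By.NAME, "gonext").click()
    except WebDriverException:
        # No usable "next page" button: the last page has been reached
        break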
```
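The pagination loop above mixes three concerns: reading a page, moving to the next one, and deciding when to stop. As a rough sketch (not the author's method), the loop can be pulled out into a generator that is testable without a browser. The names `iter_pages` and `FakeSite` and the `max_pages` cap are assumptions made up for this example.

```python
def iter_pages(read_page, go_next, max_pages=100):
    """Yield the items of each result page until go_next() reports no further page.

    read_page: callable returning the list of items on the current page.
    go_next:   callable returning True after moving to the next page,
               or False when the current page is the last one.
    max_pages: safety cap so a site that always offers "next" cannot loop forever.
    """
    for _ in range(max_pages):
        yield read_page()
        if not go_next():
            break


class FakeSite:
    """In-memory stand-in for a paginated result list (hypothetical test double)."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_page(self):
        return self.pages[self.index]

    def go_next(self):
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


site = FakeSite([["title A", "title B"], ["title C"]])
titles = [t for page in iter_pages(site.read_page, site.go_next) for t in page]
print(titles)  # ['title A', 'title B', 'title C']
```

With Selenium, `read_page` would wrap the `etd_d` lookup and `go_next` would wrap the click on `gonext`, returning `False` when the button cannot be found.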
### Scraper Code, Smart Version - 譽錚
```python=
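from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Tick the checkbox for each field code in the space-separated string
def checkbox_check(dcf):
    # split() without arguments ignores repeated spaces and empty input
    for i in dcf.split():
        driver.find_element(By.CSS_SELECTOR, 'input[value="' + i + '"]').click()

kw = input("Enter the search term: ")
# Field codes of the checkboxes to tick
dcf = input(
    "Enter the codes of the fields to tick, separated by spaces (ti is ticked by default):\n"
    " Thesis title: ti\n Graduate student: au\n Advisor: ad\n"
    " Oral defense committee member: say\n"
    " Keyword: kw\n Abstract: ab\n References: rf\n All fields: ALLFIELD\n"
)
# Path to the ChromeDriver executable (Windows example)
driverPath = r"D:\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driverPath))
driver.get("https://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dwebmge")
# Print the page title
print(driver.title)
driver.find_element(By.ID, "ysearchinput0").send_keys(kw)
checkbox_check(dcf)
driver.find_element(By.ID, "gs32search").click()
# Print the entries on each result page, then move on to the next page
while True:
    try:
        elements = driver.find_elements(By.CLASS_NAME, "etd_d")
        for i in elements:
            print(i.text)
        driver.find_element(By.NAME, "gonext").click()
    except WebDriverException:
        # No usable "next page" button: the last page has been reached
        break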
```
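The field codes typed by the user go straight into a CSS selector, so a typo only shows up as a Selenium error after the browser has opened. A minimal sketch of checking the input first is shown below. The code table is taken from the prompt text above; the helper names `parse_field_codes` and `checkbox_selector` are assumptions made up for this example.

```python
# Field codes listed in the prompt of the script above.
FIELD_CODES = {
    "ti": "thesis title",
    "au": "graduate student",
    "ad": "advisor",
    "say": "oral defense committee member",
    "kw": "keyword",
    "ab": "abstract",
    "rf": "references",
    "ALLFIELD": "all fields",
}


def parse_field_codes(raw):
    """Split the input on whitespace, drop duplicates, and reject unknown codes."""
    codes = []
    for code in raw.split():
        if code not in FIELD_CODES:
            raise ValueError(f"unknown field code: {code!r}")
        if code not in codes:
            codes.append(code)
    return codes


def checkbox_selector(code):
    """Build the CSS selector for the checkbox with the given value attribute."""
    return f'input[value="{code}"]'


print(parse_field_codes("  ad  kw ad "))  # ['ad', 'kw']
print(checkbox_selector("ad"))            # input[value="ad"]
```

Dropping duplicates matters here because each click toggles the checkbox, so a code entered twice would tick and then untick the same box.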