## Scraping product data from PChome and saving it as CSV
The data includes:
1. Name
1. Description
1. Price
#### Initial setup
Use Selenium with `By` locator methods to find the target elements.
```python=
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

# Block browser notification pop-ups (the Chrome pref key is 'notifications')
opt = webdriver.ChromeOptions()
opt.add_experimental_option(
    'prefs',
    {'profile.default_content_setting_values': {'notifications': 2}})
browser = webdriver.Chrome(options=opt)

browser.get('https://ecshweb.pchome.com.tw/search/v3.3/?q=book&scope=all')
sleep(3)  # give the search results time to render
```
#### Scrape product names, descriptions, and prices
```python=
# Product names
product_name = browser.find_elements(By.CLASS_NAME, "prod_name")
p_name = [el.text for el in product_name]

# Product descriptions
product_info = browser.find_elements(By.CLASS_NAME, "nick")
p_info = [el.text for el in product_info]

# Prices, with a '$' sign prepended to each value
product_price = browser.find_elements(By.CLASS_NAME, "value")
p_price = ['$' + el.text for el in product_price]

# Drop the first six ".value" matches, which do not belong to the product list
p_price = p_price[6:]
```
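The three lists are built independently, so it is worth checking that they line up before exporting. A minimal sketch with stand-in values (the names and prices below are illustrative, not real PChome results):

```python=
# Stand-in scraped results; in the real script these come from find_elements
p_name = ["Book A", "Book B"]
p_info = ["A nice book", "Another nice book"]
p_price = ["$100", "$200"]

# pandas raises a ValueError if the columns differ in length
assert len(p_name) == len(p_info) == len(p_price)

# Preview the paired rows before building the DataFrame
for name, info, price in zip(p_name, p_info, p_price):
    print(f"{name} | {info} | {price}")
```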
#### Export the file
Store the results in a dictionary and convert it to a pandas DataFrame, so it can be saved as a CSV file.
```python=
result = {"Product Name": p_name,
          "Product Info": p_info,
          "Product Price": p_price}
result_df = pd.DataFrame(result)
result_df.index = result_df.index + 1  # number rows from 1 instead of 0
# print(result_df)
result_df.to_csv("Q1.csv", encoding='utf_8_sig')  # utf_8_sig so Excel reads non-ASCII text correctly
```
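To check that the export round-trips, the same DataFrame can be written to an in-memory buffer and read back (a sketch with one made-up row; `io.StringIO` stands in for the real `Q1.csv`):

```python=
import io
import pandas as pd

# One made-up row standing in for the scraped results
df = pd.DataFrame({"Product Name": ["Book A"],
                   "Product Info": ["A nice book"],
                   "Product Price": ["$100"]})
df.index = df.index + 1  # same 1-based numbering as above

buf = io.StringIO()
df.to_csv(buf)  # encoding only matters when writing to a real file path
buf.seek(0)
back = pd.read_csv(buf, index_col=0)
print(back.loc[1, "Product Name"])  # → Book A
```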
## Using `re` to filter text matching a specific condition
From the whole CNN page, extract every string of two consecutive words where each word starts with an uppercase letter and the rest are lowercase.
Ex:
- User Account (O)
- CNN Website (X)
- log out (X)
#### Fetch the page and filter out the matching strings
```python=
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://edition.cnn.com/world/asia")
soup = BeautifulSoup(page.text, 'html.parser')
text = str(soup)  # flatten the parsed page back into one string for regex matching
# Two consecutive words, each capitalized then all lowercase
q = re.findall(r'[A-Z][a-z]+\s[A-Z][a-z]+', text)
```
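The pattern can be verified against the three examples above before running it on the full page:

```python=
import re

pattern = r'[A-Z][a-z]+\s[A-Z][a-z]+'

# "User Account" matches; "CNN Website" fails because "CNN" is not
# capitalized-then-lowercase; "log out" fails because it has no capitals
for s in ["User Account", "CNN Website", "log out"]:
    print(s, bool(re.fullmatch(pattern, s)))
```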
#### Tally occurrences in a dictionary and sort by count
```python=
counts = {}  # plain dict used as a counter; avoids shadowing the built-in `dict`
for key in q:
    counts[key] = counts.get(key, 0) + 1

# Sort by count, most frequent first
sorted_dict = sorted(counts.items(), key=lambda x: x[1], reverse=True)
```
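The same tally-and-sort can be done with `collections.Counter` from the standard library; its `most_common()` already returns `(phrase, count)` pairs sorted by descending count. A sketch with stand-in matches (the phrases below are illustrative, not actual CNN output):

```python=
from collections import Counter

# Stand-in for the `q` list produced by re.findall above
q = ["Hong Kong", "User Account", "Hong Kong", "North Korea", "Hong Kong"]

sorted_dict = Counter(q).most_common()
print(sorted_dict[0])  # → ('Hong Kong', 3)
```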
#### Save as a text file
```python=
with open("result.txt", "w", encoding="utf-8") as file:
    for phrase, count in sorted_dict:
        file.write(f"{phrase}: {count}\n")
```