## Scraping product data from PChome and saving it as CSV

The data collected includes:
1. Name
2. Description
3. Price

#### Initial setup

Use selenium with the `By` locator methods to find the data on the page.

```python=
import pandas as pd
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.by import By

opt = webdriver.ChromeOptions()
# Block browser notification pop-ups (the pref key is 'notifications')
opt.add_experimental_option('prefs', {'profile.default_content_setting_values': {'notifications': 2}})
browser = webdriver.Chrome(options=opt)
browser.get('https://ecshweb.pchome.com.tw/search/v3.3/?q=book&scope=all')
sleep(3)  # wait for the search results to render
```

#### Scraping product name, description, and price

```python=
product_name = browser.find_elements(By.CLASS_NAME, "prod_name")
p_name = [elem.text for elem in product_name]

product_info = browser.find_elements(By.CLASS_NAME, "nick")
p_info = [elem.text for elem in product_info]

product_price = browser.find_elements(By.CLASS_NAME, "value")
p_price = ['$' + elem.text for elem in product_price]
# The first six "value" elements on the page are not product prices, so drop them
p_price = p_price[6:]
```

#### Exporting the file

Store the results in a dictionary and convert it to a pandas DataFrame so it can be saved as a CSV file.

```python=
result = {"Product Name": p_name, "Product Info": p_info, "Product Price": p_price}
result_df = pd.DataFrame(result)
result_df.index = result_df.index + 1  # start the index at 1
# print(result_df)
result_df.to_csv("Q1.csv", encoding='utf_8_sig')
```

## Filtering text that matches a condition with re

From the whole CNN page, extract every pair of consecutive words where each word starts with an uppercase letter and the rest are lowercase.

Ex: User Account (O), CNN Website (X), log out (X)

#### Fetching the page and filtering matching text

```python=
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://edition.cnn.com/world/asia")
soup = BeautifulSoup(page.text, 'html.parser')
text = str(soup)
q = re.findall(r'([A-Z][a-z]+\s[A-Z][a-z]+)', text)
```

#### Counting occurrences and sorting into a dictionary

```python=
counts = {}  # a plain name, to avoid shadowing the built-in `dict`
for key in q:
    counts[key] = counts.get(key, 0) + 1
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
```

#### Saving to a text file

```python=
with open("result.txt", "w", encoding="utf-8") as file:
    for phrase, count in sorted_counts:
        file.write(phrase + ': ' + str(count) + '\n')
```
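
The regex used above can be checked on a small sample string before running it against the full page; the sample below reuses the example phrases from the text (O/X cases):

```python
import re

# Two consecutive capitalized words: an uppercase letter followed by lowercase letters
pattern = r'([A-Z][a-z]+\s[A-Z][a-z]+)'

sample = "User Account CNN Website log out"
matches = re.findall(pattern, sample)
print(matches)  # ['User Account'] — 'CNN Website' fails because 'CNN' has no lowercase letters
```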
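
The manual counting-and-sorting step can also be done with the standard library's `collections.Counter`, whose `most_common()` already returns the pairs sorted by descending count. This is an alternative sketch with made-up phrases, not the code from the walkthrough above:

```python
from collections import Counter

phrases = ['Hong Kong', 'User Account', 'Hong Kong']  # hypothetical regex matches
counts = Counter(phrases)
print(counts.most_common())  # [('Hong Kong', 2), ('User Account', 1)]
```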
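
One detail worth noting about the CSV step: `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which is why the price list has to be trimmed to match before building the frame. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical scraped results; all three lists must have the same length
p_name = ['Book A', 'Book B']
p_info = ['Intro A', 'Intro B']
p_price = ['$100', '$200']

result_df = pd.DataFrame({"Product Name": p_name,
                          "Product Info": p_info,
                          "Product Price": p_price})
result_df.index = result_df.index + 1  # make the index start at 1
print(result_df.shape)  # (2, 3)
```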