--- disqus: hackmd --- ETL Assignment === > 用ETL開啟我的第一篇技術文章,也是監視自我的成長的第一篇文。 事先說明一下這個ETL題目是來自於我在Cousera線上課程平台裡的IBM Data Engineering Professional Certificate 中的 Python Project for Data Engineering 的期末作業。 題目主要要求編寫一個簡單的ETL python程式,這次題目的ETL程式可以拆解成,Extract,Transform,Load還有Logging,四大部分進行編寫。 #### 題目要求: Objectives In this final part you will: * Run the ETL process. * Extract bank and market cap data from the JSON file bank_market_cap.json. * Transform the market cap currency using the exchange rate data. * Load the transformed data into a seperate CSV. ## 事前準備: * 安裝或更新專案裡需要的套件,如:glob, pandas, requests, 和 datetime. * 導入所需要的模塊。 ```python= !pip install glob !pip install pandas !pip install requests !pip install datetime ``` <br>![](https://i.imgur.com/eB78RDK.png) ```python= import glob import pandas as pd from datetime import datetime ``` * 下載課程所提供的資料。 ```python= !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-Skil ``` <br>![](https://i.imgur.com/Ju4zyyr.png) ## 1. Extract #### 題目要求: Extract Function Define the extract function that finds JSON file bank_market_cap_1.json and calls the function created above to extract data from them. Store the data in a pandas dataframe. Use the following list for the columns 編寫Json,CSV的Extract function. ```python= def extract_from_json(file_to_process): dataframe = pd.read_json(file_to_process) return dataframe ``` ```python= def extract_from_csv(file_to_process): dataframe = pd.read_csv(file_to_process) return dataframe ``` * 把bank_market_cap_1.json資料導入到dataframe。 ```python= columns=['Name','Market Cap (US$ Billion)'] def extract(): extract_data = pd.DataFrame(columns) extract_data = extract_data.append(extract_from_json('bank_market_cap_1.json'), ignore_index=True) extract_data = extract_data.drop([0], axis=1) extract_data = extract_data.drop([0, 1], axis=0) extract_data = extract_data.reset_index() extract_data = extract_data.drop(['index'], axis=1) return extract_data ``` * 執行函數,看下輸出結果是否正確。 ```python= extract() ``` <br>![](https://i.imgur.com/Hl04CBK.png) ## 2. Transform #### 題目要求: Load the file exchange_rates.csv as a dataframe and find the exchange rate for British pounds with the symbol GBP, store it in the variable exchange_rate, you will be asked for the number. Hint: set the parameter index_col to 0. * 使用DataFrame把匯率欄的‘Unnamed: 0’更名為currency及設為index;使用loc提取GBP的匯率。 ```python= def exchange_rates(currency): exchange_rate = extract_from_csv('exchange_rates.csv') exchange_rate.rename(columns = {'Unnamed: 0':'Currency'}, inplace = True) exchange_rate = exchange_rate.set_index('Currency') return exchange_rate.loc[[currency]] exchange_rates('GBP') ``` <br>![](https://i.imgur.com/EyaOeTX.png) #### 題目要求: Using exchange_rate and the exchange_rates.csv file find the exchange rate of USD to GBP. Write a transform function that 1. Changes the Market Cap (USD Billion) column from USD to GBP. 2. Rounds the Market Cap (USD Billion)` column to 3 decimal places. 3. Rename Market Cap (USB Billion) to Market Cap (GBP Billion). * 把以美金為單位的市值(Market Cap)轉換成使用英鎊為單位的市值。 ```python= def transform(bank_market_cap, exchange_rates): bank_market_cap['Market Cap (GBP$ Billion)'] = bank_market_cap['Market Cap (US$ Billion)'].mul(float(exchange_rates['Rates'])) bank_market_cap = bank_market_cap.drop('Market Cap (US$ Billion)', axis=1) return bank_market_cap # Write your code here transform(extract(), exchange_rates('GBP')) ``` <br>![](https://i.imgur.com/dlI9NBS.png) ## 3. Load #### 題目要求: Create a function that takes a dataframe and load it to a csv named `bank_market_cap_gbp.csv`. Make sure to set `index` to `False`. * 編寫load函數把轉換好的資料存成CSV檔。 ```python= def load(targetfile, data_to_load): data_to_load.to_csv(targetfile, index=False) ``` ## 4. Logging Function * logging函數是一個用來記錄ETL程式執行的每個步驟是否有順利完成。 ```python= def log(message): timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second now = datetime.now() # get current timestamp timestamp = now.strftime(timestamp_format) with open("logfile.txt","a") as f: f.write(timestamp + ',' + message + '\n') ``` ## 5. Running the ETL Process * 執行logging記錄Extract程式是否有順利進行。 ```python= log("ETL Job Started") log("Extract phase Started") ``` ```python= extracted_data = extract() extracted_data.head() ``` <br>![](https://i.imgur.com/4w5A8Or.png) ```python= log("Extract phase Ended") ``` * 執行logging記錄Transform程式是否有順利進行。 ```python= log("Transform phase Started") ``` ```python= transformed_data = transform(extracted_data, exchange_rates('GBP')) transformed_data.head(5) ``` <br>![](https://i.imgur.com/EO3dap7.png) ```python= log("Transform phase Ended") ``` * 執行logging記錄Load程式是否有順利進行。 ```python= log("Load phase Started") load("bank_market_cap_gbp.csv", transformed_data) log("Load phase Ended") ``` <br> ## 總結: > 以上是我對解答這個題目的想法,畢竟還是一隻大菜鳥,所以還有程式優化和文章表達不足的地方;歡迎大大們的指教,謝謝。