ETL Assignment

用ETL開啟我的第一篇技術文章,也是監視自我的成長的第一篇文。 事先說明一下這個ETL題目是來自於我在Cousera線上課程平台裡的IBM Data Engineering Professional Certificate 中的 Python Project for Data Engineering 的期末作業。 題目主要要求編寫一個簡單的ETL python程式,這次題目的ETL程式可以拆解成,Extract,Transform,Load還有Logging,四大部分進行編寫。

題目要求:

Objectives In this final part you will:

  • Run the ETL process.
  • Extract bank and market cap data from the JSON file bank_market_cap.json.
  • Transform the market cap currency using the exchange rate data.
  • Load the transformed data into a seperate CSV.

事前準備:

  • 安裝或更新專案裡需要的套件,如:glob, pandas, requests, 和 datetime.
  • 導入所需要的模塊。
!pip install glob !pip install pandas !pip install requests !pip install datetime


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

import glob import pandas as pd from datetime import datetime
  • 下載課程所提供的資料。
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-Skil


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

1. Extract

題目要求:

Extract Function
Define the extract function that finds JSON file bank_market_cap_1.json and calls the function created above to extract data from them. Store the data in a pandas dataframe. Use the following list for the columns

編寫Json,CSV的Extract function.

def extract_from_json(file_to_process): dataframe = pd.read_json(file_to_process) return dataframe
def extract_from_csv(file_to_process): dataframe = pd.read_csv(file_to_process) return dataframe
  • 把bank_market_cap_1.json資料導入到dataframe。
columns=['Name','Market Cap (US$ Billion)'] def extract(): extract_data = pd.DataFrame(columns) extract_data = extract_data.append(extract_from_json('bank_market_cap_1.json'), ignore_index=True) extract_data = extract_data.drop([0], axis=1) extract_data = extract_data.drop([0, 1], axis=0) extract_data = extract_data.reset_index() extract_data = extract_data.drop(['index'], axis=1) return extract_data
  • 執行函數,看下輸出結果是否正確。
extract()


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

2. Transform

題目要求:

Load the file exchange_rates.csv as a dataframe and find the exchange rate for British pounds with the symbol GBP, store it in the variable exchange_rate, you will be asked for the number. Hint: set the parameter index_col to 0.

  • 使用DataFrame把匯率欄的‘Unnamed: 0’更名為currency及設為index;使用loc提取GBP的匯率。
def exchange_rates(currency): exchange_rate = extract_from_csv('exchange_rates.csv') exchange_rate.rename(columns = {'Unnamed: 0':'Currency'}, inplace = True) exchange_rate = exchange_rate.set_index('Currency') return exchange_rate.loc[[currency]] exchange_rates('GBP')


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

題目要求:

Using exchange_rate and the exchange_rates.csv file find the exchange rate of USD to GBP. Write a transform function that

  1. Changes the Market Cap (USD Billion) column from USD to GBP.
  2. Rounds the Market Cap (USD Billion)` column to 3 decimal places.
  3. Rename Market Cap (USB Billion) to Market Cap (GBP Billion).
  • 把以美金為單位的市值(Market Cap)轉換成使用英鎊為單位的市值。
def transform(bank_market_cap, exchange_rates): bank_market_cap['Market Cap (GBP$ Billion)'] = bank_market_cap['Market Cap (US$ Billion)'].mul(float(exchange_rates['Rates'])) bank_market_cap = bank_market_cap.drop('Market Cap (US$ Billion)', axis=1) return bank_market_cap # Write your code here transform(extract(), exchange_rates('GBP'))


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

3. Load

題目要求:

Create a function that takes a dataframe and load it to a csv named bank_market_cap_gbp.csv. Make sure to set index to False.

  • 編寫load函數把轉換好的資料存成CSV檔。
def load(targetfile, data_to_load): data_to_load.to_csv(targetfile, index=False)

4. Logging Function

  • logging函數是一個用來記錄ETL程式執行的每個步驟是否有順利完成。
def log(message): timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second now = datetime.now() # get current timestamp timestamp = now.strftime(timestamp_format) with open("logfile.txt","a") as f: f.write(timestamp + ',' + message + '\n')

5. Running the ETL Process

  • 執行logging記錄Extract程式是否有順利進行。
log("ETL Job Started") log("Extract phase Started")
extracted_data = extract() extracted_data.head()


log("Extract phase Ended")
  • 執行logging記錄Transform程式是否有順利進行。
log("Transform phase Started")
transformed_data = transform(extracted_data, exchange_rates('GBP')) transformed_data.head(5)


log("Transform phase Ended")
  • 執行logging記錄Load程式是否有順利進行。
log("Load phase Started") load("bank_market_cap_gbp.csv", transformed_data) log("Load phase Ended")

總結:

以上是我對解答這個題目的想法,畢竟還是一隻大菜鳥,所以還有程式優化和文章表達不足的地方;歡迎大大們的指教,謝謝。