ETL Assignment

用ETL開啟我的第一篇技術文章，也是監視自我的成長的第一篇文。事先說明一下這個ETL題目是來自於我在Cousera線上課程平台裡的IBM Data Engineering Professional Certificate 中的 Python Project for Data Engineering 的期末作業。題目主要要求編寫一個簡單的ETL python程式，這次題目的ETL程式可以拆解成，Extract，Transform，Load還有Logging，四大部分進行編寫。

題目要求：

Objectives In this final part you will:

Run the ETL process.
Extract bank and market cap data from the JSON file bank_market_cap.json.
Transform the market cap currency using the exchange rate data.
Load the transformed data into a seperate CSV.

事前準備：

安裝或更新專案裡需要的套件，如：glob, pandas, requests, 和 datetime.
導入所需要的模塊。




!pip install glob
!pip install pandas
!pip install requests
!pip install datetime

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →



import glob
import pandas as pd
from datetime import datetime

下載課程所提供的資料。



!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-Skil

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

1. Extract

題目要求：

Extract Function
Define the extract function that finds JSON file bank_market_cap_1.json and calls the function created above to extract data from them. Store the data in a pandas dataframe. Use the following list for the columns

編寫Json，CSV的Extract function.



def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process)
    return dataframe



def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe

把bank_market_cap_1.json資料導入到dataframe。









columns=['Name','Market Cap (US$ Billion)']
def extract():
    extract_data = pd.DataFrame(columns)
    extract_data = extract_data.append(extract_from_json('bank_market_cap_1.json'), ignore_index=True)
    extract_data = extract_data.drop([0], axis=1)
    extract_data = extract_data.drop([0, 1], axis=0)
    extract_data = extract_data.reset_index()
    extract_data = extract_data.drop(['index'], axis=1)
    return extract_data

執行函數，看下輸出結果是否正確。


extract()

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

2. Transform

題目要求：

Load the file exchange_rates.csv as a dataframe and find the exchange rate for British pounds with the symbol GBP, store it in the variable exchange_rate, you will be asked for the number. Hint: set the parameter index_col to 0.

使用DataFrame把匯率欄的‘Unnamed: 0’更名為currency及設為index；使用loc提取GBP的匯率。







def exchange_rates(currency):
    exchange_rate = extract_from_csv('exchange_rates.csv')
    exchange_rate.rename(columns = {'Unnamed: 0':'Currency'}, inplace = True)
    exchange_rate = exchange_rate.set_index('Currency')
    return exchange_rate.loc[[currency]]
    
exchange_rates('GBP')

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

題目要求：

Using exchange_rate and the exchange_rates.csv file find the exchange rate of USD to GBP. Write a transform function that

Changes the Market Cap (USD Billion) column from USD to GBP.
Rounds the Market Cap (USD Billion)` column to 3 decimal places.
Rename Market Cap (USB Billion) to Market Cap (GBP Billion).

把以美金為單位的市值(Market Cap)轉換成使用英鎊為單位的市值。







def transform(bank_market_cap, exchange_rates):
    bank_market_cap['Market Cap (GBP$ Billion)'] = bank_market_cap['Market Cap (US$ Billion)'].mul(float(exchange_rates['Rates']))
    bank_market_cap = bank_market_cap.drop('Market Cap (US$ Billion)', axis=1)
    return bank_market_cap
    # Write your code here

transform(extract(), exchange_rates('GBP'))

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

3. Load

題目要求：

Create a function that takes a dataframe and load it to a csv named bank_market_cap_gbp.csv. Make sure to set index to False.

編寫load函數把轉換好的資料存成CSV檔。


def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile, index=False)

4. Logging Function

logging函數是一個用來記錄ETL程式執行的每個步驟是否有順利完成。






def log(message):
    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("logfile.txt","a") as f:
        f.write(timestamp + ',' + message + '\n')

5. Running the ETL Process

執行logging記錄Extract程式是否有順利進行。


log("ETL Job Started")
log("Extract phase Started")


extracted_data = extract()
extracted_data.head()


log("Extract phase Ended")

執行logging記錄Transform程式是否有順利進行。


log("Transform phase Started")


transformed_data = transform(extracted_data, exchange_rates('GBP'))
transformed_data.head(5)


log("Transform phase Ended")

執行logging記錄Load程式是否有順利進行。



log("Load phase Started")
load("bank_market_cap_gbp.csv", transformed_data)
log("Load phase Ended")

總結：

以上是我對解答這個題目的想法，畢竟還是一隻大菜鳥，所以還有程式優化和文章表達不足的地方；歡迎大大們的指教，謝謝。

ETL Assignment

題目要求：

事前準備：

1. Extract

題目要求：

2. Transform

題目要求：

題目要求：

3. Load

題目要求：

4. Logging Function

5. Running the ETL Process

總結：

Read more

PPT home sale crawler

Introduction to NoSQL Databases <br> WEEK_2 - Introducing MongoDB

IBM Data Engineering

My Data Kitchen