---
disqus: hackmd
---
ETL Assignment
===
> 用ETL開啟我的第一篇技術文章,也是監視自我的成長的第一篇文。 事先說明一下這個ETL題目是來自於我在Cousera線上課程平台裡的IBM Data Engineering Professional Certificate 中的 Python Project for Data Engineering 的期末作業。 題目主要要求編寫一個簡單的ETL python程式,這次題目的ETL程式可以拆解成,Extract,Transform,Load還有Logging,四大部分進行編寫。
#### 題目要求:
Objectives In this final part you will:
* Run the ETL process.
* Extract bank and market cap data from the JSON file bank_market_cap.json.
* Transform the market cap currency using the exchange rate data.
* Load the transformed data into a seperate CSV.
## 事前準備:
* 安裝或更新專案裡需要的套件,如:glob, pandas, requests, 和 datetime.
* 導入所需要的模塊。
```python=
!pip install glob
!pip install pandas
!pip install requests
!pip install datetime
```
<br>
```python=
import glob
import pandas as pd
from datetime import datetime
```
* 下載課程所提供的資料。
```python=
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-Skil
```
<br>
## 1. Extract
#### 題目要求:
Extract Function
Define the extract function that finds JSON file bank_market_cap_1.json and calls the function created above to extract data from them. Store the data in a pandas dataframe. Use the following list for the columns
編寫Json,CSV的Extract function.
```python=
def extract_from_json(file_to_process):
dataframe = pd.read_json(file_to_process)
return dataframe
```
```python=
def extract_from_csv(file_to_process):
dataframe = pd.read_csv(file_to_process)
return dataframe
```
* 把bank_market_cap_1.json資料導入到dataframe。
```python=
columns=['Name','Market Cap (US$ Billion)']
def extract():
extract_data = pd.DataFrame(columns)
extract_data = extract_data.append(extract_from_json('bank_market_cap_1.json'), ignore_index=True)
extract_data = extract_data.drop([0], axis=1)
extract_data = extract_data.drop([0, 1], axis=0)
extract_data = extract_data.reset_index()
extract_data = extract_data.drop(['index'], axis=1)
return extract_data
```
* 執行函數,看下輸出結果是否正確。
```python=
extract()
```
<br>
## 2. Transform
#### 題目要求:
Load the file exchange_rates.csv as a dataframe and find the exchange rate for British pounds with the symbol GBP, store it in the variable exchange_rate, you will be asked for the number. Hint: set the parameter index_col to 0.
* 使用DataFrame把匯率欄的‘Unnamed: 0’更名為currency及設為index;使用loc提取GBP的匯率。
```python=
def exchange_rates(currency):
exchange_rate = extract_from_csv('exchange_rates.csv')
exchange_rate.rename(columns = {'Unnamed: 0':'Currency'}, inplace = True)
exchange_rate = exchange_rate.set_index('Currency')
return exchange_rate.loc[[currency]]
exchange_rates('GBP')
```
<br>
#### 題目要求:
Using exchange_rate and the exchange_rates.csv file find the exchange rate of USD to GBP. Write a transform function that
1. Changes the Market Cap (USD Billion) column from USD to GBP.
2. Rounds the Market Cap (USD Billion)` column to 3 decimal places.
3. Rename Market Cap (USB Billion) to Market Cap (GBP Billion).
* 把以美金為單位的市值(Market Cap)轉換成使用英鎊為單位的市值。
```python=
def transform(bank_market_cap, exchange_rates):
bank_market_cap['Market Cap (GBP$ Billion)'] = bank_market_cap['Market Cap (US$ Billion)'].mul(float(exchange_rates['Rates']))
bank_market_cap = bank_market_cap.drop('Market Cap (US$ Billion)', axis=1)
return bank_market_cap
# Write your code here
transform(extract(), exchange_rates('GBP'))
```
<br>
## 3. Load
#### 題目要求:
Create a function that takes a dataframe and load it to a csv named `bank_market_cap_gbp.csv`. Make sure to set `index` to `False`.
* 編寫load函數把轉換好的資料存成CSV檔。
```python=
def load(targetfile, data_to_load):
data_to_load.to_csv(targetfile, index=False)
```
## 4. Logging Function
* logging函數是一個用來記錄ETL程式執行的每個步驟是否有順利完成。
```python=
def log(message):
timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second
now = datetime.now() # get current timestamp
timestamp = now.strftime(timestamp_format)
with open("logfile.txt","a") as f:
f.write(timestamp + ',' + message + '\n')
```
## 5. Running the ETL Process
* 執行logging記錄Extract程式是否有順利進行。
```python=
log("ETL Job Started")
log("Extract phase Started")
```
```python=
extracted_data = extract()
extracted_data.head()
```
<br>
```python=
log("Extract phase Ended")
```
* 執行logging記錄Transform程式是否有順利進行。
```python=
log("Transform phase Started")
```
```python=
transformed_data = transform(extracted_data, exchange_rates('GBP'))
transformed_data.head(5)
```
<br>
```python=
log("Transform phase Ended")
```
* 執行logging記錄Load程式是否有順利進行。
```python=
log("Load phase Started")
load("bank_market_cap_gbp.csv", transformed_data)
log("Load phase Ended")
```
<br>
## 總結:
> 以上是我對解答這個題目的想法,畢竟還是一隻大菜鳥,所以還有程式優化和文章表達不足的地方;歡迎大大們的指教,謝謝。