[Starlike|摩曼頓 x AWS Immersion Day (Feb 2)]
===
# Related Links
Hackmd 短網址: https://pse.is/4r5ckk
Event Engine: https://dashboard.eventengine.run/login?hash=7ede-158fcd97c4-75
CSAT: https://survey.immersionday.com/UULRBN04g
# Agenda
|Time|Topic|Course content|Speaker|
|-|-----|--------------|------|
|09:00~09:40|Introduction to Amazon Cloud & Security|AWS Services, EC2, RDS, IAM and Shared Responsibility Model|AWS Dennis|
|09:40~10:20|Lab: Prepare the Lab environment|Management Console, Region, etc|AWS Dennis|
|10:20~10:30|Break|||
|10:30~10:50|Amazon AppFlow Introduction| |AWS Dennis|
|10:50~11:50|Lab: Salesforce to Redshift|Create a Flow to ingest data from Salesforce to Redshift|AWS Dennis|
|12:00~13:00|Lunch|||
|13:00~13:30|Modern Data Strategy||AWS Gavin|
|13:30~14:10|AWS Glue Introduction|Glue, Glue Studio, Glue Data Catalog|AWS Gavin|
|14:10~14:50|Lab: Working with Glue|Working with Glue Data Catalog|NL Alice|
|14:50~15:00|Break|||
|15:00~15:40|Lab: Working with Glue|Working with Glue Studio|NL Alice|
|15:40~16:20|AWS Database Migration Service Introduction|With demo|NL Alice|
|16:20~17:00|Q&A|||
# Start with AWS EventEngine (Workshop Studio)
* Log in to the EventEngine environment with the short URL the instructor provides; for detailed steps see [Log in with AWS Workshop Portal](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/preparation-guide/20-event-engine).
## Deploy the CloudFormation Stacks in Advance
To save class time later, create all the required CloudFormation stacks here first:
* AppFlow Workshop: download the template.yaml below and Create Stack.
template.yaml URL: https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/redshiftsetup/redshift.yml
```
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/redshiftsetup/redshift.yml --output template.yaml
# Stack details:
# Stack name: appflow-redshift-workshop
# ClassB: 20                               # the 2nd octet of the VPC's IPv4 CIDR block
# DataBucketName: redshift-appflow-bucket
# Database Name: workshop
# MasterUserPassword: Workshop101          # at least 8 characters, with at least one uppercase letter, one lowercase letter, and one digit (write it down)
# MasterUsername: admin                    # the user name for the administrator
# RedshiftNodeCount: 1                     # the number of nodes in the Redshift cluster
```
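If you prefer the CLI to the console, a minimal sketch of the same deployment (the parameter keys below are inferred from the labels above and must match the template's actual parameter names; `--capabilities` is passed in case the template creates IAM resources):
```
aws cloudformation create-stack \
  --stack-name appflow-redshift-workshop \
  --template-body file://template.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=ClassB,ParameterValue=20 \
    ParameterKey=DataBucketName,ParameterValue=redshift-appflow-bucket \
    ParameterKey=DatabaseName,ParameterValue=workshop \
    ParameterKey=MasterUsername,ParameterValue=admin \
    ParameterKey=MasterUserPassword,ParameterValue=Workshop101 \
    ParameterKey=RedshiftNodeCount,ParameterValue=1
```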
* Glue Workshop: download the template below and Create Stack.
Template URL: https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml
```
curl https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml --output GlueImmersionDay-WithVPCv2Prod-local-v1.yaml
# Stack details:
# Stack name: glueworkshop
```
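The Glue stack can be deployed the same way (a sketch; `CAPABILITY_NAMED_IAM` is assumed because the template creates IAM roles such as AWSGlueServiceRole-glueworkshop):
```
aws cloudformation create-stack \
  --stack-name glueworkshop \
  --template-body file://GlueImmersionDay-WithVPCv2Prod-local-v1.yaml \
  --capabilities CAPABILITY_NAMED_IAM
```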
# AWS Management Console
After logging in, follow the instructor's walkthrough and practice with the following. A CLI illustration of the Region concept follows the list.
* Console navigation
* The Region concept
* Global services
* Basic EC2 and S3 demos
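For example, the AWS CLI can list every Region available to the account (runs as-is in CloudShell or Cloud9):
```
# List all Regions the account can use; global services are not tied to a single Region
aws ec2 describe-regions --query 'Regions[].RegionName' --output table
```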
## Exercises
* [EC2 Linux Hands on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/10-ec2/ec2-linux)
* [Amazon S3 Hands on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/60-s3/s3)
* [IAM Hands-on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/30-iam/iam)
# Amazon AppFlow
Amazon AppFlow is a fully managed, secure data integration service for SaaS applications and AWS services, including Salesforce, Marketo, Slack, ServiceNow, Amazon S3, and Redshift.
In this lab we transfer data one-way from Salesforce into Redshift. To prepare the environment, first run the Lab Setup | Manual Lab Setup steps, which use CloudFormation to create Redshift automatically, then create the tables the later steps need. After that, use Amazon AppFlow to move the data from Salesforce into Redshift.
* Lab tutorial: [Amazon AppFlow Workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce)
* Only the Salesforce signup and Salesforce to Redshift labs under the Salesforce module are required; the rest are optional self-practice.
## [Salesforce Signup](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce-signup)
* Sign up for a free, permanently valid Salesforce Developer account. The Email can be a personal address you can receive mail at right now; the Username must be unique and does not need to be a valid email (preferably not your company email), e.g., your company email with .sandbox appended.
* After the account is created, follow the tutorial steps to enable CDC (switch to the English UI, go to Setup, and search for Change Data Capture).
* Then search for Connected Apps OAuth Usage; Amazon AppFlow Embedded Login App will appear under Connected App. Click Actions | Install.
## Salesforce to Redshift | Lab Setup | [Manual Lab Setup](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/redshiftsetup/manual-setup)
* Follow the lab steps to download template.yaml and use CloudFormation to create Redshift, the S3 bucket, and the other resources the lab needs.
* Once the stack is complete, open Redshift through the Query Editor and create the required tables (account, contact, opportunity).
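If you want to double-check the tables from a terminal instead, a sketch using the Redshift Data API (replace <cluster-id> with your cluster identifier; the lab itself uses the Query Editor):
```
# Submit the query and capture the statement id
STATEMENT_ID=$(aws redshift-data execute-statement \
  --cluster-identifier <cluster-id> \
  --database workshop --db-user admin \
  --sql "SELECT DISTINCT tablename FROM pg_table_def WHERE schemaname = 'public';" \
  --query Id --output text)
# Fetch the result once the statement finishes
aws redshift-data get-statement-result --id "$STATEMENT_ID"
```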
## Salesforce to Redshift | [Create Flow](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/createflow)
* Follow the lab steps through step 12. Then go to IAM Roles first and create a new role (RedshiftFullAccessForAppflow) with the AmazonRedshiftFullAccess policy attached. A custom trust policy template follows:
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "appflow.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
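The same role can be created from the CLI (a sketch; save the trust policy above as trust-policy.json first):
```
aws iam create-role \
  --role-name RedshiftFullAccessForAppflow \
  --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy \
  --role-name RedshiftFullAccessForAppflow \
  --policy-arn arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
```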
* Hints for step 13; follow the lab steps as written for everything else.
* Cluster Identifier: there is no need to enter a JDBC URL; enter the Redshift cluster identifier directly.
* Database name: workshop
* Bucket details: choose the bucket whose name starts with redshift-appflow-bucket-
* IAM role for Amazon S3 Access: AppflowRedshiftWorkshopRole
* IAM role for Amazon Redshift Data API Access: RedshiftFullAccessForAppflow
* Amazon Redshift database user name: admin
* Connection name: workshop-redshiftcluster
* Starting at step 17 you will create three flows. You can build the field mappings manually, or download the mapping CSVs below and upload them (Upload .csv file with mapping fields):
```
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-account.csv --output redshift-field-mapping-account.csv
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-contact.csv --output redshift-field-mapping-contact.csv
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-opportunities.csv --output redshift-field-mapping-opportunities.csv
```
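Once the three flows exist, you can confirm them from the CLI (a sketch):
```
aws appflow list-flows --query 'flows[].flowName' --output table
```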
## Salesforce to Redshift | [Run Flow](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/runflow)
* Following the steps, run each of sf-account-redshift, sf-contact-redshift, and sf-opportunity-redshift once.
* Re-running a flow will duplicate the data; a quick duplicate check appears below.
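If you suspect a flow ran more than once, a sketch of a duplicate check through the Redshift Data API (assumes the Salesforce Id field was mapped to a column named id; replace <cluster-id>):
```
aws redshift-data execute-statement \
  --cluster-identifier <cluster-id> \
  --database workshop --db-user admin \
  --sql "SELECT id, COUNT(*) AS copies FROM account GROUP BY id HAVING COUNT(*) > 1;"
```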
## Salesforce to Redshift | [Visualise Salesforce Data](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/querydata) (Optional)
* This lab is an optional extension that shows how to visualize the Redshift data in QuickSight.
# Glue (Lab)
## Setup
### Deploy CloudFormation Template (should already be done)
[download_link](https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml)
[ref](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/cloudformation)
* CloudFormation console
* Create Stack
* Upload a template file
* Stack name : ```glueworkshop```
* Wait about 10 minutes, or use the CLI wait below.
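Rather than watching the console, you can block until the stack is ready (a sketch):
```
aws cloudformation wait stack-create-complete --stack-name glueworkshop
```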
### Prepare S3 Bucket and Clone Files
* Cloud9 console
* glueworkshop > Open IDE > Open
* New Terminal

* Run the commands below (if you get an error, copy the command from point 3 [here](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/s3-and-local-file) instead):
```
cd ~/environment/
# The $(curl ...) inside the banner string downloads one-step-setup.sh as a side effect
echo "==================> DOWNLOADING & EXECUTING THE ONE-STEP-SETUP SCRIPT <====================
$(curl -s 'https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/download/howtostart/awseevnt/s3-and-local-file/one-step-setup.sh' --output ~/environment/glue-workshop/library/one-step-setup.sh --create-dirs)
==========================================================================================="
# Source (.) the script so any environment variables it sets, e.g. BUCKET_NAME, persist in this shell
. ./glue-workshop/library/one-step-setup.sh 'https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/0/'
```
Or grab the script command directly from https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/s3-and-local-file.
## Working with Glue Data Catalog
Crawl the schema of the files in Amazon S3 and store it in the Data Catalog; think of it as building an index/catalog of the data in the Data Catalog.
### Using AWS Console
* Demo : Crawlers, Database, Tables (the console UI is being updated; the old version lives under Legacy pages)
* Glue console > Databases > Create Database ```console_glueworkshop```
* Glue console > Crawler > Create Crawler ```console-lab1```
* Choose data sources and classifiers > Not Yet
Add a data source > ```s3://${BUCKET_NAME}/input/lab1/```
Add a data source > ```s3://${BUCKET_NAME}/input/lab5/json/```
* Choose an IAM role > Choose an existing IAM role > ```AWSGlueServiceRole-glueworkshop```
* Set output and scheduling
Output configuration > ```console_glueworkshop```
Prefix > ```console_```
Crawler schedule > On demand
* Create and Run
[ref](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab1)
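Once the crawler finishes, a quick way to confirm what landed in the Data Catalog (a sketch, run from Cloud9):
```
# List the tables the crawler created in the workshop database
aws glue get-tables --database-name console_glueworkshop --query 'TableList[].Name' --output table
```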
## Working with Glue Studio (ETL job)
### Basic Transformations
* Glue studio > Jobs > Create job > Visual with a blank canvas
* Job name : ```glueworkshop-lab5-basic-job```
* Job Details : initial settings
* IAM role : ```AWSGlueServiceRole-glueworkshop```
* Glue version : ```Glue 3.0 - Support Spark 3.1, Scala 2, Python 3```
* Requested number of workers : ```4```
* Number of retries : 0
* Follow [Step 7~15](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab6/etl/basic-transform)
* Data Source
* Drop Fields
* Filter
* Change Schema (Apply Mapping)
* Join
* Aggregate
* Data Target
* Run the commands below (in Cloud9) to inspect the output data:
```
aws s3 cp s3://${BUCKET_NAME}/output/lab5/ ~/environment/glue-workshop/output/lab5 --recursive
cat ~/environment/glue-workshop/output/lab5/* > ~/environment/glue-workshop/output/lab5/all.json
```
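You can also trigger and monitor the job from the CLI instead of the Glue Studio Run button (a sketch; assumes the job name chosen above):
```
# Start the job and capture the run id
RUN_ID=$(aws glue start-job-run --job-name glueworkshop-lab5-basic-job \
  --query JobRunId --output text)
# Poll the run state (SUCCEEDED when done)
aws glue get-job-run --job-name glueworkshop-lab5-basic-job \
  --run-id "$RUN_ID" --query 'JobRun.JobRunState'
```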
### Advanced Transformations
* Glue studio > Jobs > Create job > Visual with a blank canvas
* Job name : ```glueworkshop-lab5-advance-job```
* Job Details : initial settings
* IAM role : ```AWSGlueServiceRole-glueworkshop```
* Glue version : ```Glue 3.0 - Support Spark 3.1, Scala 2, Python 3```
* Requested number of workers : ```4```
* Number of retries : 0
* Follow [Step 5~25](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab6/etl/advanced-transform)
* Data Source (COVID/STATE)
* SQL Query
* Output schema : Data preview > Use data preview schema
* Custom transform : the output contains two DataFrames
* Select From Collection : pick one of the two DataFrames (df0/df1)
* df0 > SQL Query > Target
* df1 > Custom transform > Select From Collection > Target
* Run the commands below (in Cloud9) to inspect the output data:
```
aws s3 cp s3://${BUCKET_NAME}/output/lab5/ ~/environment/glue-workshop/output/lab5 --recursive
cat ~/environment/glue-workshop/output/lab5/advanced/aggregate/* > ~/environment/glue-workshop/output/lab5/advanced/aggregate/all.json
cat ~/environment/glue-workshop/output/lab5/advanced/pivot/* > ~/environment/glue-workshop/output/lab5/advanced/pivot/all.json
```