[Starlike|摩曼頓 x AWS Immersion Day (Feb 2)]
===

# Related Links
HackMD short URL: https://pse.is/4r5ckk
Event Engine: https://dashboard.eventengine.run/login?hash=7ede-158fcd97c4-75
CSAT: https://survey.immersionday.com/UULRBN04g

# Agenda

|Time|Topic|Course content|Speaker|
|-|-----|--------------|------|
|09:00~09:40|Introduction to Amazon Cloud & Security|AWS Services, EC2, RDS, IAM and Shared Responsibility Model|AWS Dennis|
|09:40~10:20|Lab: Prepare the Lab environment|Management Console, Region, etc.|AWS Dennis|
|10:20~10:30|Break|||
|10:30~10:50|Amazon AppFlow Introduction||AWS Dennis|
|10:50~11:50|Lab: Salesforce to Redshift|Create a Flow to ingest data from Salesforce to Redshift|AWS Dennis|
|12:00~13:00|Lunch|||
|13:00~13:30|Modern Data Strategy||AWS Gavin|
|13:30~14:10|AWS Glue Introduction|Glue, Glue Studio, Glue Data Catalog|AWS Gavin|
|14:10~14:50|Lab: Working with Glue|Working with Glue Data Catalog|NL Alice|
|14:50~15:00|Break|||
|15:00~15:40|Lab: Working with Glue|Working with Glue Studio|NL Alice|
|15:40~16:20|AWS Database Migration Service Introduction|With demo|NL Alice|
|16:20~17:00|Q&A|||

# Start with AWS EventEngine (Workshop Studio)
* Log in to the EventEngine environment with the short URL provided by the instructor. For detailed steps, see [Log in with AWS Workshop Portal](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/preparation-guide/20-event-engine).

## Deploy the CloudFormation stacks in advance
To save class time later, create all the required CloudFormation stacks now:

* AppFlow Workshop: download the template.yaml below and Create Stack.
  template.yaml URL: https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/redshiftsetup/redshift.yml
```
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/redshiftsetup/redshift.yml --output template.yaml

Stack details:
Stack name: appflow-redshift-workshop
ClassB: 20 -- the 2nd octet of the IPv4 CIDR block for the VPC
DataBucketName: redshift-appflow-bucket
Database Name: workshop
MasterUserPassword: Workshop101 -- at least 8 characters, including at least one uppercase letter, one lowercase letter, and one number (write it down)
MasterUsername: admin -- the user name for the administrator
RedshiftNodeCount: 1 -- the number of nodes in the Redshift cluster
```
* Glue Workshop: download https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml
```
curl https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml --output GlueImmersionDay-WithVPCv2Prod-local-v1.yaml

Stack details:
Stack name: glue-workshop
```
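If you prefer the CLI over the console, the stacks can also be created with the AWS CLI. Below is a minimal sketch for the AppFlow template; the parameter keys are assumptions taken from the list above, so verify them against the downloaded template.yaml before running.

```
# Sketch: create the AppFlow workshop stack from the CLI.
# Parameter keys are assumed from the notes above -- confirm them in
# template.yaml (e.g. the database-name key may not be spelled "DatabaseName").
aws cloudformation create-stack \
  --stack-name appflow-redshift-workshop \
  --template-body file://template.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=ClassB,ParameterValue=20 \
    ParameterKey=DataBucketName,ParameterValue=redshift-appflow-bucket \
    ParameterKey=DatabaseName,ParameterValue=workshop \
    ParameterKey=MasterUsername,ParameterValue=admin \
    ParameterKey=MasterUserPassword,ParameterValue=Workshop101 \
    ParameterKey=RedshiftNodeCount,ParameterValue=1

# Block until the stack finishes creating.
aws cloudformation wait stack-create-complete --stack-name appflow-redshift-workshop
```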
# AWS Management Console
After logging in, follow the instructor's walkthrough and practice with the services below.
* Console navigation
* The concept of Regions
* Global Services
* Basic demos of EC2, S3, etc.

## Exercises
* [EC2 Linux Hands on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/10-ec2/ec2-linux)
* [Amazon S3 Hands on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/60-s3/s3)
* [IAM Hands-on Lab](https://catalog.us-east-1.prod.workshops.aws/workshops/f3a3e2bd-e1d5-49de-b8e6-dac361842e76/en-US/basic-modules/30-iam/iam)

# Amazon AppFlow
Amazon AppFlow is a fully managed, secure data-integration service for SaaS applications. It supports a wide range of SaaS applications and AWS services, such as Salesforce, Marketo, Slack, and ServiceNow, as well as Amazon S3 and Redshift.

In this lab we will transfer data one-way from Salesforce into Redshift. To prepare the environment, first work through Lab Setup | Manual Lab Setup, which uses CloudFormation to create Redshift automatically, then create the tables needed later. After that, use Amazon AppFlow to move the data from Salesforce into Redshift.

* Lab tutorial: [Amazon AppFlow Workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce)
* Only the Salesforce signup and Salesforce to Redshift labs under the Salesforce module are required; the rest are optional self-practice.

## [Salesforce Signup](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce-signup)
* Sign up for a permanently valid (free) Salesforce Developer account. For the email, use a personal address you can receive mail at right now; for the Username, use a unique, "non-working" email address (preferably not your company email), e.g. your company email with .sandbox appended.
* Once the account is created, follow the tutorial steps to enable CDC (switch to the English UI, go to Setup, and search for Change Data Capture).
* Then search for Connected Apps OAuth Usage; Amazon AppFlow Embedded Login App will appear under Connected Apps. Click Actions | Install.

## Salesforce to Redshift | Lab Setup | [Manual Lab Setup](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/redshiftsetup/manual-setup)
* Follow the steps in the lab to download template.yaml and use CloudFormation to create the Redshift cluster, S3 bucket, and the rest of the environment the lab needs.
* Once created, connect to Redshift through the Query Editor and create the required tables (account, contact, opportunity).

## Salesforce to Redshift | [Create Flow](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/createflow)
* Follow the lab steps up to step 12, then go to IAM and create a new role (RedshiftFullAccessForAppflow) with the AmazonRedshiftFullAccess policy attached. A custom trust policy template:
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "appflow.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
* Hints for step 13 (follow the lab steps for everything else):
    * Cluster Identifier: no need to enter the JDBC URL; enter the Redshift cluster identifier directly.
    * Database name: workshop
    * Bucket details: choose the bucket whose name starts with redshift-appflow-bucket-
    * IAM role for Amazon S3 Access: AppflowRedshiftWorkshopRole
    * IAM role for Amazon Redshift Data API Access: RedshiftFullAccessForAppflow
    * Amazon Redshift database user name: admin
    * Connection name: workshop-redshiftcluster
* From step 17 on you will create three flows. You can build each field mapping by hand, or download the mapping CSVs below and upload them (Upload .csv file with mapping fields).
```
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-account.csv --output redshift-field-mapping-account.csv
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-contact.csv --output redshift-field-mapping-contact.csv
curl https://static.us-east-1.prod.workshops.aws/public/5616e427-7e79-4bab-a538-59af65e94efc/static/salesforce/salesforce2redshift/createflow/redshift-field-mapping-opportunities.csv --output redshift-field-mapping-opportunities.csv
```

## Salesforce to Redshift | [Run Flow](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/runflow)
* Following the steps, run sf-account-redshift, sf-contact-redshift, and sf-opportunity-redshift once each.
* Re-running a flow will duplicate the data. (A CLI sketch for spot-checking the loaded row counts appears at the end of this AppFlow section.)

## Salesforce to Redshift | [Visualise Salesforce Data](https://catalog.us-east-1.prod.workshops.aws/workshops/9787ec94-1ace-44cc-91e5-976ad7ddc0b1/en-US/salesforce/salesforce2redshift/querydata) (Optional)
* This lab is an extension exercise on visualising the Redshift data in QuickSight.
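To spot-check what the flows actually loaded without opening the Query Editor, you can query Redshift through the Data API. A minimal sketch, assuming the admin user and workshop database configured above; the cluster identifier placeholder is hypothetical and must be replaced with the one from your appflow-redshift-workshop stack.

```
# Hypothetical placeholder -- copy the real identifier from the Redshift console.
CLUSTER_ID=your-redshift-cluster-identifier

# Submit a count query through the Redshift Data API and capture its statement id.
ID=$(aws redshift-data execute-statement \
  --cluster-identifier "$CLUSTER_ID" \
  --database workshop \
  --db-user admin \
  --sql "SELECT COUNT(*) FROM account" \
  --query Id --output text)

# Give the statement a moment to run, then check its status and result.
aws redshift-data describe-statement --id "$ID" --query Status
aws redshift-data get-statement-result --id "$ID" --query 'Records[0][0].longValue'
```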
# Glue (Lab)
## Setting
### Deploy CloudFormation Template (should already be done)
[download_link](https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/cloudformation/GlueImmersionDay-WithVPCv2Prod-local-v1.yaml)
[ref](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/cloudformation)
* CloudFormation console
* Create Stack
* Upload a template file
* Stack name : ```glueworkshop```
* Wait about 10 mins.

### Prepare S3 Bucket and Clone Files
* Cloud9 console
* glueworkshop > Open IDE > Open
* New Terminal
![](https://i.imgur.com/Do2WNpE.png)
* Run the script below (if it errors, copy the command from point 3 [here](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/s3-and-local-file) instead)
```
cd ~/environment/
echo "==================> DOWNLOADING & EXECUTING THE ONE-STEP-SETUP SCRIPT <==================== $(curl -s 'https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/download/howtostart/awseevnt/s3-and-local-file/one-step-setup.sh' --output ~/environment/glue-workshop/library/one-step-setup.sh --create-dirs) ==========================================================================================="
. ./glue-workshop/library/one-step-setup.sh 'https://static.us-east-1.prod.workshops.aws/public/eb97e639-1cd3-42b0-a198-72f6871fd7ce/static/0/'
```
Alternatively, grab the script command from https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/howtostart/selfpacedlabs/s3-and-local-file.

## Working with Glue Data Catalog
Crawl the schemas of the files in S3 and store them in the Data Catalog. You can think of this as "building an index/directory of the data in the Data Catalog".

### Using AWS Console
* Demo : Crawlers, Database, Tables (the console UI is gradually being updated; the old version is under Legacy pages)
* Glue console > Databases > Create Database ```console_glueworkshop```
* Glue console > Crawler > Create Crawler ```console-lab1```
* Choose data sources and classifiers > Not yet
    * Add a data source > ```s3://${BUCKET_NAME}/input/lab1/```
    * Add a data source > ```s3://${BUCKET_NAME}/input/lab5/json/```
* Choose an IAM role > Choose an existing IAM role > ```AWSGlueServiceRole-glueworkshop```
* Set output and scheduling
    * Output configuration > ```console_glueworkshop```
    * Prefix > ```console_```
    * Crawler schedule > On demand
* Create and Run

[ref](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab1)

## Working with Glue Studio (ETL job)
### Basic Transformations
* Glue Studio > Jobs > Create job > Visual with a blank canvas
* Job name : ```glueworkshop-lab5-basic-job```
* Job Details : initial settings
    * IAM role : ```AWSGlueServiceRole-glueworkshop```
    * Glue version : ```Glue 3.0 - Supports Spark 3.1, Scala 2, Python 3```
    * Requested number of workers : ```4```
    * Number of retries : 0
* Follow [Step 7~15](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab6/etl/basic-transform)
    * Data Source
    * Drop Fields
    * Filter
    * Change Schema (Apply Mapping)
    * Join
    * Aggregate
    * Data Target
* Run the commands below (in Cloud9) to inspect the output data; a job-status sketch follows the block.
```
aws s3 cp s3://${BUCKET_NAME}/output/lab5/ ~/environment/glue-workshop/output/lab5 --recursive
cat ~/environment/glue-workshop/output/lab5/* > ~/environment/glue-workshop/output/lab5/all.json
```
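If the copy above comes back empty, first confirm the job actually finished. A small sketch, assuming the job name above and the ${BUCKET_NAME} variable exported by the one-step setup script:

```
# Check the state of the most recent run (RUNNING / SUCCEEDED / FAILED).
aws glue get-job-runs \
  --job-name glueworkshop-lab5-basic-job \
  --max-results 1 \
  --query 'JobRuns[0].[JobRunState,StartedOn,ExecutionTime]'

# List what the job wrote before concatenating the files.
aws s3 ls "s3://${BUCKET_NAME}/output/lab5/" --recursive
```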
### Advanced Transformations
* Glue Studio > Jobs > Create job > Visual with a blank canvas
* Job name : ```glueworkshop-lab5-advance-job```
* Job Details : initial settings
    * IAM role : ```AWSGlueServiceRole-glueworkshop```
    * Glue version : ```Glue 3.0 - Supports Spark 3.1, Scala 2, Python 3```
    * Requested number of workers : ```4```
    * Number of retries : 0
* Follow [Step 5~25](https://catalog.us-east-1.prod.workshops.aws/workshops/ee59d21b-4cb8-4b3d-a629-24537cf37bb5/en-US/lab6/etl/advanced-transform)
    * Data Source (COVID/STATE)
    * SQL Query
        * Output schema : Data preview > Use data preview schema
    * Custom transform : the output contains two DataFrames
    * Select From Collection : pick one of the two DataFrames (df0/df1)
    * df0 > SQL Query > Target
    * df1 > Custom transform > Select From Collection > Target
* Run the commands below (in Cloud9) to inspect the output data; a CLI sketch for re-running the job follows the block.
```
aws s3 cp s3://${BUCKET_NAME}/output/lab5/ ~/environment/glue-workshop/output/lab5 --recursive
cat ~/environment/glue-workshop/output/lab5/advanced/aggregate/* > ~/environment/glue-workshop/output/lab5/advanced/aggregate/all.json
cat ~/environment/glue-workshop/output/lab5/advanced/pivot/* > ~/environment/glue-workshop/output/lab5/advanced/pivot/all.json
```
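If a run fails or you want to re-trigger the advanced job without going back into Glue Studio, it can also be started and polled from the CLI. A minimal sketch, assuming the job name above:

```
# Start the advanced job and capture the run id.
RUN_ID=$(aws glue start-job-run \
  --job-name glueworkshop-lab5-advance-job \
  --query JobRunId --output text)

# Poll the run state; repeat until it reports SUCCEEDED (or FAILED).
aws glue get-job-run \
  --job-name glueworkshop-lab5-advance-job \
  --run-id "$RUN_ID" \
  --query 'JobRun.JobRunState'
```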