# Serverless User Behaviour Analytic on AWS

# Outline

## 1. [Introduction](#1introduction)
## 2. [User behaviour collection](#2user-behaviour-collection)
### 2.1. [Setup Kinesis Stream](#21setup-kinesis-stream)
### 2.2. [Setup Cognito User Pool](#22setup-cognito-user-pool)
### 2.3. [Setup Cognito Identity Pool](#23setup-cognito-identity-pool)
### 2.4. [Grant permission for Cognito Identity Pool Role](#24grant-permission-for-cognito-identity-pool-role)
### 2.5. [Setup static website on S3](#25setup-static-website-on-s3)
### 2.6. [Demo User Behaviour Collection](#26demo-user-behaviour-collection)
## 3. [Serverless Realtime Analytic](#3serverless-realtime-analytic)
### 3.1. [Preparation](#31preparation)
#### 3.1.1. [Create DynamoDB Tables](#311create-dynamodb-tables)
#### 3.1.2. [Create S3 Bucket](#312create-s3-bucket)
#### 3.1.3. [Create Kinesis Stream](#313create-kinesis-stream)
#### 3.1.4. [Create IAM Role](#314create-iam-role)
### 3.2. [Realtime Data Processing](#32realtime-data-processing)
#### 3.2.1. [Setup Kinesis Data Analytic Application](#321setup-kinesis-data-analytic-application)
#### 3.2.2. [Lambda Sync Kinesis to DynamoDB](#322lambda-sync-kinesis-to-dynamodb)
### 3.3. [Deploy Dashboard to S3](#33deploy-dashboard-to-s3)
#### 3.3.1. [Setup Cognito Pools](#331setup-cognito-pools)
#### 3.3.2. [Deploy Dashboard to S3](#332deploy-dashboard-to-s3)
#### 3.3.3. [Dashboards](#333dashboards)
## 4. [Serverless Historical Analytic](#4serverless-historical-analytic)
### 4.1. [Preparation](#41preparation)
#### 4.1.1. [Create S3 Bucket](#411create-s3-bucket)
#### 4.1.2. [Create IAM Role](#412create-iam-role)
### 4.2. [Historical Data Processing](#42historical-data-processing)
#### 4.2.1. [Create Lambda Function](#421create-lambda-function)
#### 4.2.2. [Setup Firehose Raw Ingestion](#422setup-firehose-raw-ingestion)
#### 4.2.3. [Setup Glue ETL](#423setup-glue-etl)
#### 4.2.4. [Play with Athena](#424play-with-athena)
### 4.3. [Dashboards with Quicksight](#43dashboards-with-quicksight)
#### 4.3.1. [Setup Quicksight](#431setup-quicksight)
#### 4.3.2. [Create Dashboard in Quicksight](#432create-dashboard-in-quicksight)
## 5. [Clean up resources](#5clean-up-resources)

# Guide Content

## 1.Introduction

Currently, many companies want to become data-driven, meaning every decision is made based on data. Companies like Amazon, TikTok, Netflix, Uber, Facebook, and Google are successful examples of this strategy: they use user data (user behaviour) to significantly improve the user experience of their products. Cloud technologies, and serverless technologies in particular, open up many opportunities for any company to collect and use its user data to improve its business.

In this hands-on lab, we're going to use AWS serverless technologies to deploy the following user behaviour analytics use cases:

* Deploying a static book store website and collecting user behaviours on the website with: S3, Cognito User/Identity Pool, Kinesis Data Stream and KPL.
* Building a Serverless Realtime Analytic pipeline to analyze realtime user behaviours with: Kinesis Data Stream, Kinesis Data Analytics, Lambda Function, DynamoDB, and an S3 dashboard with Cognito User/Identity Pool.
* Building a Serverless Historical Analytic pipeline to analyze users' historical behaviours with: Kinesis Data Stream, Kinesis Firehose, Lambda Function, S3, Glue ETL, Athena and Quicksight.

Below is the architecture design on AWS Cloud Serverless.
To get more practice with the AWS services, we will configure everything in the AWS Console instead of using AWS CloudFormation.

![Serverless Analytics Architecture](https://i.imgur.com/qQuUY3w.png)

**Let's go !!!**

## 2.User behaviour collection

In this part, we deploy a static website to S3. Users authenticate with the website through a Cognito User Pool. The website collects some user behaviours and sends them to a Kinesis Data Stream for analysis in the later parts.

### 2.1.Setup Kinesis Stream

To set up the **Kinesis Stream**, follow these steps:
1. Access the **Kinesis Console**
2. In the left navigation bar, click **Data Streams**
![Imgur](https://i.imgur.com/xAIG5M2.png)
3. In the **Data Stream** screen, click **Create data stream**
4. In the **Create data stream** screen, enter:
   - **Data stream name**: ```WebTrackingStreamxxx``` (xxx is a random suffix of your choice)
   - In the **Capacity mode** section, choose ```Provisioned```
   - **Provisioned shards**: ```1```
![Imgur](https://i.imgur.com/uWi9oHF.png)
5. Scroll to the bottom of the page and click **Create data stream**
![Imgur](https://i.imgur.com/80hfR9i.png)

### 2.2.Setup Cognito User Pool

To set up the **Cognito User Pool**, follow these steps:
1. Access the **Cognito Console**
2. On the Cognito main page, choose **Manage User Pools**
![Imgur](https://i.imgur.com/VgKpT0G.png)
3. In the **Your user pools** screen, click **Create a user pool**
![Imgur](https://i.imgur.com/XT21VIW.png)
4. In the **Create a user pool** screen, enter:
   - In the **Cognito user pool sign-in options** section: choose ***User name***
   - Scroll to the bottom of the page and click **Next**
   - In the **Configure security requirements** screen, in the **Multi-factor authentication** section choose **No MFA**, then click **Next**
   - In the **Configure sign-up experience** screen, leave the default values and click **Next**
   - In the **Configure message delivery** screen, under Email -> Email provider, choose **Send email with Cognito**, then click **Next**
   - In the **Integrate your app** screen:
     - In the **User pool name** section, fill **User pool name**: ```WebAppTracking```
     - In the **Initial app client** section, fill **App client name**: ```bookstore```
     - Click **Next**
   - Review all configurations, then click **Create user pool**
5. Open the created user pool, scroll to the **Users** section and create a new user
   - In the **Invitation message** section choose **Don't send an invitation**
   - Fill **User name**: ```ironman```
   - In the **Temporary password** section choose **Set a password**
   - Fill **Password**: ```123qweA@```
![Imgur](https://i.imgur.com/QFEYnMd.png)
6. Change the password of the created user
   - You need an environment with the AWS CLI installed
   - Configure AWS credentials with the ***security credentials*** downloaded from ***IAM***
   - Run the command:
```
aws cognito-idp admin-set-user-password \
  --user-pool-id <user-pool-id> \
  --username <username> \
  --password <new-password> \
  --permanent \
  --profile <profile-aws-cli>
```
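If you prefer scripting over the CLI, the same step can be done with boto3. This is a minimal sketch, assuming your AWS credentials and region are already configured; the pool ID and password are placeholders:

```python
import boto3

# Same operation as the CLI command above, via the Cognito IdP API.
cognito_idp = boto3.client("cognito-idp", region_name="ap-southeast-1")

cognito_idp.admin_set_user_password(
    UserPoolId="<user-pool-id>",   # the WebAppTracking pool ID noted earlier
    Username="ironman",
    Password="<new-password>",
    Permanent=True,                # mark the password as permanent, not temporary
)
```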
### 2.3.Setup Cognito Identity Pool

To set up the **Cognito Identity Pool**, follow these steps:
1. Access the **Cognito Console**
2. On the Cognito main page, choose **Manage Identity Pools**
![Imgur](https://i.imgur.com/KtiP0SX.png)
3. In the **Create new identity pool** screen, fill in:
   - **Identity pool name**: ```DataStreamIdentityPool```
   - In the **Unauthenticated identities** section: choose **Enable access to unauthenticated identities**
![Imgur](https://i.imgur.com/usYkXPi.png)
4. In the **Authentication providers** section, on the **Cognito** tab, enter:
   - **User Pool ID**: `user_pool_id` *(the User Pool ID created in the previous step)*
![Imgur](https://i.imgur.com/SfzNcH9.png)
   - **App client id**: `app_client_id` *(the App client ID created in the previous step)*
![Imgur](https://i.imgur.com/snPSu1M.png)
5. Scroll to the bottom of the page and click **Create Pool**
6. In the **Identify the IAM roles to use with your new identity pool** screen, choose **Allow**
![Imgur](https://i.imgur.com/piwgr0p.png)
7. In the **Getting started with Amazon Cognito** screen, take note of the **Identity pool ID** for later

### 2.4.Grant permission for Cognito Identity Pool Role

1. Access the **IAM Console**
2. In the left navigation bar, click **Roles**
3. Search for ```Cognito_DataStreamIdentityPoolUnauth_Role``` and choose the role
4. In the **Permissions policies** section, click the **Add permissions** dropdown, then choose **Create inline policy**
5. In the **Create policy** screen, choose the **JSON** tab and paste the policy below:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "mobileanalytics:PutEvents",
                "cognito-sync:*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": "kinesis:Put*",
            "Resource": "arn:aws:kinesis:*:039935687098:stream/WebTrackingStream63B28502"
        }
    ]
}
```
6. Click **Review Policy**
7. Fill in the policy name ```allow-send-data-kinesis``` and click **Create Policy**
![Imgur](https://i.imgur.com/2IkqJjR.png)
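With this inline policy in place, the unauthenticated role can write tracking events to the stream. As a rough illustration of what the website's producer does (the real code is the JavaScript in js/event_producer.js; the event field names below are illustrative only), a record put looks like this in Python:

```python
import json
import boto3

# Assumed: credentials obtained through the Cognito Identity Pool unauthenticated role.
kinesis = boto3.client("kinesis", region_name="ap-southeast-1")

event = {
    "event_type": "view_book_detail",   # illustrative field names only
    "page": "/book/123",
    "datetime": "19/03/2022 10:15:00",
}

kinesis.put_record(
    StreamName="WebTrackingStreamxxx",   # your stream from section 2.1
    Data=json.dumps(event),
    PartitionKey="anonymous-user",       # any reasonably distributed key works
)
```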
### 2.5.Setup static website on S3

To set up the static bookstore website, download the source code and modify a few places in it.

> [Source code](https://ufile.io/zetaxtzr)

Extract the zip file and open the source folder in your IDE (*WebStorm*, ...) or editor (*Visual Studio Code*, ...) to modify the source code.

1. Open the file js/event_producer.js
   - Replace the string **REPLACE_IDENTITY_POOL_ID** with the ***identity pool ID*** noted in step 7 of section 2.3
   - Replace the string **REPLACE_STREAM_NAME** with the name of the stream created in section 2.1 (*WebTrackingStream63B28502*)
2. Open the file js/app.js
   - Replace the string **REPLACE_USER_POOL_ID** with the ***user pool ID*** created in section 2.2
![Imgur](https://i.imgur.com/SfzNcH9.png)
   - Replace the string **REPLACE_APP_CLIENT_ID** with the ***app client ID*** created in section 2.2
![Imgur](https://i.imgur.com/snPSu1M.png)
3. Create the S3 bucket
   - Access the **S3 Console**
   - Click **Create bucket**
   - Fill **Bucket name**: ```serverless-analytic-bookstore-xxx``` (xxx is a random suffix of your choice)
   - Choose **AWS Region**: ```ap-southeast-1```
   - Choose **Enable** for **Default encryption**
   - Scroll to the bottom of the page and click **Create bucket**
4. Configure the static website on S3
   - Choose the S3 bucket created in step 3
   - Choose the **Properties** tab, scroll to the **Static website hosting** section and click **Edit**
   - In the **Edit static website hosting** screen, fill **Index document**: ```index.html```
   - Scroll to the bottom of the page and click **Save changes**
   - Choose the **Permissions** tab; in the **Block public access (bucket settings)** section click **Edit**
   - In the **Edit Block public access (bucket settings)** screen, uncheck **Block all public access**
   - Click **Save changes**
   - Scroll to the **Bucket policy** section and click **Edit**
   - In the **Edit bucket policy** screen, paste the policy below:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::serverless-analytic-bookstore-xxx/*"
            ]
        }
    ]
}
```
   > Replace the bucket name with your own bucket name
   - Click **Save changes**
   - Choose the **Objects** tab, click **Upload** and upload all folders and files from the source code folder
![Imgur](https://i.imgur.com/TH9Pmgk.png)
   - To access the website, choose the **Properties** tab, scroll to the **Static website hosting** section and click the generated link: ```http://serverless-analytic-bookstore-xxx.s3-website-ap-southeast-1.amazonaws.com/```

### 2.6.Demo User Behaviour Collection

![Imgur](https://i.imgur.com/rNdnHUc.png)

Currently, the website collects these user behaviours:
1. Login
2. Logout
3. Add Cart
4. View Category
5. View Book Detail
6. Enter page

The logs are sent to the Kinesis Data Stream for processing.

![Imgur](https://i.imgur.com/H41PIY3.png)
![Imgur](https://i.imgur.com/yfXvySr.png)

## 3.Serverless Realtime Analytic

![Serverless Analytics Architecture](https://i.imgur.com/qQuUY3w.png)

In this part of the guide, you're going to set up a pipeline for realtime analytics. **Its design is the top branch in the above image.**
- Data from the raw Kinesis Data Stream is processed by a Kinesis Analytics application (SQL) over a time window, and the result is output to another Kinesis Data Stream.
- The output stream triggers a Lambda function that calculates metrics and puts records into DynamoDB.
- Dashboards are deployed on S3 and refresh data incrementally from DynamoDB every 10s.

Your pipeline will use the raw Kinesis stream from [2.User behaviour collection](#2user-behaviour-collection): ```WebTrackingStream{youruniquename}```

### 3.1.Preparation

#### 3.1.1.Create DynamoDB Tables

- You need to create 2 tables (see the sketch after this list for a scripted equivalent):
  - The Metrics Table contains data checkpoints for the output metrics.
  - The Metrics Detail Table contains the metrics data for our dashboard.
- Access DynamoDB using the AWS Console

![Create Dynamo](https://i.imgur.com/QZ3Ml55.png)

1. Create MetricsTable
   - Select **Create table**
   - Fill **Table name** with: ```MetricsTable-{youruniquename}```
   - Fill **Partition key** with: ```MetricType``` and select the ```String``` type.
   - Other fields are not required
   - Choose **Create table** to create the table.
![MetricsTable](https://i.imgur.com/pJ0WBvU.png)
   - Your challenge: add the following items to the table. They act as realtime markers for the dashboard, so it can refresh incrementally by adding the latest metrics instead of doing a full refresh.
![MetricsTable Item](https://i.imgur.com/lVCdvrP.png)
2. Create MetricsDetailTable
   - Select **Create table**
   - Fill **Table name** with: ```MetricsDetailTable-{youruniquename}```
   - Fill **Partition key** with: ```MetricType``` and select the ```String``` type.
   - Fill **Sort key** with: ```EventTimestamp``` and select the ```Number``` type.
   - Other fields are not required
![MetricsDetailTable](https://i.imgur.com/Q10R7OW.png)
   - Choose **Create table** to create the table.
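For reference, the same two tables could be created with boto3. This is only a sketch of the console steps above; replace `youruniquename` with your own suffix:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="ap-southeast-1")

# MetricsTable: partition key only.
dynamodb.create_table(
    TableName="MetricsTable-youruniquename",
    KeySchema=[{"AttributeName": "MetricType", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "MetricType", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)

# MetricsDetailTable: partition key plus numeric sort key.
dynamodb.create_table(
    TableName="MetricsDetailTable-youruniquename",
    KeySchema=[
        {"AttributeName": "MetricType", "KeyType": "HASH"},
        {"AttributeName": "EventTimestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "MetricType", "AttributeType": "S"},
        {"AttributeName": "EventTimestamp", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```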
#### 3.1.2.Create S3 Bucket

You need to create an S3 bucket for **hosting the realtime dashboard**. Access the S3 Console and do the following:
- Click **Create bucket**
- Fill **Bucket name**: ```serverless-analytic-realtimedashboard-{your-unique-name}```
- Select **AWS Region**: ```ap-southeast-1```
- Leave the other fields at their default values.
- Choose **Create bucket** to create the bucket.

![Create S3 Bucket for Dashboard](https://i.imgur.com/5WtfHth.png)

#### 3.1.3.Create Kinesis Stream

You need to create a Kinesis Stream that the Kinesis Analytics application can write the analyzed data to. From the AWS Console go to Kinesis and select **Data Streams**.

![](https://i.imgur.com/lbRYdpP.png)

- Choose **Create data stream**
- Enter **Data stream name**: ```UserBehaviourAnalysis-Output-{youruniquename}```
- For a real use case, you should carefully benchmark your application and choose the ```Capacity mode``` that fits your needs. For this practice case, the default **Capacity mode** of ```On-demand``` is fine.

![](https://i.imgur.com/YjBkSCe.png)

#### 3.1.4.Create IAM Role

1. Create the Kinesis Analytic Application Role
You need to create a Kinesis Analytic Application Role that allows the following actions:
- Read from the raw Kinesis Stream ```WebTrackingStream{youruniquename}```
- Write to the output Kinesis Stream ```UserBehaviourAnalysis-Output-{youruniquename}```

Go to the AWS Console and navigate to IAM.

![](https://i.imgur.com/v5GqljC.png)

- Choose **Create Role**
- In **Trusted entity type** choose **AWS Service**, then search for Kinesis and choose **Kinesis Analytics**.
![](https://i.imgur.com/a3vOhJA.jpg)
- Choose **Create Policy**
![](https://i.imgur.com/LPA4GY5.jpg)
- Choose **JSON** and paste the JSON below:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadInputKinesis",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords"
            ],
            "Resource": [
                "arn:aws:kinesis:ap-southeast-1:039935687098:stream/WebTrackingStream63B28502"
            ]
        },
        {
            "Sid": "WriteOutputKinesis",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:PutRecord",
                "kinesis:PutRecords"
            ],
            "Resource": [
                "arn:aws:kinesis:ap-southeast-1:039935687098:stream/UserBehaviourAnalysis-Output-youruniquename"
            ]
        }
    ]
}
```
- Choose **Next: Tags** and then **Next: Review**
- Fill **Name** with ```KinesisApplicationStreamPolicies-{youruniquename}``` and choose **Create Policy**
![Create policies](https://i.imgur.com/SH1wu2B.png)
- Go back to the Create Role page, press the **refresh** button and search for ```KinesisApplicationStreamPolicies-{youruniquename}```
![Kinesis App Policy](https://i.imgur.com/KBgyaq6.png)
- Choose that policy and click **Next**
- Fill **Role name** with ```KinesisApplicationStreamRole-{youruniquename}```
![](https://i.imgur.com/0gHWb1t.png)
- Choose **Create role** to finish.

2. Create the Lambda Role
You need to create a Lambda Role that allows the following actions:
- Read/write to the DynamoDB tables created in the previous step.
- Read from the Kinesis Stream created in the previous step.
- Follow the same steps as in ```Create Kinesis Analytic Application Role```, replacing the values below:
  - In **Trusted entity type** choose **AWS Service**
  - **Role name**: ```Lambda-ServerlessAnalytic-ProcessMetrics2DynamoDB-{youruniquename}-Role```
  - Use this JSON policy:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "kinesis",
            "Effect": "Allow",
            "Action": [
                "kinesis:PutRecord",
                "kinesis:PutRecords",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:DescribeStream",
                "kinesis:ListStreams",
                "kinesis:ListShards"
            ],
            "Resource": [
                "arn:aws:kinesis:ap-southeast-1:039935687098:stream/UserBehaviourAnalysis-Output-youruniquename"
            ]
        },
        {
            "Sid": "dynamodb",
            "Effect": "Allow",
            "Action": [
                "dynamodb:UpdateTable",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:Scan",
                "dynamodb:PutItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:ap-southeast-1:039935687098:table/MetricsTable-youruniquename",
                "arn:aws:dynamodb:ap-southeast-1:039935687098:table/MetricsDetailTable-youruniquename"
            ]
        },
        {
            "Sid": "logs",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:ap-southeast-1:039935687098:log-group:/aws/lambda/ServerlessAnalytic-ProcessMetrics2DynamoDB-youruniquename:*"
        }
    ]
}
```

### 3.2.Realtime Data Processing

#### 3.2.1.Setup Kinesis Data Analytic Application

1. Create the Kinesis Data Analytics application
From the AWS Console go to Kinesis and select **Analytics applications**.
- Choose **SQL applications (legacy)** so we can use SQL to code our analytics application.
![](https://i.imgur.com/RC5oeFD.jpg)
- Choose **Create SQL application (legacy)** and fill **Application Name** with ```UserBehaviourAnalysis-Application-{youruniquename}```
![](https://i.imgur.com/NCpRiMT.jpg)
- After creating the Kinesis Analytics application, you need to configure 3 parts:
  - **Source**
  - **Real-time Analytics**
  - **Destinations**
- Don't close this page; you will continue with it in the next steps.

2. Add a **source** to the Kinesis Data Analytics application
- In the Kinesis Data Analytics console from step 1, choose the **Source** tab.
- Choose the **Source** option with value ```Kinesis data stream```
- Choose **Browse** to pick the raw Kinesis stream.
![Raw Kinesis Stream](https://i.imgur.com/X4gmuVt.png)
- Choose ```WebTrackingStream{youruniquename}``` from the list of options.
![WebTrackingSystem](https://i.imgur.com/IyFslL9.png)
- In **Choose from IAM roles that Kinesis Data Analytics can assume**, choose the IAM role ```KinesisApplicationStreamRole-{youruniquename}```
- Choose **Discover schema** to let Kinesis discover the schema of the stream.
![Discover schema](https://i.imgur.com/jPnwaAt.jpg)
- Choose **Save changes** to save.

3. Add a **Destination** to the Kinesis Data Analytics application
- In the Kinesis Data Analytics console from step 1, choose the **Destinations** tab and click **Add Destination**
- Fill in the values below:
  - **Destination**: ```Kinesis data stream```
  - **Kinesis data stream**: browse to ```UserBehaviourAnalysis-Output-{youruniquename}```
  - **Choose from IAM roles that Kinesis Data Analytics can assume**: ```KinesisApplicationStreamRole-{youruniquename}```
  - **In-application stream name**: ```USER_BEHAVIOUR_ANALYSIS_STREAM```
  - **Output format**: ```JSON```
![Kinesis App Destination](https://i.imgur.com/5GX91Du.jpg)

4. Add the **Real-time Analytics** code to the Kinesis Data Analytics application
- In the Kinesis Data Analytics console from step 1, choose the **Real-time Analytics** tab.
- Choose **Configure**
![Configure Real-time Analytics 1](https://i.imgur.com/wQa9DfB.png)
- Paste the SQL code from this file [Kinesis App SQL Code](https://ufile.io/mbdm8swy) into the SQL editor:
![Configure Real-time Analytics 2](https://i.imgur.com/V0CJLo9.png)
- Choose **Save and run application** and wait for this status: ```Application UserBehaviourAnalysis-Application-youruniquename has been successfully updated.```
- Scroll down, choose the **Input** tab, then choose ```SOURCE_SQL_STREAM_001``` and wait for data from the input Kinesis Data Stream. If you don't see any data, check your producer in the web application.
![Input Kinesis App](https://i.imgur.com/DH6Q98P.jpg)
- Scroll down, choose the **Output** tab, then choose ```USER_BEHAVIOUR_ANALYSIS_STREAM``` and wait for the output from this Kinesis application. If you don't see any data, check ```error_stream``` for more error details.
![Output Kinesis App](https://i.imgur.com/WdO836b.jpg)
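Besides the console's **Output** tab, you can also sanity-check the output stream with a small script. This is a sketch that assumes a single shard and the stream name used above; it only prints whatever JSON records the SQL application emits:

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="ap-southeast-1")
stream = "UserBehaviourAnalysis-Output-youruniquename"

# Read from the first shard of the output stream, starting at the latest position.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

for _ in range(10):                      # poll a few times, then stop
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(json.loads(record["Data"]))   # each record is a JSON document emitted by the SQL application
    iterator = resp["NextShardIterator"]
    time.sleep(2)
```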
#### 3.2.2.Lambda sync Kinesis to DynamoDB

In this step, you will deploy a Lambda function. This Lambda function reads data from the Kinesis stream, does some transformation, and then feeds the results into the DynamoDB tables.

1. Create the Lambda function
- Go to the Lambda Console and choose **Create Function**
![Lambda Function](https://i.imgur.com/gUyPLtG.png)
- Fill in the options and fields with the values below:
  - Choose the option ```Author from scratch```
  - **Function Name**: ```ServerlessAnalytic-ProcessMetrics2DynamoDB-{youruniquename}```
  - **Runtime**: ```Node.js 14.x```
  - **Architecture**: ```x86_64```
![Lambda Function Settings](https://i.imgur.com/XWgfAGL.jpg)
- Choose **Create Function**

2. Set the Lambda function timeout
Your Lambda function does some transformation on the data, so it takes time. You need to increase its timeout.
- Choose the **Configuration** tab, then select **General configuration**
![Timeout config 1](https://i.imgur.com/Vb7Bfe5.png)
- Click **Edit**, increase **Timeout** to ```2 minutes```, then **Save**.
![Timeout config 2](https://i.imgur.com/g1EpQoH.png)

3. Configure your Lambda function
- Download this file: [lambda-processmetrics](https://ufile.io/n48xi5y1)
- Go to your newly created function
- Choose **Upload from** on the right side of the console and choose **.zip file** from the dropdown list.
- Upload the zip file you've just downloaded.
- Choose the **Configuration** tab and choose **Environment variables** on the left. Fill in the environment variables below, then choose **Save**.
  - **METRIC_TABLE**: ```MetricsTable-{youruniquename}```
  - **METRIC_DETAILS_TABLE**: ```MetricsDetailTable-{youruniquename}```
![Lambda Envs](https://i.imgur.com/2d6ccMY.png)
- Go back to the main page of the Lambda function and choose **Deploy**.
![Deploy Lambda function](https://i.imgur.com/2gDaTs3.png)
- Your function cannot run without a trigger, so you will configure a Kinesis trigger in the next step.

4. Add the Lambda trigger
- Choose the **Add Trigger** button.
![Kinesis Trigger](https://i.imgur.com/VkI0tjh.png)
- Search for ```Kinesis``` and choose it.
![Search Kinesis](https://i.imgur.com/AHxJEXo.png)
- In **Kinesis stream** (Select a Kinesis stream to listen for updates on), fill in ```UserBehaviourAnalysis-Output-{youruniquename}```
- Leave the other parameters at their defaults and choose **Add**. Wait for the Kinesis trigger status to become ```Enabled```
![Lambda Kinesis Trigger](https://i.imgur.com/u5R17T6.png)

5. Check your data
- Go to the DynamoDB table ```MetricsDetailTable-{youruniquename}``` and choose **Explore table items** to see the table data.
![MetricsDetailTable data](https://i.imgur.com/FDQTuuD.png)
- If you don't see any data, go back to the Lambda function and check the **Monitor** tab and **Logs**
![Lambda Logs](https://i.imgur.com/MjJhU33.png)
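For context, the packaged function in the zip is written in Node.js and its exact metric calculations may differ; the Python sketch below only illustrates the general shape of the processing (decode the Kinesis batch, write items to the detail table). Field names in the payload are assumptions for illustration:

```python
import base64
import json
import os
import boto3

dynamodb = boto3.resource("dynamodb")
details_table = dynamodb.Table(os.environ["METRIC_DETAILS_TABLE"])

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        details_table.put_item(
            Item={
                "MetricType": payload.get("metric_type", "UNKNOWN"),                 # assumed field name
                "EventTimestamp": int(record["kinesis"]["approximateArrivalTimestamp"]),
                "MetricValue": payload.get("metric_value", 0),                        # assumed field name
            }
        )
    return {"processed": len(event["Records"])}
```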
### 3.3.Deploy Dashboard to S3

#### 3.3.1.Setup Cognito Pools

In real use cases, you would set up a dedicated Cognito User Pool and Identity Pool for your dashboard. In this lab, we will reuse the Cognito setup from the previous part:
- [2.2.Setup Cognito User Pool](#22setup-cognito-user-pool)
- [2.3.Setup Cognito Identity Pool](#23setup-cognito-identity-pool)

You also need to add this policy to the Identity Pool's authenticated role (```Cognito_DataStreamIdentityPoolAuth_Role```):
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query"
            ],
            "Resource": [
                "arn:aws:dynamodb:ap-southeast-1:039935687098:table/MetricsTable-{youruniquename}",
                "arn:aws:dynamodb:ap-southeast-1:039935687098:table/MetricsDetailTable-{youruniquename}"
            ]
        }
    ]
}
```
![Cognito User - Scan DynamoDB](https://i.imgur.com/LhOONRj.png)
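The dashboard (the JavaScript in js/dash.js) uses temporary credentials from the Identity Pool to read these tables. A rough Python equivalent of such a read, shown only to illustrate what the policy above allows (the ```VISITOR_COUNT``` key value is an assumed example), looks like this:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed: credentials resolved through the Cognito Identity Pool authenticated role.
dynamodb = boto3.resource("dynamodb", region_name="ap-southeast-1")
table = dynamodb.Table("MetricsDetailTable-youruniquename")

# Fetch the latest metrics of one type, newest first (key value is illustrative).
resp = table.query(
    KeyConditionExpression=Key("MetricType").eq("VISITOR_COUNT"),
    ScanIndexForward=False,   # newest EventTimestamp first
    Limit=50,
)
for item in resp["Items"]:
    print(item["EventTimestamp"], item)
```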
#### 3.3.2.Deploy Dashboard to S3

Now you can host your realtime dashboard on your S3 bucket.

1. Configure the dashboard code
- Download this file: [user_hehaviour_analytic_dasboard](https://ufile.io/h5g7o4uk)
- Unzip it and update the placeholders in the file js/dash.js with your information:
  - COGNITO_APP_CLIENT_ID
  - COGNITO_USER_POOL_ID
  - COGNITO_IDENTITY_POOL_ID
  - MetricsTable-{youruniquename}
  - MetricsDetailTable-{youruniquename}

2. Deploy to S3
You will do the same as step 4 of [2.5.Setup static website on S3](#25setup-static-website-on-s3).
- Website code: use the code that you just downloaded and modified.
- Bucket name: ```serverless-analytic-realtimedashboard-{youruniquename}```
- Go to ```http://serverless-analytic-realtimedashboard-{youruniquename}.s3-website-ap-southeast-1.amazonaws.com/``` to view the dashboard

#### 3.3.3.Dashboards

Your dashboards are refreshed every 10s.

1. **Visitor Count** and **User Agents**
   - **Visitor Count** has a window time of ```60s```.
   - **User Agents** has a window time of ```20s```.
![Visitor Count and User Agents](https://i.imgur.com/U1BC3Kh.png)
2. **All Recent Events**
   - The **All Recent Events** count has a window time of ```20s```.
![All Recent Events](https://i.imgur.com/2KIVRkJ.png)
3. **Recent Events**
   - The **Recent Events** count has a window time of ```60s```.
![Recent Events](https://i.imgur.com/EysXnKD.png)
4. **Pages access count**
   - The **Pages access count** has a window time of ```20s```.
![Pages Count](https://i.imgur.com/1T8FHFq.png)

The window time can be changed in the Kinesis Analytics application code (SQL). Your challenge: modify the window time in the Kinesis application code to analyze data more frequently.

## 4.Serverless Historical Analytic

![Serverless Analytics Architecture](https://i.imgur.com/qQuUY3w.png)

In this part of the guide, you're going to set up a batch pipeline for historical analytics. **Its design is the bottom branch in the above image.**
- Data from the raw Kinesis Data Stream is written to an S3 bucket in JSON format by Kinesis Firehose. You are going to use a Lambda transform to add a ```\n``` character to every record before it is written out to S3; other than that, the raw data is left untouched.
- A Glue ETL job transforms the JSON data into Parquet format in another location and also partitions the data.
- Athena can query the data in S3 with the help of the Glue Data Catalog.
- You are going to use Quicksight to create some dashboards.

Your pipeline will use the raw Kinesis stream from [2.User behaviour collection](#2user-behaviour-collection): ```WebTrackingStream{youruniquename}```

### 4.1.Preparation

#### 4.1.1.Create S3 Bucket

1. Create a bucket for Firehose and Athena/Glue ETL
Access the S3 Console and do the following:
- Click **Create bucket**
- Fill **Bucket name**: ```serverless-analytic-adhoc-{your-unique-name}```
- Select **AWS Region**: ```ap-southeast-1```
- Leave the other fields at their default values.
- Choose **Create bucket** to create the bucket.
- You also need to create 3 folders in the bucket for Firehose and Athena/Glue ETL:
  - Firehose: ```serverless-analytic-adhoc-{your-unique-name}/raw_events```
  - Athena/Glue ETL: ```serverless-analytic-adhoc-{your-unique-name}/optimized_raw_events```
  - Athena logs: ```serverless-analytic-adhoc-{your-unique-name}/athena-query-logs```
![S3 Historical analytic folder](https://i.imgur.com/2MAMqIy.png)

#### 4.1.2.Create IAM Role

1. Glue Role: read/write to S3
- Go to the IAM console, choose **Roles** and **Create role**
- In the **Trusted entity type** section, select ```AWS service```
- In the **Use case** section, search for and select ```Glue``` (Allows Glue to call AWS services on your behalf), then click **Next**.
- Select **Create Policy** to open the policy page
- Paste this policy:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "glue",
            "Effect": "Allow",
            "Action": [
                "glue:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "role",
            "Effect": "Allow",
            "Action": [
                "iam:ListRolePolicies",
                "iam:GetRole",
                "iam:GetRolePolicy"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "s3",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketAcl",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name",
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name/*",
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name/raw_events",
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name/raw_events/*",
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name/optimized_raw_events",
                "arn:aws:s3:::serverless-analytic-adhoc-your-unique-name/optimized_raw_events/*",
                "arn:aws:s3:::aws-glue-assets-039935687098-ap-southeast-1",
                "arn:aws:s3:::aws-glue-assets-039935687098-ap-southeast-1/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:ap-southeast-1:039935687098:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:ap-southeast-1:039935687098:log-group:/aws-glue/jobs/UserLogs-GlueETLConvert-youruniquename:*"
            ]
        }
    ]
}
```
- Name it ```GlueETL-Convert-Policies-{youruniquename}``` and choose **Create policy**
![GlueETL-Convert-Policies](https://i.imgur.com/QBAcxwm.png)
- Go back to the previous role creation page (**Add permissions**) and press the **refresh** button. Search for and select the policy ```GlueETL-Convert-Policies-{youruniquename}```
![Search and select policy](https://i.imgur.com/fBdy79O.jpg)
- Choose **Next** and fill **Role name** with ```GlueETL-Convert-Role-{youruniquename}```
![](https://i.imgur.com/uEXyXQI.png)
- Choose **Create Role**

### 4.2.Historical Data Processing

#### 4.2.1.Create Lambda Function

1. Create the Lambda function
- Go to the Lambda Console and choose **Create Function**.
- Fill in the options and fields with the values below:
  - Choose the option ```Author from scratch```
  - **Function Name**: ```ServerlessAnalytic-Preprocess4Athena-{youruniquename}```
  - **Runtime**: ```Python 3.9```
  - **Architecture**: ```x86_64```
![Lambda Function Settings](https://i.imgur.com/TikJjGI.png)
- Choose **Create Function**

2. Set the Lambda function timeout
Your Lambda function does some transformation on the data, so it takes time. You need to increase its timeout.
- Choose the **Configuration** tab, then select **General configuration**
- Click **Edit**, increase **Timeout** to ```2 minutes```, then **Save**.

3. Configure your Lambda function
- Go to your newly created function
- Choose the **Code** section and paste the Python code from this file: [preprocess-4-athena](https://ufile.io/whu2n1aq)
- Choose **Deploy** to deploy the code.
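The linked file contains the function used in this lab. For reference only, a minimal Firehose record transform that appends a newline to each record generally looks like the sketch below; the actual file may do more (for example, enrich the events):

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record base64-encoded.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload + "\n"          # newline-delimited JSON, so Glue/Athena can split records
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                   # keep the record; "Dropped"/"ProcessingFailed" are the alternatives
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```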
#### 4.2.2.Setup Firehose Raw Ingestion

1. Go to the Kinesis console and select **Delivery streams**:
- Choose **Create delivery stream**
![](https://i.imgur.com/PXzcp9i.png)
- Choose the source and destination:
  - Source: ```Amazon Kinesis Data Streams```
  - Destination: ```S3```

2. Configure Kinesis Firehose
- In Source:
  - Choose the stream created for the raw events from the web application: ```WebTrackingStream{youruniquename}```
![Raw Stream](https://i.imgur.com/VN41i7L.png)
- In the **Delivery stream name** box, fill in your unique name: ```Kinesis-Firehose-logs-2-s3-{youruniquename}```
- In ```Transform and convert records``` > ```Transform source records with AWS Lambda``` > ```Data transformation```, choose Enabled, then choose your Lambda function ```ServerlessAnalytic-Preprocess4Athena-{youruniquename}```
![1](https://i.imgur.com/wtoP36B.jpg)
![2](https://i.imgur.com/TNusfDi.png)
- In **Destination settings** / **S3 bucket**, choose ```serverless-analytic-adhoc-{your-unique-name}``` and add the 2 prefixes below:
  - **Raw prefix**: ```raw_events/raw_user_logs/```
  - **S3 bucket error output prefix**: ```raw_events/error/```
![Destination settings](https://i.imgur.com/G9RQCFQ.png)
- Choose **Create** to create the delivery stream.
- Wait approximately 2 minutes, then go to the S3 subfolder to see the raw data: ```serverless-analytic-adhoc-{your-unique-name}/raw_events/raw_user_logs/```
![Firehose Output](https://i.imgur.com/qhKLOsU.png)
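Instead of browsing the console, you can also list the delivered objects with a short script. A sketch, using your own bucket name:

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")

# List the newest raw objects written by Firehose (replace the bucket name with your own).
resp = s3.list_objects_v2(
    Bucket="serverless-analytic-adhoc-your-unique-name",
    Prefix="raw_events/raw_user_logs/",
    MaxKeys=20,
)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Key"], obj["Size"])
```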
#### 4.2.3.Setup Glue ETL

You now have raw data in S3. It's important to keep the raw data in its original format, but JSON is not a good fit for analytics. You should convert the JSON data to Parquet, which works much better for big data workloads. In this part, you will use Glue ETL to convert the JSON data to Parquet format.

Search for **AWS Glue** in the AWS Console:
![AWS Glue](https://i.imgur.com/Nhs2LM6.png)

1. Create the Glue database
- In the left navigation bar, under the **Data catalog** section, select **Databases**
![AWS Glue Catalog](https://i.imgur.com/Z8vKxQa.png)
- Choose **Add database** and fill **Database name** with ```serverless_analytic```
- Select **Create** to create the database in Glue.

2. Configure the Glue ETL job
- In the left navigation bar, under the **ETL** section, select **Jobs (new)**
![AWS Glue ETL](https://i.imgur.com/dJsMyAt.png)
- In the new window, you should see something like this:
![Glue ETL Jobs](https://i.imgur.com/CFG4XrV.png)
- Keep the default option **Visual with a source and target** so you don't need to write any code; just drag and drop.
- Select **Create**.
![Glue ETL Graphs](https://i.imgur.com/9NDwlI6.png)
- In the center of the screen you will see a sample graph; clear it by selecting each node on the graph and choosing **Remove**
![Glue ETL Jobs](https://i.imgur.com/Up3jSc5.png)
- Select the **Job details** tab to configure the role for this job and fill in the values below:
  - **Name**: ```UserLogs-GlueETLConvert-{youruniquename}```
  - **IAM Role**: ```GlueETL-Convert-Role-{youruniquename}```
  - Uncheck the option **Generate job insights**
  - Other options can be left at their defaults.
- Choose **Save** on the right.
![Glue ETL Jobs config](https://i.imgur.com/tOGlsTS.png)

3. Add a **Source** to the Glue ETL job
The Glue ETL job needs an input; in our case it is an S3 bucket.
- Go to the **Visual** tab.
- Open the **Source** dropdown, search for ```Amazon S3``` and select it.
![Amazon S3 Source](https://i.imgur.com/fl3hzlA.png)
- Select **Data source - S3 bucket** on the graph and look at the panel on the right.
![Data source - S3 bucket](https://i.imgur.com/fYomNeQ.jpg)
- In the **Data source properties - S3** tab, fill out the form with the values below:
  - **S3 source type**: ```S3 location```
  - **S3 URL**: ```s3://serverless-analytic-adhoc-{your-unique-name}/raw_events/raw_user_logs/```
  - **Data format**: ```JSON```
- Choose **Infer schema** to infer the schema for the next steps.
![Data source properties - S3](https://i.imgur.com/75ieOtU.png)
- Choose the **Output schema** tab to see the output schema
![](https://i.imgur.com/ukH5nce.png)

4. Add a **Transformation** to the Glue ETL job
You could write this source directly to another S3 location, but since this is time-series data, we will partition it by date.
- Open the **Transform** dropdown, search for ```SQL``` and select it.
![Spark SQL code](https://i.imgur.com/s2U0jAo.png)
- In the **Transform** tab, paste this code into the **SQL query** form:
```sql
select *, year(dt) as year, month(dt) as month, day(dt) as day
from (
    select *, (to_date(datetime, 'dd/MM/yyyy HH:mm:ss') + interval 7 hour) as dt
    from myDataSource
)
```
- Switch to the **Data preview** tab to see sample output data.
![Preview data](https://i.imgur.com/h38JevA.png)

5. Add a **Target** to the Glue ETL job
The Glue ETL job also needs an output; in our case it is an S3 bucket.
- Go to the **Visual** tab.
- Open the **Target** dropdown, search for ```Amazon S3``` and select it.
![Amazon S3 Target](https://i.imgur.com/m1oouT9.png)
- Choose the node **Data target - S3 bucket**:
  - On the right, choose **Node properties**, deselect all **Node parents**, and select **Transform - SQL Query** as the node parent
![Target Node parents](https://i.imgur.com/ViQUQyz.png)
  - Choose the **Data target properties - S3** tab and paste this S3 location: ```s3://serverless-analytic-adhoc-{your-unique-name}/optimized_raw_events/athena_user_events/```
  - In **Data target properties - S3**:
    - **Format**: ```Parquet```
    - Select the option ```Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions``` to automatically register the S3 location in the Glue Catalog for Athena.
    - **Database**: ```serverless_analytic``` *(the database created in step 1)*
    - **Table name**: ```athena_user_events```
![S3-GlueCatalog](https://i.imgur.com/Hmz9fKz.png)
    - **Partition keys**: select the 3 fields in the following order: year > month > day
![Partition keys](https://i.imgur.com/Tk7mfbr.png)

6. Run the Glue ETL job
At the top right of the current screen:
- Choose **Save** then **Run** to start your first Glue ETL job.
- Check the **Runs** tab to view the job runs
![Glue ETL Jobs](https://i.imgur.com/09nh53z.png)

Go to the S3 location ```s3://serverless-analytic-adhoc-{your-unique-name}/optimized_raw_events/athena_user_events/``` to check the Glue ETL output:
![Glue ETL Jobs Output](https://i.imgur.com/uyNZPn3.png)

#### 4.2.4.Play with Athena

Search for **AWS Athena** in the AWS Console:
![AWS Athena](https://i.imgur.com/6ZS1h8r.png)

1. Configure Athena
- Choose **View settings**, then click **Manage**
![Manage settings Athena](https://i.imgur.com/MvHChpl.png)
- Paste this S3 location: ```s3://serverless-analytic-adhoc-{your-unique-name}/athena-query-logs```
- Choose **Save**
![S3 locaton for athena logs](https://i.imgur.com/hjp5oYY.png)

2. Play with the Athena console
- Go back to the **Editor** tab:
![Athena Editor](https://i.imgur.com/0PWhQD0.png)
- Write your SQL in the editor and select **Run**
- For example:
  - Select data from a specific day:
```
select *
from serverless_analytic.athena_user_events
where year = 2022 and month = 3 and day = 19
limit 10
```
![signin-user in a day](https://i.imgur.com/CV1V8jo.png)
  - Which pages are accessed by the most users (anonymous and authenticated)?
```
select page, count(*) as number_of_user
from serverless_analytic.athena_user_events
where year = 2022 and month = 3 and day = 19
group by page
limit 100
```
![](https://i.imgur.com/BzYCMjr.png)
  - Where are your users located?
```
select country, city, count(*) as number_of_user
from serverless_analytic.athena_user_events
where year = 2022 and month = 3 and day = 19
group by country, city
order by count(*)
limit 100
```
![user location](https://i.imgur.com/BAZXCgU.png)
- You can explore more queries by yourself.
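Athena queries can also be run programmatically. A minimal sketch with boto3, reusing one of the queries above and the query-logs location configured earlier (adjust the bucket name to your own):

```python
import time
import boto3

athena = boto3.client("athena", region_name="ap-southeast-1")

# Start a query and wait for it to finish.
query = athena.start_query_execution(
    QueryString=(
        "select page, count(*) as number_of_user "
        "from serverless_analytic.athena_user_events "
        "where year = 2022 and month = 3 and day = 19 group by page"
    ),
    ResultConfiguration={
        "OutputLocation": "s3://serverless-analytic-adhoc-your-unique-name/athena-query-logs/"
    },
)
execution_id = query["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```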
### 4.3.Dashboards with Quicksight

#### 4.3.1.Setup Quicksight

1. Set up a Quicksight account
- From the AWS Console, search for ```Quicksight``` and go to the main page.
![Main page - Quicksight](https://i.imgur.com/7V1DLvt.png)
- Choose **Sign up for QuickSight**
- Choose your edition; in this exercise, we are going to use ```Enterprise``` with the 30-day free trial. Then choose **Continue**
![Quicksight free trial](https://i.imgur.com/qdGpV4I.png)
- Fill in the form with the values below:
  - **Authentication method**: ```Use IAM federated identities & QuickSight-managed users```
  - **QuickSight region**: ```Asia Pacific (Singapore)```
  - In the **Account info** section:
    - **QuickSight account name**: ```quicksight-{youruniquename}```
    - **Notification email address**: ```your email```
![Account info](https://i.imgur.com/ytioin9.png)
  - In the **QuickSight access to AWS services** section:
    - Choose **IAM role**: ```Use QuickSight-managed role (default)```
    - In **Allow access and autodiscovery for these resources**, choose ```IAM```, ```Amazon S3``` (with your bucket location) and ```Amazon Athena```
![Quicksight-access](https://i.imgur.com/3yOrvYW.png)
![S3 location](https://i.imgur.com/UD9EH2X.png)
- Choose **Finish** to complete the setup process. You should see this when it has completed:
![Quicksight account](https://i.imgur.com/GvkbXfO.png)

2. Add an Athena dataset to Quicksight
- Choose **Datasets** from the left navigation bar. You should see some example datasets, but we will add another dataset from an Athena source.
![Quicksight Datasets](https://i.imgur.com/oGTS64B.png)
- Choose **New dataset**, then choose the **Athena** data source and add your **Data source name**. Click **Create data source**
![Athena Data source](https://i.imgur.com/0x1nTu5.png)
- On the next screen, fill in the information shown in the image below, then press **Select**
![Choose Athena table](https://i.imgur.com/ey6iuHL.png)
- Choose the option **Directly query your data** so data is refreshed directly from Athena, then press **Visualize**
![Not Choose SPICE](https://i.imgur.com/1DwQpzf.png)

#### 4.3.2.Create Dashboard in Quicksight

You will use the Athena user events data to create dashboards in Quicksight. Build whatever best fits your use cases; that's your challenge. Below are some example results:
- Event count by Pages and User Agent.
![Pages and UserAgent](https://i.imgur.com/LvO37tT.png)
- Event count by User Location.
![User Location](https://i.imgur.com/0liJWnJ.png)

## 5.Clean up resources

Finally, delete the resources created in this exercise:
- S3 buckets.
- DynamoDB tables.
- Kinesis Stream/Firehose/Data Analytics application.
- Cognito User/Identity Pools.
- Lambda functions.
- Glue ETL jobs, Glue Catalog tables and database.
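If you want to script part of the cleanup, a minimal sketch with boto3 is shown below; the resource names are the ones used in this guide (adjust the suffixes to your own). The remaining resources (Cognito pools, Firehose, the Kinesis Analytics application, Glue jobs and catalog, Quicksight) are easiest to remove from the console:

```python
import boto3

region = "ap-southeast-1"
suffix = "youruniquename"   # replace with your own suffix

# Empty and delete an S3 bucket (buckets must be empty before deletion).
s3 = boto3.resource("s3", region_name=region)
bucket = s3.Bucket(f"serverless-analytic-adhoc-{suffix}")
bucket.objects.all().delete()
bucket.delete()

# Delete the DynamoDB tables.
dynamodb = boto3.client("dynamodb", region_name=region)
for table in (f"MetricsTable-{suffix}", f"MetricsDetailTable-{suffix}"):
    dynamodb.delete_table(TableName=table)

# Delete the output Kinesis stream and the realtime-processing Lambda function.
kinesis = boto3.client("kinesis", region_name=region)
kinesis.delete_stream(StreamName=f"UserBehaviourAnalysis-Output-{suffix}")

lambda_client = boto3.client("lambda", region_name=region)
lambda_client.delete_function(FunctionName=f"ServerlessAnalytic-ProcessMetrics2DynamoDB-{suffix}")
```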