tags: AWS Glue Immersion Day

AWS Glue Immersion Day - Tips and Tricks

This is my list of hints and tips for this course. It's markdown so you can save it, access it or store it anywhere. I'll add specific answers to questions I get during the course. I'll share it with everyone.

Administrivia, Schedule and Planning

This immersion day will be delivered in a single day.

The Glue Immersion Day workshop (https://catalog.us-east-1.prod.workshops.aws/event/dashboard/en-US) contains 10 labs. This is a significant body of work, and it is not practical to squeeze all of it into a single day.

Open your AWS account in a new tab via the link on the left-hand side.


The event will be delivered as follows:

| Lab | Topic | Duration (min) | Start time |
|---|---|---|---|
| One | Intro to Glue | 15 | 9:30am |
| Two | Glue Data Catalog | 30 | 9:45am |
| Three | Working with Spark | 45 | 10:15am |
| - | Break | 15 | 11:00am |
| Four | Glue ETL | 45 | 11:45am |
| Five | Glue Studio | 45 | 12:30pm |
| - | Lunch | 45 | 1:15pm |
| Six | Glue DataBrew | 45 | 2:00pm |
| Seven | Monitor | 45 | 2:45pm |
| Eight | Orchestration | 45 | 3:30pm |
| Other | Glue DataBrew | - | EOB |
| - | Drinks | - | - |

Additional Labs - Glue DataBrew Immersion Day Topics:
https://catalog.us-east-1.prod.workshops.aws/workshops/6532bf37-3ad2-4844-bd26-d775a31ce1fa/en-US/30-howtostart/cloudformation
(These can be run in the same event engine)

Launch the CloudFormation stack and choose Create stack. Once it has completed, make sure you are using the US East (N. Virginia) region, us-east-1, in the top right of the console.


| Lab | Topic | Duration (min) | Time |
|---|---|---|---|
| One | Profiling and Data Quality | 60 | * |
| Two | Standard Transform | 45 | * |
| Three | Advanced Transform | 60 | * |

Facility

Chime Link

Glue Immersion Day Labs

Each of you will get a new account; the account details will be shared on the day. An event hash will be shared with you so that you can log in via Workshop Studio:
https://catalog.us-east-1.prod.workshops.aws

Lab Info

Make sure you complete the setup steps under:

How to Start? > AWS Event > Prepare S3 Bucket and Clone Files

HINT: You will be asked to copy/paste your ${BucketName} in most labs so keep it handy in a notepad.

SECTION 01: WORKING WITH GLUE DATA CATALOG

Use AWS Glue to discover and store a metadata schema using the Glue Data Catalog and a Glue Crawler.

SECTION 02: WORKING WITH APACHE SPARK

Demonstrate with code samples how to use standard PySpark and Glue-flavored PySpark to develop Glue ETL (extract, transform, load) code and use 3rd party Python libraries in Glue.
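
As a rough illustration of the two flavours, the sketch below reads a Data Catalog table as a Glue DynamicFrame and converts it to a standard Spark DataFrame. The helper name and the database and table names in the usage comment are hypothetical placeholders, not names taken from the labs. The commented setup only works inside a Glue job or notebook, where the `awsglue` library is available.

```python
def read_catalog_table_as_dataframe(glue_context, database, table_name):
    """Read a Data Catalog table with Glue, then hand it to standard PySpark."""
    # Glue-flavored PySpark: a DynamicFrame built from a Data Catalog table.
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    # Standard PySpark: toDF() returns an ordinary Spark DataFrame.
    return dynamic_frame.toDF()


# Inside a Glue job or notebook (placeholder names, not runnable locally):
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
# glue_context = GlueContext(SparkContext.getOrCreate())
# df = read_catalog_table_as_dataframe(glue_context, "my_database", "my_table")
# df.printSchema()
```

To go the other way, `DynamicFrame.fromDF(df, glue_context, "name")` from `awsglue.dynamicframe` wraps a DataFrame back into a DynamicFrame so that Glue writers and transforms can consume it.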

Notes
Lab 02: Working with Apache Spark > Spark Dataframe.
Replace ‘dfcsv’ with ‘df’ in Step 4 of the notebook.

SECTION 03: WORKING WITH GLUE ETL

Develop, package, and deploy regular Glue ETL jobs.

SECTION 04: WORKING WITH GLUE STUDIO

How to use Glue Studio to create ETL and streaming ETL jobs. The lab focuses on how to create a custom transformation with a Glue script in a notebook development environment.

Notes
Lab 05: Working with Glue Studio > Basic Transformations.
In Step 6, the Source, Transform, Target buttons are now Source, Action, Target in the UI.

Optional: ML with Glue & Glue Studio
This lab covers the steps required to work with Glue ML Transforms. Specifically, it teaches you how to create and train (with labeling files) Glue's FindMatches ML Transform, and how to write Glue ETL code that uses it. For simplicity, you will use AWS Glue Studio Notebooks to test and validate the code that includes the ML transform you created and trained during this lab, and then operationalize it as a Glue job.

SECTION 05: WORKING WITH GLUE DATABREW

How to use Glue DataBrew, a visual data preparation tool, to create a DataBrew dataset and dataset profile. Work inside a DataBrew project to create a DataBrew recipe, manage DataBrew recipes, and execute DataBrew jobs.

SECTION 07: MONITORING AND TROUBLESHOOTING

Learn how to monitor and troubleshoot AWS Glue ETL jobs.

Notes
Lab 07: Monitoring, Troubleshooting and Scaling > Troubleshoot Job Using Log.
You can't edit the 'glueworkshop-lab3-etl-job' configuration fields in the UI because they are controlled by the notebook configuration. You must define the configuration using the %%configure magic at the top of the notebook. To enable continuous logging, use:

%%configure 
{
  "--enable-spark-ui": "true",
  "--spark-event-logs-path": "s3://${BUCKETNAME}/output/lab3/sparklog/",
  "--enable-continuous-cloudwatch-log": "true",
  "--enable-continuous-log-filter": "false",
  "--continuous-log-logGroup": "glueworkshop-lab3-etl-ddb-job",
  "max_retries": "0"         
}

Alternatively, pass the argument list via the AWS CLI:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

$ aws glue start-job-run --job-name "CSV to CSV" --arguments '{"--enable-continuous-cloudwatch-log": "true"}'
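
The same job argument can also be passed from Python. The sketch below uses the boto3 Glue client's `start_job_run` call; the helper name is hypothetical, and the optional `glue_client` parameter exists only so the function can be exercised with a stub instead of real AWS credentials.

```python
def start_job_with_continuous_logging(job_name, glue_client=None):
    """Start a Glue job run with continuous CloudWatch logging enabled."""
    if glue_client is None:
        # Imported lazily so the function can be tried out with a stub client.
        import boto3

        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(
        JobName=job_name,
        # Job arguments are a string-to-string map; keys keep the "--" prefix.
        Arguments={"--enable-continuous-cloudwatch-log": "true"},
    )
    return response["JobRunId"]


# Example (requires AWS credentials and an existing job):
# run_id = start_job_with_continuous_logging("CSV to CSV")
```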

Make sure your IAM role allows the following CloudWatch Logs actions: logs:GetLogEvents, logs:PutLogEvents, logs:CreateLogStream, logs:DescribeLogStreams, logs:CreateLogGroup.
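
As a minimal sketch, the permissions above could be expressed as an IAM policy document like the one built below. The `"*"` resource is a placeholder assumption to keep the example short; in practice you would narrow it to the log groups your job writes to.

```python
import json

# CloudWatch Logs actions needed for continuous logging, as listed above.
LOG_ACTIONS = [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:DescribeLogStreams",
    "logs:GetLogEvents",
    "logs:PutLogEvents",
]


def build_logging_policy(resource="*"):
    """Return an IAM policy document that allows the logging actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": LOG_ACTIONS,
                # Placeholder: restrict this to specific log group ARNs.
                "Resource": resource,
            }
        ],
    }


print(json.dumps(build_logging_policy(), indent=2))
```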

Lab 07: Monitoring, Troubleshooting and Scaling > Enabling Spark UI and Monitoring
Launch the Spark history server CloudFormation stack for Glue 3.0 in your region:
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html#monitor-spark-ui-history-cfn

AP-SOUTHEAST-2 (20 minutes)
There are currently issues with the CFN deployment; I will review.

Lab 07: Monitoring, Troubleshooting and Scaling > Job Status Notification
Complete the commands in your Cloud9 terminal. Make sure you replace the SNS ARN details with the ones you created in Step 1.
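
The lab's exact commands are not reproduced here. Assuming the notification is driven by an EventBridge rule that forwards Glue job state changes to your SNS topic, the event pattern and target would look roughly like the sketch below. The helper name, the target Id, and the ARN in the usage example are hypothetical placeholders.

```python
import json


def build_job_state_rule(job_name, sns_topic_arn, states=("FAILED", "TIMEOUT", "STOPPED")):
    """Build an EventBridge event pattern and target for Glue job status alerts."""
    event_pattern = {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"jobName": [job_name], "state": list(states)},
    }
    # The target forwards matching events to the SNS topic from Step 1.
    target = {"Id": "glue-job-status-sns", "Arn": sns_topic_arn}
    return event_pattern, target


pattern, target = build_job_state_rule(
    "glueworkshop-lab3-etl-job",
    "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder ARN
)
print(json.dumps(pattern, indent=2))
```

The pattern would be supplied as the rule's event pattern and the target attached to that rule; replace the placeholder ARN with your own topic's ARN.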

SECTION 08: GLUE JOB ORCHESTRATION

Learn how to orchestrate AWS Glue jobs using Glue Workflows and Step Functions.

Notes: Focus on using Step Functions to create modern serverless workflows between Glue and other resources.
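
To give a feel for what a Step Functions workflow around Glue looks like, here is a small sketch that generates an Amazon States Language definition running Glue jobs one after another. It uses the `glue:startJobRun.sync` service integration, which waits for each job to finish before moving on. The helper name and the job names are hypothetical, and job names are assumed to be unique.

```python
import json


def build_glue_state_machine(job_names):
    """Build a state machine definition that runs Glue jobs sequentially."""
    if not job_names:
        raise ValueError("at least one job name is required")

    states = {}
    for index, job_name in enumerate(job_names):
        state = {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the job run to end.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
        }
        if index + 1 < len(job_names):
            state["Next"] = f"Run {job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run {job_name}"] = state

    return {
        "Comment": "Run Glue jobs in sequence",
        "StartAt": f"Run {job_names[0]}",
        "States": states,
    }


definition = build_glue_state_machine(["example-extract-job", "example-load-job"])
print(json.dumps(definition, indent=2))
```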

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
A good place to start learning about AWS Glue is https://aws.amazon.com/glue/faqs/

Data Catalog

You can use the Data Catalog to quickly discover and search multiple AWS datasets without moving the data.

Glue Studio

AWS Glue Studio makes it easier to visually create, run, and monitor AWS Glue ETL jobs. You can build ETL jobs that move and transform data using a drag-and-drop editor, and AWS Glue automatically generates the code.

Glue DataBrew

With AWS Glue DataBrew, you can explore and experiment with data directly. You can choose from over 250 prebuilt transformations in DataBrew to automate data preparation tasks such as filtering anomalies, standardizing formats, and correcting invalid values.

Data Integration Services

While this event is themed around AWS Glue let's ensure we have a broad view of relevant architectures. AWS Services don't constrain you to picking 'one size fits all'.

AWS Glue versus EMR?

The question arose: 'Glue versus EMR?' Without context on specific use cases, the following may help to answer that question:

AWS Glue and EMR are both capable of enabling ETL processes and workflows. However, there are some fundamental differences in the way the two services operate.

SERVERLESS VS. MANAGED SERVICES

AWS Glue is a serverless data integration platform that handles the infrastructure, configuration options, and setup. It can work with structured and semi-structured data formats and automatically infer their schemas. AWS Glue uses Apache Spark as its underlying processing engine.

Amazon EMR is a managed service layered over a self-configured cluster running on Amazon EC2 instances. EMR also offers a dedicated serverless option. EMR supports Apache Hadoop ecosystem components like Spark, Hive, HBase and Presto, and works alongside Amazon Athena, Amazon Redshift, and other big data analytics solutions.

Can I just use EMR Serverless, is that like Glue?

EMR Serverless removes the need to provision compute capacity for an EMR cluster. You just submit jobs to your application, and EMR Serverless works out the capacity needed to run each job.

Value Proposition - You don't have to figure out how many workers you need; they are provisioned as needed.

Gotcha - You still need to manage the lifecycle of your serverless application with commands, e.g. Create, Start, Stop, Terminate. This isn't automatic. With EMR Serverless you still need to DELETE your app or you will be charged. You just don't need to manage capacity.
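
A minimal sketch of that lifecycle, assuming the boto3 `emr-serverless` client: create and start an application, run your work, then always stop and delete it. The function names and the `submit_jobs` callback are hypothetical; the callback is assumed to block until its job runs have finished, and error handling is kept deliberately thin.

```python
import time


def wait_for_state(emr_client, application_id, target_state, poll_seconds=10, max_polls=60):
    """Poll get_application until the application reaches target_state."""
    for _ in range(max_polls):
        response = emr_client.get_application(applicationId=application_id)
        if response["application"]["state"] == target_state:
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"{application_id} did not reach {target_state}")


def run_with_application(emr_client, name, release_label, submit_jobs, poll_seconds=10):
    """Create, start, use, stop and delete an EMR Serverless application."""
    created = emr_client.create_application(
        name=name, releaseLabel=release_label, type="SPARK"
    )
    application_id = created["applicationId"]
    try:
        emr_client.start_application(applicationId=application_id)
        wait_for_state(emr_client, application_id, "STARTED", poll_seconds)
        submit_jobs(application_id)  # hypothetical callback that runs your jobs
    finally:
        # Cleanup is not automatic: stop, wait, then delete the application.
        emr_client.stop_application(applicationId=application_id)
        wait_for_state(emr_client, application_id, "STOPPED", poll_seconds)
        emr_client.delete_application(applicationId=application_id)
    return application_id


# Example (requires AWS credentials; the release label is a placeholder):
# import boto3
# run_with_application(boto3.client("emr-serverless"), "demo", "emr-6.9.0", print)
```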

AWS Glue offers user interfaces, Glue Studio and Glue DataBrew, for regular ETL and data preparation workflows.

SUMMARY

AWS Glue and Amazon EMR are similar platforms differentiated by their simplicity and flexibility. AWS Glue is a quick, low-effort way to execute ETL jobs in the cloud. EMR is a more robust, feature-rich big data processing solution that enables ETL alongside real-time data streaming for ML workloads using existing infrastructure. EMR’s flexibility comes with a management burden that can become costly if you do not terminate your long running clusters when idle.

Glue Links

Operational

Use Cases

ML Focus

Self Paced Learning (Free)

Skill Builder

Data Analytics Fundamentals (3:30:00)

https://explore.skillbuilder.aws/learn/course/external/view/elearning/44/data-analytics-fundamentals

This course takes you through five key factors that indicate the need for specific AWS services in collecting, processing, analyzing, and presenting your data. This includes learning basic architectures, value propositions, and potential use cases.

AWS Cloud Quest: Cloud Practitioner

https://explore.skillbuilder.aws/learn/course/external/view/elearning/11458/aws-cloud-quest-cloud-practitioner

AWS Cloud Quest: Cloud Practitioner is a role-playing learning game that helps you develop practical cloud skills through interactive learning and hands-on activities using AWS services.

AWS Educate

https://www.awseducate.com/

Introduction to Cloud 101 (8:00:00)

The goal of this course is to set a foundation for cloud knowledge and help you decide where to take your learning next.

Machine Learning Foundations (3:00:00)

This course was designed to introduce you to the fundamental concepts of machine learning and the machine learning pipeline. By the end of this course you will be able to discuss machine learning and how to apply the machine learning pipeline to solve a business problem.

Data Science (35:00:00)

Advanced Role - This pathway introduces the skills needed to be a Data Scientist.

AWS re:Invent 2022 Session of interest

https://reinvent.awsevents.com/?trk=live.awsevents.com

How Disney used AWS Glue as a data integration and ETL framework [ANT335]
Disney Parks, Experiences and Products is one of the world’s leading providers of family travel and leisure experiences. Disney Parks, Experiences and Products uses AWS Glue—a serverless data integration service—as a key component to replace thousands of Apache Hadoop, Spark, and Sqoop jobs. In this session, Disney and AWS Glue experts discuss some ways to scale AWS Glue beyond the traditional setup and how they configure AWS Glue for job monitoring and performanc

Reinvent your business with an AWS modern data strategy [PEX308]
AWS can help organizations quickly get answers from data by providing mature and integrated analytics services ranging from cloud data warehouses to serverless data lakes. Join this session to learn about AWS modern data architecture and how AWS analytics services can help your customers navigate data challenges, optimize analytics processes, and deliver insights to their businesses faster. This session also includes customer case studies and a demo of a modern data platform using AWS analytics serverless services with real-world streaming data. This session is intended for AWS Partners.

Simplifying ETL migration and data integration with AWS Glue [ANT322]
Organizations are modernizing their data stacks with AWS. This chalk talk reviews how AWS Glue makes it easy to migrate your data integration and ETL workloads to the cloud using a serverless architecture that lets you focus on your data. See demos and a deep dive into some of the methods AWS Glue provides for migration

re:Invent 2022 Announcements

Introducing AWS Glue for Ray: Scaling your data integration workloads using Python
https://aws.amazon.com/blogs/big-data/introducing-aws-glue-for-ray-scaling-your-data-integration-workloads-using-python/

Questions

How do I use external Python libraries in my AWS Glue 2.0 ETL jobs?
https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/

What modules are included with PySpark?
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

Can I use MLlib with AWS Glue?
MLlib ships as part of Apache Spark (for example Spark 2.4, which AWS Glue 2.0 runs), so there is no need to add libraries to use MLlib within AWS Glue. After any pre-processing and transforms you've done in your ETL, you can train a model on your data. Access MLlib in AWS Glue via import statements, for example:

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
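
As a hedged sketch of how those imports might be used together, the functions below assemble feature columns, fit a decision tree regressor and report RMSE on a held-out split. The function names, the column names you pass in, and the 80/20 split are illustrative assumptions. The code needs an active Spark session, which a Glue job or notebook already provides.

```python
def build_regression_pipeline(feature_cols, label_col):
    """Assemble numeric feature columns and attach a decision tree regressor."""
    # PySpark is imported lazily; it is available in the Glue runtime.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    regressor = DecisionTreeRegressor(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, regressor])


def train_and_score(df, feature_cols, label_col, seed=42):
    """Fit the pipeline on 80% of df and return the model and RMSE on the rest."""
    from pyspark.ml.evaluation import RegressionEvaluator

    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    model = build_regression_pipeline(feature_cols, label_col).fit(train_df)
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    return model, evaluator.evaluate(predictions)
```

In a Glue job you would typically call `train_and_score` on a DataFrame produced by your ETL steps, for example the result of `toDF()` on a DynamicFrame.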

Survey