###### tags: `AWS` `Glue` `Immersion Day`

# Amazon Glue Immersion Day - Tips and Tricks

This is my list of hints and tips for this course. It's markdown, so you can save it, access it or store it anywhere. I'll add specific answers to questions I get during the course and share it with everyone.

# Administrivia, Schedule and Planning

This immersion day will be delivered in a single day. The Glue Immersion Day workshop (https://catalog.us-east-1.prod.workshops.aws/event/dashboard/en-US) contains 10 labs. That is a significant body of work and not practical to squeeze into a single day.

Open your AWS account in a new tab via the link on the left-hand side:

![](https://i.imgur.com/ei27h9G.png)

The event will be delivered as follows:

| Lab   | Topics             | Duration | Start Time  |
|-------|--------------------|----------|-------------|
| One   | Intro to Glue      | 15       | 9:30am      |
| Two   | Glue Data Catalog  | 30       | 9:45am      |
| Three | Working with Spark | 45       | 10:15am     |
| -     | **Break**          | **15**   | **11:00am** |
| Four  | Glue ETL           | 45       | 11:45am     |
| Five  | Glue Studio        | 45       | 12:30pm     |
| -     | **Lunch**          | **45**   | **1:15pm**  |
| Six   | Glue DataBrew      | 45       | 2:00pm      |
| Seven | Monitor            | 45       | 2:45pm      |
| Eight | Orchestration      | 45       | 3:30pm      |
| Other | Glue DataBrew      | -        | EOB         |
| -     | **Drinks**         | -        | -           |

**Additional Labs**

- Glue DataBrew Immersion Day Topics: https://catalog.us-east-1.prod.workshops.aws/workshops/6532bf37-3ad2-4844-bd26-d775a31ce1fa/en-US/30-howtostart/cloudformation *(These can be run in the same event engine)*

Launch the CloudFormation template and create the stack. Once it completes, make sure you are using the US East 1 (N. Virginia) region in the top right of the console.

![](https://i.imgur.com/1ZHLbbf.png)

| Lab   | Topics                     | Duration | Time |
|-------|----------------------------|----------|------|
| One   | Profiling and Data Quality | 60       | *    |
| Two   | Standard Transform         | 45       | *    |
| Three | Advanced Transform         | 60       | *    |

### Facility

Chime Link

# Glue Immersion Day Labs

Each of you will get a new account; these will be shared on the day. An event hash will be shared with you, and you log in via Workshop Studio: https://catalog.us-east-1.prod.workshops.aws

## Lab Info

Make sure you complete the setup steps under: How to Start? > AWS Event > [Prepare S3 Bucket and Clone Files](https://catalog.us-east-1.prod.workshops.aws/event/dashboard/en-US/workshop/howtostart/awseevnt/s3-and-local-file)

**HINT**: You will be asked to copy/paste your ${BucketName} in most labs, so keep it handy in a notepad.

### SECTION 01: WORKING WITH GLUE DATA CATALOG

*Use AWS Glue to discover and store a metadata schema using the Glue Data Catalog and a Glue Crawler.*

### SECTION 02: WORKING WITH APACHE SPARK

*Demonstrate with code samples how to use standard PySpark and Glue-flavored PySpark to develop Glue ETL (extract, transform, load) code and use 3rd party Python libraries in Glue.* (A minimal sketch of this pattern appears after the Section 04 notes below.)

**Notes**

Lab 02: Working with Apache Spark > Spark DataFrame: replace `dfcsv` with `df` in the notebook's Step 4.

### SECTION 03: WORKING WITH GLUE ETL

*Develop, package, and deploy regular Glue ETL jobs.*

### SECTION 04: WORKING WITH GLUE STUDIO

*How to use Glue Studio to create ETL and streaming ETL jobs. The lab focuses on how to create a custom transformation with a Glue script in a notebook development environment.*

**Notes**

Lab 05: Working with Glue Studio > Basic Transformations: the Source, Transform, Target buttons are now Source, Action, Target in the UI in Step 6.
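For reference in Sections 02 and 03, here is a minimal, hedged sketch of the Glue-flavored PySpark pattern the labs build on: read a Data Catalog table into a DynamicFrame, then convert it to a plain Spark DataFrame. The database and table names are placeholders, not the lab's exact catalog entries.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read a Data Catalog table (crawled in Section 01) into a Glue DynamicFrame.
# "your_database" and "your_table" are placeholders.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_table",
)

# Convert to a standard Spark DataFrame for plain PySpark transformations
df = dyf.toDF()
df.printSchema()
```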
**Optional: ML with Glue & Glue Studio**

This lab covers the steps required to work with Glue ML Transforms. More specifically, it teaches you how to create and train (with labeling files) Glue's FindMatches ML Transform, and how to write Glue ETL code that leverages it. For simplicity, you will use AWS Glue Studio Notebooks to test, validate, and operationalize the code that includes the ML transform you created and trained during this lab into a Glue job.

### SECTION 05: WORKING WITH GLUE DATABREW

*How to use Glue DataBrew, a visual data preparation tool, to create a DataBrew dataset and dataset profile. Work inside a DataBrew project, create a DataBrew recipe, manage DataBrew recipes, and execute DataBrew jobs.*

### SECTION 07: MONITORING AND TROUBLESHOOTING

*Learn how to monitor and troubleshoot AWS Glue ETL jobs.*

**Notes**

Lab 07: Monitoring, Troubleshooting and Scaling > Troubleshoot Job Using Log. You can't edit the 'glueworkshop-lab3-etl-job' config fields in the UI because they are controlled by the notebook configuration. You must define the configuration using the `%%configure` magic at the top of the notebook. To enable continuous logging in the `%%configure` magic:

```
%%configure
{
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://${BUCKETNAME}/output/lab3/sparklog/",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-continuous-log-filter": "false",
    "--continuous-log-logGroup": "glueworkshop-lab3-etl-ddb-job",
    "max_retries": "0"
}
```

Or configure the argument list for the AWS CLI (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html):

```
$ aws glue start-job-run --job-name "CSV to CSV" --arguments='{"--enable-continuous-cloudwatch-log":"true"}'
```

(A boto3 equivalent is sketched after the Section 08 notes below.)

Make sure your IAM role has the CloudWatch Logs permissions GetLogEvents, PutLogEvents, CreateLogStream, DescribeLogStreams, and CreateLogGroup.

Lab 07: Monitoring, Troubleshooting and Scaling > Enabling Spark UI and Monitoring. Launch the Spark history server CloudFormation stack for Glue 3.0 in your region (https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html#monitor-spark-ui-history-cfn): [AP-SOUTHEAST-2](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/new?templateURL=https%3A%2F%2Faws-glue-sparkui-prod-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/public/cfn/glue-3_0/sparkui.yaml&stackName=spark-ui-glue3) (about 20 minutes). **Issues**: there are currently issues with the CFN deployment; will review.

Lab 07: Monitoring, Troubleshooting and Scaling > Job Status Notification. Complete the commands in your Cloud9 terminal. Make sure you replace the SNS ARN details with the ones you create in Step 1.

### SECTION 08: GLUE JOB ORCHESTRATION

*Learn how to orchestrate AWS Glue jobs using Glue Workflows and Step Functions.*

**Notes**: Focus on using Step Functions to create modern serverless workflows between Glue and other resources.
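As referenced in the Section 07 notes above, here is a hedged boto3 equivalent of the CLI call for starting a job with continuous logging enabled. It assumes your credentials and region are already configured and reuses the lab's example job and log group names; adjust them for your own job.

```
import boto3

glue = boto3.client("glue")

# Start a job run with continuous CloudWatch logging enabled
response = glue.start_job_run(
    JobName="glueworkshop-lab3-etl-job",
    Arguments={
        "--enable-continuous-cloudwatch-log": "true",
        "--enable-continuous-log-filter": "false",
        "--continuous-log-logGroup": "glueworkshop-lab3-etl-ddb-job",
    },
)
print(response["JobRunId"])
```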
#### Glue DataBrew

With AWS Glue DataBrew, you can explore and experiment with data directly. You can choose from over 250 prebuilt transformations in DataBrew to automate data preparation tasks such as filtering anomalies, standardizing formats, and correcting invalid values.

## Data Integration Services

While this event is themed around AWS Glue, let's ensure we have a broad view of relevant architectures. AWS services don't constrain you to picking 'one size fits all'.

## AWS Glue versus EMR?

The question arose: 'Glue versus EMR?' Without context on your use cases, the following may help answer it.

AWS Glue and EMR are both capable of enabling ETL processes and workflows. However, there are some fundamental differences in the way the two services operate.

### SERVERLESS VS. MANAGED SERVICES

AWS Glue is a serverless data integration platform that handles the infrastructure, configuration, and setup for you. It can work with structured and semi-structured data formats and automatically infer schemas. Under the hood, AWS Glue uses Apache Spark as its processing engine.

Amazon EMR is a managed service for self-configured clusters running on Amazon EC2 instances (EMR also offers a dedicated serverless option). EMR supports Apache Hadoop ecosystem components such as Spark, Hive, HBase and Presto, and integrates with Amazon Athena, Amazon Redshift, and other big data analytics services.

### Can I just use EMR Serverless, is that like Glue?

EMR Serverless removes the need to provision compute capacity for an EMR cluster. You simply submit jobs to a serverless application, and EMR Serverless figures out the capacity needed to run them.

**Value Proposition** - You don't have to figure out how many workers you need; they are provisioned as needed. This removes the need to size provisioned capacity.

**Gotcha** - You still need to manage the lifecycle of your serverless application with commands, e.g. create, start, stop, and delete; this isn't automatic. With EMR Serverless you still need to delete your application when you are finished with it or you may continue to be charged; you just don't need to manage capacity. (A hedged boto3 sketch of this lifecycle follows the summary below.)

AWS Glue offers user interfaces, Glue Studio and Glue DataBrew, for regular ETL and data preparation workflows.

### SUMMARY

AWS Glue and Amazon EMR are similar platforms differentiated by their simplicity and flexibility. AWS Glue is a quick, low-effort way to execute ETL jobs in the cloud. EMR is a more robust, feature-rich big data processing solution that enables ETL alongside real-time data streaming for ML workloads using existing infrastructure. EMR's flexibility comes with a management burden that can become costly if you do not terminate your long-running clusters when idle.
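To ground the EMR Serverless lifecycle gotcha above, here is a minimal, hedged boto3 sketch of that create/start/stop/delete flow. The application name and release label are assumptions for illustration; check the EMR Serverless release labels available in your region.

```
import boto3

emr = boto3.client("emr-serverless")

# Create a Spark application (name and releaseLabel are illustrative)
app = emr.create_application(
    name="demo-spark-app",
    releaseLabel="emr-6.9.0",
    type="SPARK",
)
app_id = app["applicationId"]

# Start it before submitting jobs, and stop it when idle
emr.start_application(applicationId=app_id)
emr.stop_application(applicationId=app_id)

# Delete it when you are finished (the application must be stopped first)
emr.delete_application(applicationId=app_id)
```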
# Glue Links

* Getting started: https://docs.aws.amazon.com/glue/latest/dg/setting-up.html
* Glue Best Practices: https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/best-practices.html
* Cost savings with Glue Flex execution: https://aws.amazon.com/blogs/big-data/introducing-aws-glue-flex-jobs-cost-savings-on-etl-workloads/
* Official AWS Glue tutorials: https://docs.aws.amazon.com/glue/latest/ug/tutorial-create-job.html
* **NEW** Introducing AWS Glue for Ray: Scaling your data integration workloads using Python https://aws.amazon.com/blogs/big-data/introducing-aws-glue-for-ray-scaling-your-data-integration-workloads-using-python/
* re:Invent videos (Level 400): AWS re:Invent 2018: Building serverless analytics pipelines with AWS Glue (1:01:13) https://youtu.be/S_xeHvP7uMo
* Autoscaling AWS Glue jobs in Apache Spark: https://aws.amazon.com/blogs/big-data/introducing-aws-glue-auto-scaling-automatically-resize-serverless-computing-resources-for-lower-cost-with-optimized-apache-spark/

### Operational

* Working with Glue jobs locally: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
* Bring your own IDE - interactive sessions with Microsoft Visual Studio Code: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-vscode.html
* Python libraries for local development of Glue PySpark batch jobs: https://github.com/awslabs/aws-glue-libs
* Develop and test AWS Glue version 3.0 jobs locally using a Docker container: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/ (a minimal job skeleton is sketched after the ML Focus links below)
* Enabling the DataBrew extension for JupyterLab: https://docs.aws.amazon.com/databrew/latest/dg/jupyter-enabling-databrew.html

### Use Cases

* Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines: https://aws.amazon.com/blogs/big-data/build-test-and-deploy-etl-solutions-using-aws-glue-and-aws-cdk-based-ci-cd-pipelines/
* A serverless operational data lake for retail with AWS Glue, Amazon Kinesis Data Streams, Amazon DynamoDB, and Amazon QuickSight: https://aws.amazon.com/blogs/big-data/a-serverless-operational-data-lake-for-retail-with-aws-glue-amazon-kinesis-data-streams-amazon-dynamodb-and-amazon-quicksight/

### ML Focus

* Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook: https://aws.amazon.com/blogs/big-data/preparing-data-for-ml-models-using-aws-glue-databrew-in-a-jupyter-notebook/
* Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue: https://aws.amazon.com/blogs/machine-learning/moving-from-notebooks-to-automated-ml-pipelines-using-amazon-sagemaker-and-aws-glue/
* Prepare data at scale in Amazon SageMaker Studio using serverless AWS Glue interactive sessions: https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/
* Large-scale feature engineering with sensitive data protection using AWS Glue interactive sessions and Amazon SageMaker Studio: https://aws.amazon.com/blogs/machine-learning/large-scale-feature-engineering-with-sensitive-data-protection-using-aws-glue-interactive-sessions-and-amazon-sagemaker-studio/
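As referenced in the Operational links above, here is a hedged, minimal Glue job skeleton of the kind you can develop and test locally with the aws-glue-libs Docker image and then submit as a regular Glue job. The S3 paths are placeholders; substitute your own ${BucketName}.

```
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and initialise the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read CSV data from S3 into a DynamicFrame (path is a placeholder)
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://YOUR_BUCKET/input/lab/csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write it back out to S3 as Parquet (path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://YOUR_BUCKET/output/lab/parquet/"},
    format="parquet",
)

job.commit()
```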
# Self Paced Learning (Free)

### Skill Builder ###

#### Data Analytics Fundamentals (3:30:00) ####
https://explore.skillbuilder.aws/learn/course/external/view/elearning/44/data-analytics-fundamentals

*This course takes you through five key factors that indicate the need for specific AWS services in collecting, processing, analyzing, and presenting your data. This includes learning basic architectures, value propositions, and potential use cases.*

#### AWS Cloud Quest: Cloud Practitioner ####
https://explore.skillbuilder.aws/learn/course/external/view/elearning/11458/aws-cloud-quest-cloud-practitioner

*AWS Cloud Quest: Cloud Practitioner is a role-playing learning game that helps you develop practical cloud skills through interactive learning and hands-on activities using AWS services.*

### AWS Educate ###
https://www.awseducate.com/

#### Introduction to Cloud 101 (8:00:00) ####

*The goal of this course is to set a foundation for cloud knowledge and help you decide where to take your learning next.*

#### Machine Learning Foundations (3:00:00) ####

*This course was designed to introduce you to the fundamental concepts of machine learning and the machine learning pipeline. By the end of this course you will be able to discuss machine learning and how to apply the machine learning pipeline to solve a business problem.*

#### Data Science (35:00:00) ####

*Advanced role - this pathway introduces the skills needed to be a Data Scientist.*

# AWS re:Invent 2022 Sessions of Interest

https://reinvent.awsevents.com/?trk=live.awsevents.com

**[How Disney used AWS Glue as a data integration and ETL framework [ANT335]](https://www.youtube.com/watch?v=ce6t3FqB_Z4)**

Disney Parks, Experiences and Products is one of the world’s leading providers of family travel and leisure experiences. Disney Parks, Experiences and Products uses AWS Glue—a serverless data integration service—as a key component to replace thousands of Apache Hadoop, Spark, and Sqoop jobs. In this session, Disney and AWS Glue experts discuss some ways to scale AWS Glue beyond the traditional setup and how they configure AWS Glue for job monitoring and performance.

**[Reinvent your business with an AWS modern data strategy [PEX308]](https://www.youtube.com/watch?v=zA3s4ZM6CWo)**

AWS can help organizations quickly get answers from data by providing mature and integrated analytics services ranging from cloud data warehouses to serverless data lakes. Join this session to learn about AWS modern data architecture and how AWS analytics services can help your customers navigate data challenges, optimize analytics processes, and deliver insights to their businesses faster. This session also includes customer case studies and a demo of a modern data platform using AWS analytics serverless services with real-world streaming data. This session is intended for AWS Partners.

**Simplifying ETL migration and data integration with AWS Glue [ANT322]**

Organizations are modernizing their data stacks with AWS. This chalk talk reviews how AWS Glue makes it easy to migrate your data integration and ETL workloads to the cloud using a serverless architecture that lets you focus on your data.
See demos and a deep dive into some of the methods AWS Glue provides for migration.

## re:Invent 2022 Announcements

Introducing AWS Glue for Ray: Scaling your data integration workloads using Python
https://aws.amazon.com/blogs/big-data/introducing-aws-glue-for-ray-scaling-your-data-integration-workloads-using-python/

# Questions

**How do I use external Python libraries in my AWS Glue 2.0 ETL jobs?**
https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/

**What modules are included with PySpark?**
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

**Can I use MLlib with AWS Glue?**

MLlib ships with the Spark runtime that AWS Glue provides (for example, Spark 2.4 on Glue 2.0), so there is no need to add extra libraries to use MLlib within AWS Glue. After any pre-processing and transforms in your ETL, you can train a model on your data. Access MLlib in AWS Glue via standard import statements, for example:

```
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
```

# Survey