# Adelaide Tech Series - Big Data ETL

https://aws-tech-series-big-data-etl.splashthat.com/

## When

Wednesday, Sep 13, 12:45pm to 5pm

## About the Event

Join Ian Falconer and the AWS Adelaide technical team for a hands-on workshop on Big Data. Ian and the team will showcase big data patterns for ETL/ELT, data preparation and cleansing, and data processing at scale.

The presentation will showcase Data Lakes, Data Warehouses and Federated Data Factories from the business perspective, then the labs will focus on working with the data these systems consume. We're going to focus on data processing: the middle layer between data storage and data visualisation.

We'll explore several real-world patterns and architectures that customers use to work with data siloed in relational, file and other locations. We'll see how modern, serverless, zero-ETL approaches remove the undifferentiated heavy lifting of data cleansing and data fusion. Then we'll conclude with a proven pattern for cost-effective data manipulation at petabyte scale.

Most of the labs are automated in terms of building the data processing solution and loading the data; you will spend your time reverse engineering the patterns to understand how the data is processed. You don't need any specific coding or data science skills. For those interested in customising these patterns, or diving deeper, we'll have additional links and challenges for you to explore during the session or later in your own time.

# A Quick Refresh on Big Data

## What is Big Data?

Although the term big data has been supplanted by 'Data Something' (Data Lakes, Data Warehouses, Federated Data Lakes, Data Swamps, Lake Houses, etc.), the majority of work in this space is still all about Big Data. Even Gartner dropped the term Big Data from its future forecasts some years back.

So what is Big Data? It's really any data that is too big or too complicated to work with on a single computer, or in overused and very limiting tools like spreadsheet software. You can decide what counts as big data: size, complexity, intended use and so on are all highly variable.

An attempt at data definitions:

- https://en.wikipedia.org/wiki/Data_engineering
- https://en.wikipedia.org/wiki/Data_lake (I have a fondness for this link as it mentions 'data swamps'. They seem quite common, IMHO. Much of the AWS guidance in this space helps to avoid, or repair, those data swamps.)

There are some key big-picture metrics that are often missing: effort consumed per amount of data and speed to insight are key to managing Big Data at scale. (A toy illustration of these two metrics follows the reading list below.)

### Further Reading

- Big Data is such a varied topic that it is challenging to find broadly applicable advice. Start with the AWS whitepaper 'Big Data Analytics Options on AWS', which combines AWS benefits, Well-Architected guidance and a mapping to AWS services. https://docs.aws.amazon.com/pdfs/whitepapers/latest/big-data-analytics-options/big-data-analytics-options.pdf
- If you filter the AWS Whitepaper portal for 'Big Data and Analytics' you'll find 43 papers. https://aws.amazon.com/whitepapers
- If you filter the AWS Prescriptive Guidance portal for 'Big Data and Analytics' you'll find 65 articles. https://aws.amazon.com/prescriptive-guidance
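To make those two metrics concrete, here is a toy Python sketch comparing two hypothetical pipelines on effort per terabyte and time to insight. Every name and number below is invented purely for illustration.

```python
# Toy illustration of "effort per amount of data" and "speed to insight".
# All pipeline names and figures are hypothetical.
from dataclasses import dataclass

@dataclass
class PipelineRun:
    name: str
    data_tb: float           # data processed, in terabytes
    engineer_hours: float    # human effort to build and operate the run
    hours_to_insight: float  # elapsed time from raw data to a usable answer

    def effort_per_tb(self) -> float:
        """Engineer-hours consumed per terabyte processed (lower is better)."""
        return self.engineer_hours / self.data_tb

runs = [
    PipelineRun("hand-rolled scripts", data_tb=5.0, engineer_hours=40.0, hours_to_insight=72.0),
    PipelineRun("managed ETL service", data_tb=5.0, engineer_hours=6.0, hours_to_insight=8.0),
]

for run in runs:
    print(f"{run.name}: {run.effort_per_tb():.1f} engineer-hours/TB, "
          f"{run.hours_to_insight:.0f} hours to insight")
```

The absolute numbers matter less than tracking both metrics over time as your data grows.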
## Where Data Resides

There are many new and legacy terms in use that have both technical and marketing meanings:

- Data Lakes
- Data Warehouses
- Federated Data Lakes
- Data Swamps
- Lake Houses
- Object Storage (aka Amazon S3)
- HPC clusters and attached storage
- Backups and storage caching

But the basics never change. To paraphrase an old military quote (attributed to General Omar Bradley in WW2), with a spin on data: "amateurs obsess over their data system, but professionals make sense of data".

There are several technical concepts that remain valid when dealing with big data. They include:

- Storage and processing space required
- Data definitions
- Data structure and its impact on query efficiency
- Data quality and consistency
- Insights sought (summation, statistics, patterns, categorisation, bias, sensitivity of insights)
- Hardware resources required
- Applicable libraries, frameworks and tools

The benefit of data processing in AWS is that the past constraints of hardware are eliminated; now one can focus purely on dealing with the data. And the Pareto law remains valid: 80% of effort will go to data processing. In the following labs we're focused on that 80%: data processing. We'll whiteboard and discuss our approaches here. You can also check out the further reading links in each section.

# Labs

We're doing 3 labs from 3 different AWS Workshops today. You'll be accessing a different AWS account for each lab. The labs are:

- Proprietary database migration (Oracle or MS SQL Server)
- No-code, reusable data transformation using AWS Glue DataBrew
- Fully customisable, cost-efficient processing with no scaling limit using Amazon EMR

You'll need to log out of each account to access another account. You can access the labs at https://catalog.workshops.aws/ Your facilitator will share individual workshop login details during the session.

## Lab 1 - Data Silos

In this lab we migrate from one relational database to another. This would typically be on-prem to AWS, or from a proprietary database to an open-source or AWS-managed database. This lab highlights how AWS DMS makes database migrations simpler to execute than traditional on-prem migrations.

LabTitle

- AWS Database Migration Workshop / AWS DMS workshop for immersion days

Lablink

- https://catalog.us-east-1.prod.workshops.aws/workshops/77bdff4f-2d9e-4d68-99ba-248ea95b3aca/en-US/intro
- Choose one lab part only: either Oracle or MS SQL Server

Duration

- 1 hour

Prerequisites

- The EC2 key pair can be downloaded from the Workshop Studio console. Follow the instructions.
- I would recommend reading through the lab instructions of your chosen database migration (Oracle or MS SQL Server) first, then doing the lab.

### Further Reading

- AWS Database Migration Service FAQs https://aws.amazon.com/dms/faqs/
- AWS re:Post is always a good place to start with specific questions. https://repost.aws/tags/questions/TAloTfMMIZRtqwjRAie3cDVg?view=all&sort=votes
- AWS DMS documentation and migration advisory documents https://docs.aws.amazon.com/dms/

## Lab 2 - Serverless

This lab demonstrates the low-code / no-code approach using AWS Glue DataBrew. You may find it similar to using MS Excel (you can see the data and the changes you make in near real time), but loosely coupled from the data and with the ability to replay your ETL on multiple data sets (a sketch of that replay idea via the API follows below).

It's also worth looking at the data structure built by the 'setup' CloudFormation template. Think how you might use this approach to scaffold test data for testing and root-causing. When finished you can tear everything down.
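As a taste of what "replaying your ETL" looks like outside the console, here is a minimal, untested boto3 sketch that points an existing published DataBrew recipe at a second dataset and runs it. The dataset, recipe, bucket and role names are hypothetical placeholders, not values from the lab.

```python
# Sketch: replay an existing DataBrew recipe against a new dataset.
# All resource names below are hypothetical placeholders.
import boto3

databrew = boto3.client("databrew")

# Register a second data set stored in S3.
databrew.create_dataset(
    Name="sales-2023",
    Input={"S3InputDefinition": {"Bucket": "my-databrew-demo", "Key": "raw/sales-2023.csv"}},
)

# Point a new job at the same published recipe, so the identical
# transformation steps run against the new data.
databrew.create_recipe_job(
    Name="clean-sales-2023",
    DatasetName="sales-2023",
    RecipeReference={"Name": "sales-cleaning-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewJobRole",
    Outputs=[{"Location": {"Bucket": "my-databrew-demo", "Key": "clean/sales-2023/"}}],
)

# Kick off the run; DataBrew applies every recipe step to the new dataset.
run = databrew.start_job_run(Name="clean-sales-2023")
print(run["RunId"])
```

This is the loose coupling the lab demonstrates: the recipe holds the transformation logic, and any number of datasets can be fed through it without touching a spreadsheet.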
LabTitle

- AWS Glue DataBrew Immersion Day, Advanced Transform. NOTE: there is an optional QuickSight visualisation lab part.

Lablink

- https://catalog.us-east-1.prod.workshops.aws/workshops/6532bf37-3ad2-4844-bd26-d775a31ce1fa/en-US/60-advancetransform
- Advanced Transform lab only

Duration

- 1 hour

Prerequisites

- Create the data source and resources as per the 'How to Start / Self Paced Labs' section

### Further Reading

- AWS Glue FAQs https://aws.amazon.com/glue/faqs/

## Lab 3 - EMR

This lab demonstrates the scalability and flexibility of Amazon EMR (Elastic MapReduce) for processing data with any Hadoop framework at petabyte scale. We run a compute-intensive query that forces the environment to scale, and, best of all, we leverage spot compute to minimise scaling cost without impacting our ability to scale when needed. (A sketch of a managed scaling policy via the API follows the reading list at the end of this section.)

LabTitle

- ETL on Amazon EMR / EMR Managed Scaling

Lablink

- https://catalog.us-east-1.prod.workshops.aws/workshops/c86bd131-f6bf-4e8f-b798-58fd450d3c44/en-US/emr-managed-scaling
- EMR Managed Scaling lab only

Duration

- 1.5 hours

Prerequisites

- Set up your AWS Cloud9 IDE and create an EC2 key pair as per the Setup instructions.
- Complete the Cluster Creation steps and confirm that you can SSH into your EMR cluster.
- Then you'll jump straight to the EMR Managed Scaling section.
- A CloudWatch dashboard is built from the supplied AWS CloudFormation template.

### Changes to Lab Instructions

- I encountered a missing permission when running the first command on Hive. The error message is clear; I added the needed permission to the IAM role.

### Further Reading

- Amazon EMR FAQs https://aws.amazon.com/emr/faqs/
- The Hadoop ecosystem - a long tabular list of Hadoop tools and implementations https://hadoopecosystemtable.github.io/
- Hadoop in 5 cartoons https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
- AWS Big Data Blog - tag: EMR https://aws.amazon.com/blogs/big-data/tag/emr/page/2/
- AWS re:Invent 2020: Under the hood: How Amazon uses AWS for analytics at petabyte scale https://www.youtube.com/watch?v=XiiwXwiR0m8
- Amazon EMR Deep Dive and Best Practices https://www.youtube.com/watch?v=dU40df0Suoo
- Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile https://aws.amazon.com/blogs/big-data/turning-amazon-emr-into-a-massive-amazon-s3-processing-engine-with-campanile/
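For those who want to reproduce the lab's scaling behaviour programmatically, below is a minimal, untested boto3 sketch of an EMR managed scaling policy. The cluster ID and capacity numbers are hypothetical placeholders; the key idea is that capacity above the on-demand limit is provisioned as Spot, which is how scale-out stays cheap.

```python
# Sketch: attach an EMR managed scaling policy that scales out on Spot.
# Cluster ID and capacity figures are placeholders, not lab values.
import boto3

emr = boto3.client("emr")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder: your EMR cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,   # baseline instances kept running
            "MaximumCapacityUnits": 20,  # ceiling the cluster may scale to
            # Capacity above this on-demand limit is provisioned as Spot,
            # so a heavy query scales out at Spot prices.
            "MaximumOnDemandCapacityUnits": 2,
            "MaximumCoreCapacityUnits": 2,  # cap core nodes; scale task nodes only
        }
    },
)
```

Capping core capacity at the baseline means only task nodes scale in and out, so HDFS data held on core nodes is not disturbed if Spot capacity is reclaimed.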