# Key AWS Services for Data Engineering

Here's a more detailed look at how some core AWS services are typically used in Data Engineering pipelines, based on the provided course material (`AWSCertifiedDataEngineerSlides.pdf`).

## Amazon S3 (Simple Storage Service)

* **Use:**
    * Often serves as the foundation for a **data lake**[cite: 39].
    * Stores vast amounts of raw data (structured, semi-structured, unstructured) in its native format cost-effectively[cite: 39].
    * Data Engineers utilize different S3 storage classes (Standard, IA, Glacier, Intelligent-Tiering) and lifecycle policies to manage cost and access frequency.
    * Replication (CRR/SRR) can be used for backup, lower-latency access, or log aggregation across regions and accounts[cite: 208].
* **Integration:**
    * Acts as a central hub: data can be ingested into S3, processed by services like EMR or Glue, queried directly by Athena or Redshift Spectrum, and results stored back in S3[cite: 39, 377, 525, 562].
    * S3 events can trigger processing workflows via Lambda or EventBridge.

## AWS Glue

* **Use:**
    * Acts as a serverless data integration service[cite: 523, 524].
    * The **Glue Data Catalog** is central; crawlers scan data sources (S3, RDS, etc.) to infer schemas and populate this catalog, making data easily discoverable and queryable[cite: 525].
    * Its core function is **ETL (Extract, Transform, Load)**[cite: 60, 524, 530]. Data Engineers use Glue to build ETL jobs (often using Spark via Python or Scala) to clean, transform, and enrich data, moving it between different systems (e.g., from S3 to Redshift).
    * Supports both batch and **streaming ETL**[cite: 543].
    * **Glue Studio** provides a visual interface for building pipelines[cite: 544].
    * Includes **Glue Data Quality** for defining and evaluating rules [cite: 545] and **Glue DataBrew** for visual data preparation[cite: 547].
    * **Job bookmarks** prevent reprocessing of already-seen data[cite: 539].
* **Integration:**
    * Tightly integrated with S3 (source/target, Data Catalog)[cite: 525, 531], Athena (uses the Data Catalog)[cite: 565], and Redshift Spectrum (uses the Data Catalog)[cite: 380, 525].
    * Glue Workflows orchestrate complex multi-step ETL processes[cite: 550].

## Amazon Kinesis

* **Use:**
    * Family of services for **real-time streaming data**[cite: 621].
    * **Kinesis Data Streams:** Ingests high-volume, real-time data (clickstreams, IoT data, logs) reliably, with ordering preserved within shards[cite: 621, 623].
    * **Kinesis Data Firehose:** Simplifies loading streaming data into destinations such as S3, Redshift, and OpenSearch, handling buffering, compression, and light transformations (via Lambda). It is near real-time and does not provide persistent storage[cite: 661].
    * **Kinesis Data Analytics (Managed Service for Apache Flink):** Real-time SQL- or Flink-based analysis and processing directly on streams.
* **Integration:**
    * Producers: custom apps (SDK/KPL), Kinesis Agent, AWS services (IoT, CloudWatch Logs)[cite: 627, 634].
    * Consumers: custom apps (KCL/SDK), Lambda, Firehose, Kinesis Data Analytics[cite: 635, 640].
    * Firehose often lands data in S3 or Redshift[cite: 656].

## Amazon Redshift

* **Use:**
    * Fully managed, petabyte-scale **data warehouse** optimized for complex analytical (OLAP) queries[cite: 36, 377].
    * Data Engineers load processed data into it (often via Glue ETL or `COPY` from S3)[cite: 388].
    * Requires designing schemas, distribution styles (EVEN, KEY, ALL), and sort keys for performance.
    * Used for business intelligence reporting and analytics dashboards[cite: 378].
    * **Redshift Spectrum** allows querying data directly in S3, without loading it, using the Glue Data Catalog[cite: 380].
* **Integration:**
    * Integrates with S3 (`COPY`/`UNLOAD`, Spectrum)[cite: 380, 388], Glue (Data Catalog for Spectrum)[cite: 525], DMS (as a target)[cite: 449], QuickSight (visualization)[cite: 732], and EMR[cite: 597].
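The S3 storage classes and lifecycle policies described earlier are typically expressed as lifecycle rules. A minimal sketch of such a rule, as a plain configuration dict; the bucket name, prefix, and day thresholds here are hypothetical:

```python
# Hypothetical lifecycle rule: move objects under raw/ to Standard-IA after
# 30 days, to Glacier after 90 days, and expire them after a year.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it requires boto3 and AWS credentials, e.g.:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_config,
# )
```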
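Populating the Glue Data Catalog, as described in the Glue section above, usually starts with a crawler. A sketch of the parameters boto3's `create_crawler` expects; the crawler name, role ARN, database, and S3 path are hypothetical:

```python
# Hypothetical crawler that scans an S3 prefix and writes inferred table
# schemas into a Glue Data Catalog database.
crawler_params = {
    "Name": "raw-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "data_lake_raw",  # catalog database to populate
    "Targets": {
        "S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]
    },
    # Run daily at 02:00 UTC so newly landed data is discovered automatically
    "Schedule": "cron(0 2 * * ? *)",
}

# Creating and running it requires boto3 and AWS credentials, e.g.:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```

Once the crawler has run, the resulting tables are immediately queryable from Athena or Redshift Spectrum via the shared Data Catalog.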
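The per-shard ordering guarantee of Kinesis Data Streams mentioned above comes from the partition key: records with the same key hash to the same shard. A minimal sketch of building a `PutRecords` batch; the device IDs and stream name are hypothetical:

```python
import json

def build_put_records(events):
    """Build a PutRecords request body; records sharing a PartitionKey
    hash to the same shard, so per-device ordering is preserved."""
    return [
        {
            "Data": json.dumps(evt).encode("utf-8"),
            "PartitionKey": evt["device_id"],  # same device -> same shard
        }
        for evt in events
    ]

events = [
    {"device_id": "sensor-1", "temp": 21.5},
    {"device_id": "sensor-2", "temp": 19.0},
    {"device_id": "sensor-1", "temp": 21.7},  # ordered after the first sensor-1 event
]
records = build_put_records(events)

# Sending requires boto3 and AWS credentials, e.g.:
# import boto3
# boto3.client("kinesis").put_records(
#     StreamName="iot-telemetry",  # hypothetical stream name
#     Records=records,
# )
```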
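The `COPY`-from-S3 loading pattern noted in the Redshift section can be sketched as a small statement builder. The table name, S3 prefix, and IAM role ARN below are hypothetical; `FORMAT AS PARQUET` assumes the curated data was written as Parquet:

```python
def build_copy_statement(table, s3_uri, iam_role):
    """Compose a Redshift COPY statement that bulk-loads Parquet files
    from an S3 prefix into a target table."""
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET;"
    )

sql = build_copy_statement(
    "analytics.page_views",                     # hypothetical target table
    "s3://my-data-lake-bucket/curated/views/",  # hypothetical S3 prefix
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# The statement would then be run against the cluster, e.g. via the
# Redshift Data API or a SQL client.
```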
## Amazon EMR (Elastic MapReduce)

* **Use:**
    * Managed service for running big data frameworks such as **Apache Spark**, Hadoop, Hive, Presto, and Flink on scalable EC2 clusters[cite: 601].
    * Used for large-scale batch processing, complex ETL, data transformation, and machine learning tasks that require specific frameworks[cite: 601].
    * Can process data from S3 (via **EMRFS**) or from HDFS (ephemeral storage on the cluster)[cite: 606].
* **Integration:**
    * Reads from/writes to S3[cite: 605, 606]; integrates with the Glue Data Catalog (as a Hive metastore)[cite: 529, 605], DynamoDB, and Kinesis[cite: 596, 605]; and can be orchestrated by Data Pipeline or Step Functions[cite: 605, 794].

## Amazon Athena

* **Use:**
    * **Serverless, interactive query service** for analyzing data directly in **Amazon S3** using standard SQL[cite: 562].
    * Used for ad-hoc querying, data exploration, and quick analysis of logs (CloudTrail, ELB, etc.) without setting up clusters or loading data[cite: 564, 570].
* **Integration:**
    * Relies on the **AWS Glue Data Catalog** for schemas[cite: 565].
    * Serves as a source for QuickSight[cite: 564] and integrates with Lake Formation[cite: 553].
    * Federated queries allow querying other sources (RDS, DynamoDB) via Lambda connectors[cite: 589].

## AWS Lambda

* **Use:**
    * Serverless compute for **event-driven processing**[cite: 469, 475].
    * Used to trigger transformations on new S3 data[cite: 236, 478], process records from Kinesis or DynamoDB Streams[cite: 329, 371, 640], perform light ETL, or orchestrate pipeline steps (via Step Functions)[cite: 786].
* **Integration:**
    * Triggered by S3 events, SNS, SQS, Kinesis, DynamoDB Streams, API Gateway, EventBridge, etc.[cite: 477]
    * Can interact with almost any AWS service via the SDK[cite: 998].

## AWS Database Migration Service (DMS) & Schema Conversion Tool (SCT)

* **Use:**
    * **DMS:** Migrates databases to AWS, supporting both **homogeneous** and **heterogeneous** migrations while the source remains online.
      Supports continuous data replication (CDC)[cite: 448, 451].
    * **SCT:** Used alongside DMS for heterogeneous migrations, converting the source schema and code objects to be compatible with the target database[cite: 450].
* **Integration:**
    * Moves data from various databases to AWS targets such as RDS, Aurora, Redshift, DynamoDB, and S3[cite: 449].
    * Requires an EC2 replication instance[cite: 448].
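The ad-hoc Athena querying described above boils down to submitting SQL plus an execution context and a results location. A sketch of the parameters boto3's `start_query_execution` takes; the query, database, and output bucket are hypothetical:

```python
# Hypothetical ad-hoc log analysis query against a table registered in the
# Glue Data Catalog; Athena writes results to the given S3 location.
query_params = {
    "QueryString": (
        "SELECT elb_status_code, COUNT(*) AS hits "
        "FROM elb_logs GROUP BY elb_status_code"
    ),
    "QueryExecutionContext": {"Database": "weblogs"},  # Data Catalog database
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results-bucket/"  # hypothetical bucket
    },
}

# Running it requires boto3 and AWS credentials, e.g.:
# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**query_params)["QueryExecutionId"]
# (then poll get_query_execution until the state is SUCCEEDED)
```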
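The S3-triggered Lambda pattern from the Lambda section can be sketched as a minimal handler. The event shape follows S3's event notification format (abridged); the bucket and key values are hypothetical, and the actual transformation is left as a comment:

```python
import urllib.parse

def handler(event, context):
    """Minimal Lambda handler for an S3 ObjectCreated trigger: extract the
    bucket and key of each newly created object for downstream processing."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications arrive URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append((bucket, key))
        # A real pipeline would fetch the object here (e.g. with boto3),
        # transform it, and write the result back to S3 or onward.
    return {"processed": processed}

# Abridged sample S3 event for local testing:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-data-lake-bucket"},
                "object": {"key": "raw/2024/01/log%3A01.json"}}}
    ]
}
result = handler(sample_event, None)
```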