## Highly Detailed AWS Data Engineer Interview Study Guide + Comparisons (Based on Provided Slides)

### I. Data Engineering Fundamentals (Pages 12-75)

* **Types of Data (Pages 13-16):**
  * **Structured:** Organized in a defined manner/schema (rows/columns). Easily queryable. *Examples: Relational DB tables, consistent CSVs, Excel.* (Page 14)
  * **Unstructured:** No predefined structure. Not easily queryable without preprocessing. *Examples: Text files without fixed format, videos, audio, images, emails, Word docs.* (Page 15)
  * **Semi-Structured:** Some organization via tags, hierarchies, or patterns. More flexible than structured. *Examples: XML, JSON, email headers (a mix of structured fields like date/subject and an unstructured body), log files with varied formats.* (Page 16)
* **Properties of Data (The 3 Vs) (Pages 17-20):**
  * **Volume:** The amount/size of data (GB -> PB -> EB). Challenges storage, processing, and analysis. *Examples: Social media data (TBs daily), retail transaction history (PBs).* (Page 18)
  * **Velocity:** The speed at which data is generated, collected, and processed. High velocity requires real-time/near-real-time capabilities. *Examples: IoT sensor data streaming every millisecond, high-frequency trading.* (Page 19)
  * **Variety:** Different types, structures, and sources of data (structured, semi-structured, unstructured).
    *Examples: Business analyzing relational DBs, emails, and JSON logs.* (Page 20)
* **Data Storage Paradigms (Pages 21-28):**

| **Feature** | **Data Warehouse (DW)** | **Data Lake (DL)** | **Lakehouse** | **Data Mesh** |
| :-- | :-- | :-- | :-- | :-- |
| **Definition** | Centralized repo, optimized for analysis, structured format | Stores vast raw data (all types) in native format | Hybrid combining the best of DW & DL | Organizational/governance paradigm |
| **Primary Use** | BI, analytics, reporting | Raw data storage, ML, discovery, exploration | Unified analytics & ML | Domain-based data ownership/products |
| **Data Type** | Structured | All (structured, semi-, unstructured) | All | All (managed as products) |
| **Schema** | Schema-on-Write (defined before load) | Schema-on-Read (defined at query time) | Both | Defined by domain/product |
| **Processing** | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) | Both | Defined by domain/product |
| **Agility** | Less agile (rigid schema) | More agile (stores raw data) | Combines agility & structure | High (via domain ownership) |
| **Cost** | Optimized for query performance (costly) | Cheap storage, processing cost varies | Aims for balance | Tooling/governance overhead |
| **Key Tech/Ex** | Redshift, BigQuery, Snowflake | S3 + Glue + Athena, ADLS, HDFS | Lake Formation + Redshift Spectrum, Delta Lake | Organizational structure + tech (DW/DL/etc.) |
| **When to Use** | Known structured sources, BI needs | Diverse data types, scalability, ML | Need DW features + DL flexibility | Large orgs needing decentralized data ownership |
| **Slide Pages** | 22-23, 25-26 | 24-26 | 27 | 28 |

* **ETL / ELT (Pages 29-32):**
  * **ETL Process:**
    * *Extract (Page 29):* Retrieve raw data from sources (DBs, CRMs, flat files, APIs) and ensure integrity. Real-time or batch.
    * *Transform (Page 30):* Convert data to the target format. Operations: data cleansing (duplicates, errors), enrichment (adding data), format changes (dates, strings), aggregations/computations, encoding/decoding, handling missing values.
    * *Load (Page 31):* Move transformed data into the target (DW). Batch or streaming. Ensure integrity.
  * **ELT Process:** Extract -> Load (raw into DL) -> Transform (as needed). Associated with data lakes.
  * **Pipeline Management (Page 32):** Needs reliable automation/orchestration. *Tools: AWS Glue, AWS Step Functions, Amazon MWAA, AWS Lambda, Glue Workflows.*
* **Data Sources & Connectivity (Page 33):**
  * **JDBC:** Java standard. Platform-independent, Java-dependent.
  * **ODBC:** Open standard. Platform-dependent (via drivers), language-independent.
  * Other sources: raw logs, APIs, streams.
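The Extract -> Transform -> Load flow described above can be sketched with nothing but the standard library. This is a minimal illustration, not a production pipeline: an in-memory CSV stands in for a flat-file source, and a SQLite table stands in for the warehouse target (in AWS, Glue or Lambda would typically run the transform).

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (in-memory CSV standing in for a flat file or API export)
raw = "id,amount,date\n1,10.5,2024-01-02\n1,10.5,2024-01-02\n2,,2024-01-03\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: deduplicate, handle missing values, and cast types (the cleansing step)
seen, clean = set(), []
for r in rows:
    key = (r["id"], r["date"])
    if key in seen:
        continue                                          # drop duplicate records
    seen.add(key)
    amount = float(r["amount"]) if r["amount"] else 0.0   # fill missing values
    clean.append((int(r["id"]), amount, r["date"]))

# Load: write the transformed rows into the target store (SQLite standing in for a DW table)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL, sale_date TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())  # -> (2, 10.5)
```

An ELT variant would simply swap the middle and last steps: load the raw rows first, then run the cleansing SQL inside the target system.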
* **Common Data Formats (Pages 34-38):**

| **Format** | **Type** | **Structure** | **Schema** | **Compression** | **Use Cases** | **Key Systems** | **Page** |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| **CSV** | Text | Tabular (rows) | Implicit | Text-based | Small/medium data, interchange, spreadsheets, human-readable | Databases, Excel, Pandas, R, ETL tools | 35 |
| **JSON** | Text | Key-value pairs | Flexible | Text-based | Web APIs, config files, NoSQL DBs, nested/flexible data | Web browsers, APIs, NoSQL DBs (MongoDB) | 36 |
| **Avro** | Binary | Row-based | Included | Binary | Big data serialization, Kafka, schema evolution, Spark/Flink | Kafka, Spark, Flink, Hadoop | 37 |
| **Parquet** | Binary | **Columnar** | Included | **High** | **Analytics**, large datasets, selective column reads, Spark/Hive | Hadoop, Spark, Hive, Impala, Redshift Spectrum | 38 |
| **ORC** | Binary | **Columnar** | Included | **High** | **Analytics**, Hive optimization, Spark | Hive, Spark, Presto | (Mentioned w/ Parquet) |

* **Data Modeling (Page 39):**
  * Slides briefly show a Star Schema (fact table surrounded by dimension tables) as an Entity Relationship Diagram (ERD). Mentions primary/foreign keys.
  * *Crucial for DE interviews:* Understand relational modeling (normalization: 1NF, 2NF, 3NF), dimensional modeling (Star and Snowflake schemas, fact/dimension tables, Slowly Changing Dimensions - SCDs), NoSQL modeling patterns (key-value, document, columnar, graph), and when to denormalize.
* **Data Lineage (Pages 40-41):**
  * *Definition:* Tracking the origin, movement, and transformation of data.
  * *Importance:* Debugging errors, ensuring compliance, understanding data dependencies.
  * *Example (Page 41):* Uses Spline (a Spark agent) with Glue to capture lineage, Lambda to send it to Neptune (graph DB), queried via Neptune Workbench or a custom frontend.
* **Schema Evolution (Page 42):**
  * *Definition:* Managing changes to data structure (schema) over time without breaking applications.
  * *Importance:* Allows systems to adapt to changing requirements, add/remove/modify fields, and maintain backward/forward compatibility.
  * *Formats supporting it well:* Avro, Parquet (with schema merging).
  * *AWS Tool:* AWS Glue Schema Registry helps manage schemas, check compatibility (BACKWARD, FORWARD, FULL, NONE), and serialize/deserialize data accordingly; often used with Kinesis/MSK/Lambda.
* **Database Performance Optimization Techniques (Page 43):**
  * **Indexing:** Creates data structures (e.g., B-trees) to speed up retrieval on specific columns, avoiding full table scans. Crucial for OLTP; important for DW query performance on filter/join columns.
  * **Partitioning:** Divides large tables into smaller, manageable pieces based on column values (e.g., date range, region). Queries filtering on the partition key scan only the relevant partitions, significantly improving performance and reducing cost (e.g., in Athena, Redshift). Also helps with data management (archiving/dropping old partitions).
  * **Compression:** Reduces storage size and I/O. Columnar formats (Parquet, ORC) offer excellent compression. Different algorithms (GZIP, Snappy, ZSTD, LZOP, BZIP2) trade compression ratio against CPU overhead.
* **Data Sampling Techniques (Pages 44-47):** Used for analysis, testing, or creating subsets.
  * **Random:** Equal probability for each element.
  * **Stratified:** Divide into groups (strata), then randomly sample from each. Ensures subgroup representation.
  * **Systematic:** Select every Nth element.
  * Others mentioned: cluster, convenience, judgmental.
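The first three sampling techniques can be illustrated with the standard library. The population and the odd/even strata split below are arbitrary placeholders, not from the slides:

```python
import random

population = list(range(1, 101))   # e.g. 100 customer IDs
random.seed(7)                     # fixed seed so the example is reproducible

# Random sampling: every element has an equal chance of selection
random_sample = random.sample(population, 10)

# Systematic sampling: pick a random start, then take every Nth element
n = 10
start = random.randrange(n)
systematic_sample = population[start::n]

# Stratified sampling: split into strata (here: odd vs. even IDs), then sample each stratum,
# guaranteeing both subgroups are represented
strata = {"odd": [x for x in population if x % 2],
          "even": [x for x in population if x % 2 == 0]}
stratified_sample = {name: random.sample(group, 5) for name, group in strata.items()}

print(len(random_sample), len(systematic_sample), sorted(stratified_sample))
```

Note how the systematic sample is evenly spaced through the population, which is why it breaks down when the data has a periodic pattern matching N.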
* **Data Skew (Pages 48-49):**
  * *Problem:* Uneven distribution of data/workload across partitions/nodes in a distributed system (e.g., Spark, EMR, DynamoDB partitions). Leads to stragglers and poor performance. Known as the "celebrity problem" or "hot key" problem.
  * *Causes:* Poor choice of partition key (non-uniform distribution), inadequate partitioning strategy.
  * *Mitigation:* Choose high-cardinality partition keys; **salting** (add a random prefix/suffix to distribute hot keys); adaptive/custom partitioning logic; repartitioning data; sampling to identify skew.
* **Data Validation & Profiling (Page 50):** Assessing data quality.
  * **Completeness:** Presence of required data (check nulls, missing values).
  * **Consistency:** Data agreement across different sources/representations (check formats, referential integrity).
  * **Accuracy:** Correctness of data values compared to reality or a trusted source.
  * **Integrity:** Maintaining correctness/consistency throughout the lifecycle (e.g., relational integrity via constraints).
  * *Tools:* AWS Glue Data Quality (DQDL rules), AWS Glue DataBrew (profiling/validation rules).
* **SQL Review (Pages 51-67):** Essential for data querying and manipulation.
  * Aggregates: `COUNT`, `SUM`, `AVG`, `MAX`, `MIN`.
  * Conditional logic: `CASE WHEN ... THEN ... ELSE ... END`. Useful within aggregates.
  * Grouping: `GROUP BY` (used with aggregates).
  * Joins: `INNER`, `LEFT`/`RIGHT`/`FULL OUTER`, `CROSS`. Understand how they combine tables based on keys.
  * Pivoting: Rows to columns (a dedicated `PIVOT` function or conditional aggregation).
  * Regular expressions: `~` or `REGEXP` functions for advanced pattern matching in `WHERE` clauses.
* **Git Review (Pages 68-75):** Essential for code management (IaC, ETL scripts, application code). Understand the basic workflow: clone -> branch -> add -> commit -> push -> merge/pull request.

### II. Key AWS Services for Data Engineering

* **Storage (Pages 76-145):**
  * **Amazon S3 (Simple Storage Service) (Pages 77-129):**
    * **Core:** Foundational object storage. Virtually unlimited scalability. Buckets are regional containers with globally unique names. Objects are identified by keys (the full path).
    * **Storage Classes (Pages 93-100):** Trade off cost vs. access time/frequency/availability. *Standard* (default, frequent access), *Standard-IA* (infrequent access but needs fast retrieval; retrieval fee), *One Zone-IA* (like Standard-IA but single AZ; cheaper, less resilient), *Glacier Instant Retrieval* (archive, ms access), *Glacier Flexible Retrieval* (archive, minutes-to-hours retrieval), *Glacier Deep Archive* (long-term archive, hours retrieval, cheapest), *Intelligent-Tiering* (automatic tiering based on access patterns; good for unpredictable access, avoids retrieval fees between tiers). **All have 11 9s durability.**
    * **Lifecycle Management (Pages 101-104):** Automate transitions between storage classes (e.g., Standard -> IA -> Glacier) and expiration (deletion) based on object age, prefix, or tags. Crucial for cost optimization.
    * **Versioning (Page 90):** Protects against accidental overwrites/deletes by creating object versions. Essential for replication and compliance. Lifecycle rules can manage noncurrent versions.
    * **Replication (CRR/SRR) (Pages 91-92):** Asynchronous copy to another bucket (same/different region/account). Requires versioning. Use S3 Batch Replication for existing objects.
    * **Security (Pages 83-90, 112-120):** Layered approach: IAM (users/roles), bucket policies (resource-based, cross-account), ACLs (legacy; generally disable), Block Public Access. Encryption in transit (HTTPS) and at rest (SSE-S3, SSE-KMS, SSE-C, client-side). OAC (Origin Access Control) for private CloudFront access.
    * **Performance (Pages 109-111):** Prefix naming strategy for parallelism (>=3,500 PUT and >=5,500 GET requests per second per prefix). Multi-Part Upload (recommended >100 MB, required >5 GB).
      Transfer Acceleration (uses the edge network). Byte-range fetches (parallel downloads, partial reads).
    * **Data Lake Features:** Central role. Use with Glue Catalog, Athena, Redshift Spectrum, EMR (via EMRFS), Lake Formation. Optimize format (Parquet/ORC) and partitioning for query performance/cost.
    * **Event Notifications (Pages 106-108):** Trigger Lambda, SQS, SNS on events (Put, Post, Copy, Delete, etc.). Use EventBridge for advanced filtering/routing.
    * **Access Points (Pages 121-122):** Simplify managing access to shared buckets with distinct policies/network origins per application/team.
    * **Object Lambda (Page 123):** Modify data on the fly during GET requests via Lambda.
    * **Storage Lens (Pages 124-129):** Org-wide visibility into usage, activity, cost optimization, and data protection practices.
  * **EC2 Instance Storage (Pages 130-141):**
    * **EBS (Elastic Block Store):** Persistent block storage for EC2. AZ-specific. Various performance tiers (gp2/gp3, io1/io2). Snapshots for backups/AZ migration. Elastic Volumes allow online modification.
    * **EFS (Elastic File System):** Managed NFS file system. Accessible from multiple EC2 instances (Linux) across AZs. Scales automatically. Performance/throughput modes; Standard/IA storage classes. Use for shared application data, home directories, content management. More expensive per GB than EBS/S3.
    * **Instance Store:** Temporary block storage physically attached to the EC2 host. Highest performance, but data is *lost* on instance stop/termination/failure. Good for cache, scratch space, or data replicated elsewhere (like HDFS).
  * **AWS Backup (Pages 142-145):** Centralized, policy-based backup management across multiple AWS services. Automates backup/retention/cold-storage transitions. Backup Vault Lock for immutability (WORM).
  * **AWS Lake Formation (Pages 338-346):** Service to build, secure, and manage S3 data lakes. Centralizes permissions (on the Glue Catalog & S3 data) using a grant/revoke model.
    Supports fine-grained (column/row/cell) access control via data filters and tag-based access control. Integrates Glue Crawlers/Workflows/Jobs via Blueprints. Governed Tables provide ACID transactions on S3 data.
* **Database (Pages 146-241):**
  * **Amazon DynamoDB (Pages 147-186):**
    * **Type:** Managed NoSQL key-value/document DB. Serverless.
    * **Core:** Tables, items (max 400 KB), attributes. Primary key (partition key, or partition + sort key).
    * **Capacity:** Provisioned (RCU/WCU; plan/autoscale) vs. On-Demand (pay-per-request, auto-scales). Understand RCU/WCU calculations based on item size and consistency.
    * **Consistency:** Eventually consistent (default) vs. strongly consistent (costs 2x RCU).
    * **Features:** GSIs/LSIs for flexible queries, DAX (cache), Streams (CDC), TTL, PITR, Global Tables. PartiQL for SQL-like access.
    * **Use Cases:** High-scale web apps, gaming, IoT, session management, metadata store. Large-object pattern (metadata in DynamoDB, data in S3).
    * **Security:** IAM, VPC Endpoints (Gateway), KMS encryption.
  * **Amazon RDS (Pages 187-198):**
    * **Type:** Managed relational database service (MySQL, PostgreSQL, MariaDB, SQL Server, Oracle).
    * **Core:** Manages patching, backups, scaling, HA. ACID compliant.
    * **Features:** Read Replicas (read scaling), Multi-AZ (HA), automated/manual snapshots, encryption (KMS at rest, SSL in transit), IAM DB Auth.
    * **Use Cases:** OLTP applications, structured data storage, source/target for ETL.
  * **Amazon Aurora (Pages 190-191, 197):**
    * **Type:** MySQL/PostgreSQL-compatible relational DB built for the cloud.
    * **Advantages:** Higher performance, scalability, and availability (6 copies across 3 AZs) than standard RDS.
    * **Features:** Auto-scaling storage, up to 15 replicas, Serverless option, Global Database.
  * **Amazon Redshift (Pages 206-241):**
    * **Type:** Managed data warehouse (OLAP). Columnar, MPP.
    * **Core:** Leader/compute nodes. Distribution styles (KEY, ALL, EVEN) and sort keys are crucial for performance.
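A toy sketch of why KEY distribution matters: rows sharing a distribution-key value always hash to the same slice, so a join on that key needs no data movement between nodes. The modular MD5 hash below is purely illustrative, not Redshift's actual distribution function, and the slice count and sample rows are made up:

```python
import hashlib

def slice_for(key: str, num_slices: int = 4) -> int:
    """Map a distribution-key value to a slice (illustrative hash, not Redshift's)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_slices

# Distribute a small fact table by customer id
orders = [("cust_1", 10), ("cust_2", 25), ("cust_1", 40)]
placement: dict[int, list] = {}
for cust, amount in orders:
    placement.setdefault(slice_for(cust), []).append((cust, amount))

# Both cust_1 rows land on the same slice; a customers table distributed on the
# same key would make the join slice-local (no shuffle).
print(placement)
```

The same hashing idea explains skew: if one key value ("cust_1" here) dominates the data, its slice becomes the hot straggler, which is what salting mitigates.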
    * **Features:** Spectrum (query S3), Concurrency Scaling, WLM, VACUUM/ANALYZE, materialized views, Data Sharing, federated queries (RDS/Aurora), Lambda UDFs, Data API, Redshift ML, Serverless option. RA3 nodes decouple compute and storage.
    * **Loading:** `COPY` from S3 (best for bulk), Streaming Ingestion (KDS/MSK), DMS, Aurora zero-ETL.
    * **Use Cases:** Analytics, BI reporting, large-scale data aggregation.
  * **Other Databases:**
    * **DocumentDB (Page 199):** MongoDB-compatible document DB.
    * **Neptune (Pages 202-203):** Graph database (Gremlin, SPARQL, openCypher).
    * **Timestream (Pages 204-205):** Time-series database. Serverless.
    * **Keyspaces (Page 201):** Cassandra-compatible wide-column DB. Serverless.
    * **MemoryDB for Redis (Page 200):** Redis-compatible, *durable* in-memory DB.
* **Migration and Transfer (Pages 252-268):**

| **Service** | **Type** | **Use Case** | **Connectivity** | **Key Feature(s)** | **Page(s)** |
| :-- | :-- | :-- | :-- | :-- | :-- |
| **ADS** | Discovery | Planning on-prem migrations | Agent/agentless | Gathers server inventory, performance, dependencies | 253 |
| **MGN** | Server migration | Lift-and-shift servers (physical/virtual/cloud) to EC2 | Agent | Continuous replication, minimal downtime | 254 |
| **DMS** | Database migration | Homogeneous/heterogeneous DB migration to AWS | Network | CDC for ongoing replication, replication instance | 255-259 |
| **SCT** | Schema conversion | Converts schema/code for heterogeneous DMS migrations | Runs locally/EC2 | Assessment report, works with DMS | 257 |
| **DataSync** | Online file/object | Transfer files/objects between on-prem/cloud <=> AWS storage (S3/EFS/FSx) | Agent (on-prem) | Scheduling, preserves metadata, faster than CLI | 260-262 |
| **Snow Family** | Offline transfer | Moving PB-scale data, edge computing | Physical device | Snowball Edge (Storage/Compute Optimized) | 263-266 |
| **Transfer Fam** | Managed SFTP/FTP | Secure file exchange (SFTP/FTPS/FTP) directly with S3/EFS | Network endpoint | Managed endpoints, various auth integrations | 267-268 |

* **Compute (Pages 269-296):**
  * **EC2 (Pages 270-271):** Base compute, used by EMR, DMS, etc. Choose instance types/pricing wisely. Graviton for price/performance.
  * **Lambda (Pages 272-294):** Serverless functions. Event-driven. Good for short-running tasks, stream processing, API backends, simple ETL triggers. 15-minute timeout. Manage dependencies via Layers or packaging. Use SAM/CDK for deployment.
  * **Batch (Pages 295-296):** Managed batch computing using Docker containers. Provisions EC2/Spot. Use for non-ETL batch jobs.
* **Containers (Pages 297-314):**
  * **ECR (Page 310):** Docker image registry.
  * **ECS (Pages 305-309):** AWS-native orchestrator. EC2 (manage instances) or Fargate (serverless) launch types. Task Definitions, Tasks, Services. Integrates well with AWS services (ELB, IAM, EFS).
  * **EKS (Pages 311-314):** Managed Kubernetes. Use for K8s compatibility/ecosystem. Managed/self-managed nodes or Fargate. Storage via CSI drivers (EBS, EFS, FSx).
* **Analytics (Pages 315-474):**
  * **Glue (Pages 316-337):** Serverless ETL & Data Catalog. Crawlers discover schema. ETL jobs run Spark/Python. DynamicFrames handle schema drift. Job bookmarks for incremental processing. DataBrew for visual prep. Data Quality rules. Workflows orchestrate. Schema Registry for streams.
  * **Athena (Pages 347-364):** Serverless SQL query engine for S3 data lakes. Uses the Glue Catalog. Pay per scan. Optimize with partitioning, columnar formats (Parquet/ORC), and compression. Federated queries to other sources. ACID via Iceberg.
  * **EMR (Pages 372-387):** Managed Hadoop/Spark/Presto/Flink clusters on EC2. More control than Glue. Use EMRFS for S3 access. Managed Scaling adjusts cluster size. Serverless option available.
    Use for complex big data processing and specific framework needs.
  * **Kinesis (Pages 388-433):** Real-time data suite.
    * *Data Streams (KDS):* Core streaming ingestion. Sharded, ordered, replayable (1-365 day retention). Producers (SDK/KPL/Agent) & consumers (SDK/KCL/Lambda/KDF/KDA). Provisioned/On-Demand capacity. Enhanced Fan-Out.
    * *Data Firehose (KDF):* Simple stream *loading* to S3, Redshift, OpenSearch, etc. Near real-time (buffering). Lambda transformations. Auto-scaling.
    * *Data Analytics (KDA/MSAF):* Real-time stream processing using SQL or Apache Flink. Serverless.
  * **MSK (Pages 434-441):** Managed Kafka. Use for Kafka compatibility. Serverless option. MSK Connect for connectors.
  * **OpenSearch Service (Pages 442-460):** Managed OpenSearch/Elasticsearch for search/log analytics. Index State Management, storage tiers (Hot/Warm/Cold). Serverless option.
  * **QuickSight (Pages 461-480):** Serverless BI service. Connects to various sources. SPICE in-memory engine. Dashboards, ML Insights. RLS/CLS.

**Service Comparison: Glue vs. EMR**

| **Feature** | **AWS Glue ETL** | **Amazon EMR** |
| :-- | :-- | :-- |
| **Management** | Serverless (fully managed) | Managed cluster (EC2 instances) / Serverless option |
| **Primary Use** | Data integration, ETL, cataloging | Big data processing (Spark, Hive, Presto, Flink) |
| **Engine** | Apache Spark, Python shell | Multiple (Spark, Hive, Presto, Flink, HBase...) |
| **Control** | Less control over environment | More control (instance types, software, SSH) |
| **Job Duration** | Better for short-medium jobs | Suitable for long-running jobs/clusters |
| **Cost Model** | Pay per DPU-hour (serverless) | Pay per EC2 instance-hour (+ EMR fee) / Serverless |
| **Ease of Use** | Simpler setup, auto-generates code | More complex setup, requires framework knowledge |
| **Flexibility** | Limited to Spark/Python, built-in transforms | Highly flexible, install custom software |
| **When to Use** | Serverless Spark ETL, data cataloging needed | Need specific frameworks, more control, long jobs |
| **Slide Pages** | 316-337 | 372-387 |

**Service Comparison: Kinesis Data Streams vs. SQS**

| **Feature** | **Kinesis Data Streams (KDS)** | **SQS Standard** | **SQS FIFO** |
| :-- | :-- | :-- | :-- |
| **Type** | Real-time data stream | Message queue | Message queue |
| **Ordering** | Preserved within shard (partition key) | Best-effort | Strictly preserved (per Message Group ID) |
| **Delivery** | At-least-once | At-least-once | Exactly-once processing (within dedupe window) |
| **Consumers** | Multiple per stream (independent progress) | Typically one logical consumer app per queue | Typically one logical consumer app per queue |
| **Replayability** | Yes (1-365 days retention) | No (message deleted after processing) | No (message deleted after processing) |
| **Throughput** | High (per shard, scalable via resharding) | Very high (virtually unlimited) | Lower (300/s, or 3,000/s with batching) |
| **Max Size** | 1 MB per record | 256 KB (more via Extended Client) | 256 KB (more via Extended Client) |
| **Key Use Case** | Real-time analytics, log aggregation, event sourcing | Decoupling applications, buffering, tasks | Ordered tasks, exactly-once needs |
| **Slide Pages** | 388-413, 418-423, 491-493 | 482-485, 488-496 | 486, 491-496 |

**Service Comparison: Athena vs. Redshift Spectrum**

| **Feature** | **Amazon Athena** | **Redshift Spectrum** |
| :-- | :-- | :-- |
| **Service** | Standalone query service | Feature *within* Amazon Redshift |
| **Engine** | Presto/Trino | Redshift MPP engine |
| **Primary Use** | Interactive queries on S3 | Extend Redshift queries to the S3 data lake |
| **Data Location** | Primarily S3 | S3 (can join with data *in* Redshift) |
| **Metadata** | AWS Glue Data Catalog | AWS Glue Data Catalog or Redshift external schema |
| **Cost Model** | Pay per TB scanned in S3 ($5/TB) | Pay per TB scanned in S3 ($5/TB) + Redshift cluster cost |
| **Setup** | Simpler (define table in Glue) | Requires a Redshift cluster; define external schema/table |
| **SQL Dialect** | Presto SQL | Redshift SQL (PostgreSQL-based) |
| **Performance** | Good for ad hoc; can vary | Can leverage Redshift cluster resources |
| **When to Use** | Serverless ad hoc S3 queries, data exploration | Need to join S3 data with Redshift tables |
| **Slide Pages** | 347-364 | 210 (Redshift feature) |

* **Application Integration (Pages 481-527):**
  * **SQS (Pages 482-496):** Queues for decoupling. Standard vs. FIFO. DLQ.
  * **SNS (Pages 496-506):** Pub/sub notifications. Fan-out. Standard vs. FIFO topics. Filtering.
  * **Step Functions (Pages 507-511):** Serverless workflow orchestration. Visual DAGs. Error handling.
  * **AppFlow (Pages 516-517):** Managed SaaS <=> AWS data transfer.
  * **EventBridge (Pages 518-522):** Serverless event bus. Rules (pattern/schedule) & targets. Schema Registry.
  * **MWAA (Pages 523-527):** Managed Airflow for complex batch DAGs (Python).
* **Security, Identity, and Compliance (Pages 528-574):**
  * **IAM:** Least privilege. Roles for services/cross-account access. MFA.
  * **Encryption:** KMS (key policies vital), SSE options, client-side, TLS.
  * **Secrets Manager:** Store/rotate secrets.
  * **Macie:** Discover sensitive S3 data.
  * **WAF/Shield:** Web exploit/DDoS protection.
  * *Review service-specific security details.*
* **Networking and Content Delivery (Pages 575-601):**
  * **VPC:** Subnets, route tables, IGW, NAT Gateway. NACLs vs. security groups. Flow Logs.
  * **Connectivity:** Peering, VPC Endpoints (Gateway/Interface), PrivateLink, VPN, Direct Connect.
  * **Route 53:** DNS, health checks. Public/private hosted zones.
  * **CloudFront:** CDN. Edge caching. Origins (S3/OAC, custom HTTP). Invalidation.
* **Management and Governance (Pages 602-647):**
  * **CloudWatch:** Metrics, Logs (Insights), alarms, dashboards. Unified Agent.
  * **CloudTrail:** API audit trail. Logs to S3/CloudWatch Logs. CloudTrail Lake (SQL queries).
  * **Config:** Configuration tracking & compliance (rules). Remediation.
  * **IaC:** CloudFormation, CDK, SAM.
  * **SSM Parameter Store:** Config/secret storage.
  * **Well-Architected Tool:** Review against best practices.
  * **Managed Grafana:** Managed visualization platform.
* **Machine Learning (Pages 650-668):**
  * **SageMaker:** ML platform. Feature Store (centralized features), Lineage Tracking (workflow audit), Data Wrangler (visual data prep).
* **Developer Tools (Pages 669-683):**
  * CLI/SDK access. IaC (CFN, CDK, SAM). CI/CD (CodeCommit - *deprecated*, CodeBuild, CodeDeploy, CodePipeline).
* **Cost Management (Pages 684-690):**
  * **Budgets:** Alerts on cost/usage thresholds.
  * **Cost Explorer:** Visualize/analyze costs, forecasting, RI/Savings Plan recommendations.

### III. General Interview Preparation

* **Behavioral:** STAR method + Amazon Leadership Principles are critical.
* **System Design:** Practice designing pipelines (real-time, batch, data lake, DW). Focus on requirements, service-selection trade-offs, data modeling, scalability, reliability, security, monitoring, and cost.
* **SQL & Python:** Strong SQL is needed. Python (Pandas essential, PySpark basics helpful).
* **Core Concepts:** Solidify understanding of ETL/ELT, DW/DL, file formats, partitioning, compression, consistency, etc.

Good luck with your Amazon interview!