# Data Engineering Interview Study Guide

## Introduction

This document outlines the fundamental concepts for Data Engineering interviews. Mastering these areas will provide a strong foundation for discussing data pipelines, architecture, and problem-solving approaches. The goal is to understand the "what," "why," and "how" of each concept.

## 1. Data Warehousing

* **Concept:** A Data Warehouse (DW), sometimes called an Enterprise Data Warehouse (EDW), is a central repository designed for reporting and analysis. It integrates data from diverse operational sources into a single, consistent format.
* **Purpose:** To support Business Intelligence (BI), analytics, data mining, and reporting activities by providing a unified view of historical data.
* **Distinction from OLTP:** Unlike Online Transaction Processing (OLTP) databases optimized for day-to-day transactions (inserts, updates, deletes), DWs are optimized for complex analytical queries (Online Analytical Processing - OLAP) that read large volumes of data (see the sketch at the end of this section).
* **Key Characteristics:**
    * **Subject-Oriented:** Data is organized around core business subjects (e.g., Customer, Product, Sales, Time) rather than specific operational applications. This makes it easier for analysts to find relevant data.
    * **Integrated:** Data from various sources is cleaned, transformed, and combined to ensure consistency, resolving inconsistencies in naming conventions, formats, and encoding. This creates a "single version of the truth."
    * **Time-Variant:** Data represents information over a specific historical period. Historical data is retained, allowing for trend analysis, comparisons over time, and forecasting. Every data point is associated with a point or period in time.
    * **Non-Volatile:** Once data is loaded into the warehouse, it is generally permanent and not subject to real-time updates or deletions. New data is added periodically in batches.
* **Typical Architecture Layers:**
    * **Data Source Layer:** The original operational systems (databases, CRMs, ERPs), external data feeds, files (CSV, JSON), etc.
    * **Staging Area:** An intermediate storage area where data extracted from sources is temporarily held and undergoes cleaning, transformation, and preparation before being loaded into the warehouse. This is often where ETL/ELT processes begin their work.
    * **Data Storage Layer:** The core database where the transformed, integrated, and time-variant data resides. Often utilizes relational databases optimized for analytical queries (e.g., columnar storage, Massively Parallel Processing - MPP). Examples: Amazon Redshift, Google BigQuery, Snowflake, Teradata.
    * **Metadata Layer:** Stores information about the data in the warehouse, including source systems, data definitions, data types, transformation rules, lineage (data origin and flow), refresh schedules, and data quality rules.
    * **Analysis/Access Layer:** Tools and applications used by end-users to access and analyze the data. This includes BI tools (Tableau, Power BI, Looker, QuickSight), reporting tools, SQL query interfaces, and data mining tools.
* **Cloud Data Warehouses:** Leverage cloud infrastructure for scalability, cost-effectiveness, and ease of management. They often separate storage and compute, allowing them to scale independently. Key features include MPP, columnar storage, and often built-in support for semi-structured data.
* **Interview Focus:** Be prepared to define DWs, explain the four characteristics with examples, describe the architecture layers, and differentiate between OLTP and OLAP systems clearly. Discuss the benefits of modern cloud DWs.
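To make the OLTP vs. OLAP distinction concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for both kinds of systems (real deployments would use, e.g., PostgreSQL for OLTP and Redshift/BigQuery/Snowflake for OLAP); the table and column names are illustrative.

```python
import sqlite3

# In-memory SQLite as a stand-in for both system types.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        region TEXT,
        amount REAL,
        sale_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales (customer_id, region, amount, sale_date) VALUES (?, ?, ?, ?)",
    [(1, "West", 120.0, "2024-01-03"), (2, "East", 75.5, "2024-01-03"),
     (1, "West", 200.0, "2024-02-10")],
)

# OLTP-style workload: touch one row by key (fast point lookup/update).
conn.execute("UPDATE sales SET amount = 130.0 WHERE sale_id = 1")

# OLAP-style workload: scan and aggregate many rows to answer an
# analytical question (totals by region and month).
for row in conn.execute("""
    SELECT region, strftime('%Y-%m', sale_date) AS month, SUM(amount) AS total
    FROM sales
    GROUP BY region, month
    ORDER BY region, month
"""):
    print(row)
```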
## 2. ETL (Extract, Transform, Load)

* **Concept:** A traditional three-step data integration process primarily used to populate data warehouses. Data is processed *before* being loaded into the target system.
* **The Three Steps in Detail:**
    * **Extract:**
        * Reading data from source systems (databases, applications, files, APIs).
        * Methods: Full extraction (all data) or incremental extraction (only changes since the last extraction, e.g., using timestamps or Change Data Capture (CDC)).
        * Data is typically copied into a temporary staging area.
    * **Transform:**
        * This is the core processing step, happening in the staging area or a dedicated ETL engine.
        * **Purpose:** To clean, standardize, aggregate, and prepare the data for the target schema.
        * Common transformations (several of which appear in the sketch after this section):
            * **Cleansing:** Handling missing values (nulls), fixing errors, removing duplicates, standardizing text/codes (e.g., converting 'California', 'CA', 'Calif.' to 'CA').
            * **Format Revision:** Converting data types, standardizing date/time formats, adjusting units (e.g., lbs to kg), handling character sets.
            * **Derivation:** Creating new calculated fields based on existing data (e.g., calculating `Profit = Revenue - Cost`).
            * **Joining:** Combining data from multiple source tables/files based on common keys (e.g., joining Customer data with Sales data).
            * **Splitting/Merging:** Splitting a single source column into multiple target columns (e.g., `Full Name` into `First Name`, `Last Name`) or merging multiple source columns into one.
            * **Aggregation:** Summarizing data (e.g., calculating daily total sales from individual transaction records).
            * **Filtering:** Selecting only relevant records based on specific criteria.
            * **Sorting:** Ordering data for easier processing or loading.
    * **Load:**
        * Writing the transformed and structured data into the target data warehouse.
        * Methods:
            * **Full Load:** Erasing existing data in the target and loading the new, transformed dataset.
            * **Incremental Load:** Loading only the new or changed data into the target. This often involves inserting new rows, updating existing rows (less common in non-volatile DWs unless implementing SCDs), or using techniques like partitioning or micro-batches.
        * Considerations: Performance, integrity constraints, handling rejected data.
* **Characteristics:**
    * **Schema-on-Write:** The schema of the target data warehouse must be defined and understood *before* the transformation logic is built. Data is transformed to conform to this predefined schema.
    * Transformation happens outside the target DW, requiring separate compute resources for the ETL engine/staging area.
    * Can be resource-intensive and time-consuming, especially the transformation phase for large, complex datasets, potentially creating bottlenecks.
* **Interview Focus:** Be able to explain each step of ETL with practical examples of transformations. Understand the 'Schema-on-Write' concept and the typical flow of data through the steps.
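Below is a minimal end-to-end ETL sketch in Python using pandas, with an in-memory SQLite database standing in for the target warehouse. The source data, table name `fact_sales`, and column names are illustrative assumptions, not a prescribed design.

```python
import sqlite3
import pandas as pd

# --- Extract: in practice this reads from a source DB, API, or file;
# a hard-coded frame stands in for the raw extract here.
raw = pd.DataFrame([
    {"customer": "Ann", "state": "California", "revenue": 100.0, "cost": 60.0},
    {"customer": "Ann", "state": "California", "revenue": 100.0, "cost": 60.0},  # duplicate
    {"customer": "Bob", "state": "Calif.",     "revenue": 80.0,  "cost": None},
])

# --- Transform: cleanse, standardize, and derive *before* loading.
df = raw.drop_duplicates().copy()                # cleansing: remove duplicates
df["state"] = df["state"].replace(
    {"California": "CA", "Calif.": "CA"})        # cleansing: standardize codes
df["cost"] = df["cost"].fillna(0.0)              # cleansing: handle nulls
df["profit"] = df["revenue"] - df["cost"]        # derivation: Profit = Revenue - Cost

# --- Load: write the conformed data into the target warehouse
# (SQLite stands in for Redshift/BigQuery/Snowflake).
target = sqlite3.connect(":memory:")
df.to_sql("fact_sales", target, index=False, if_exists="replace")
print(pd.read_sql("SELECT * FROM fact_sales", target))
```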
## 3. ELT (Extract, Load, Transform)

* **Concept:** A more modern data integration approach, especially popular with cloud data warehouses and data lakes. Data is loaded into the target *before* significant transformation occurs.
* **The Three Steps in Detail:**
    * **Extract:** Similar to ETL, data is pulled from sources. Often designed to handle raw, sometimes unstructured or semi-structured data.
    * **Load:**
        * The raw, untransformed (or minimally transformed) data is loaded directly into the target system.
        * The target system is typically a powerful cloud data warehouse (BigQuery, Snowflake, Redshift) or a data lake (S3, GCS, ADLS).
        * This step is often faster than the ETL Load step because it involves less processing before writing.
    * **Transform:**
        * Transformations are performed *after* the data is loaded into the target system.
        * This leverages the computational power and scalability of the target (e.g., running SQL queries in a powerful DW, or using Spark jobs on data in a data lake).
        * Allows for more flexible and iterative transformation development, as you can work directly with the raw data in the target.
* **ELT vs. ETL Comparison (Crucial for Interviews):**

| Feature | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
| :--- | :--- | :--- |
| **Transformation** | Happens **before** loading. | Happens **after** loading. |
| **Processing Location** | Dedicated ETL engine or staging area. | **Within** the target system (DW or Data Lake). |
| **Data Type Suitability** | Primarily structured data. | Structured, semi-structured, and unstructured. |
| **Speed** | Can be slower overall due to transformation bottleneck before loading. | Often faster for loading large volumes; transformation speed depends on target system. |
| **Flexibility** | Less flexible; target schema rigid upfront. | More flexible; transformations can be developed and refined iteratively on raw data. |
| **Schema Approach** | Schema-on-Write (target schema predefined). | Can be Schema-on-Read (schema defined when reading/transforming). |
| **Architecture** | Often requires separate staging/ETL server infrastructure. | Relies heavily on the target system's compute power. |
| **Cost Model** | Cost tied to ETL server + DW storage/compute. | Cost tied heavily to target DW/Data Lake storage & compute for transformation. |
| **Initial Load** | Transformed data is loaded. | Raw data is loaded. |
| **Compliance** | Sensitive data can be masked/aggregated *before* loading. | Raw sensitive data lands in the target; requires careful access control and processing *within* the target. |

* **Interview Focus:** Explain each step of ELT. Clearly articulate the differences between ETL and ELT, focusing on the timing and location of transformation, and when you would choose one over the other (e.g., for large data volumes, cloud environments, diverse data types). Understand 'Schema-on-Read'.
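For contrast with the ETL sketch above, here is a minimal ELT sketch: raw data lands first, and the transformation runs afterwards as SQL inside the target. SQLite again stands in for a cloud warehouse, and the table names are illustrative; in production the `CREATE TABLE ... AS SELECT` would run inside BigQuery/Snowflake/Redshift.

```python
import sqlite3

# SQLite stands in for the cloud warehouse that does the transform work.
wh = sqlite3.connect(":memory:")

# --- Extract + Load: land the raw data as-is, with no cleanup beforehand.
wh.execute("CREATE TABLE raw_orders (order_id INTEGER, state TEXT, amount TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "California", "100.50"), (2, "Calif.", "80.00"), (3, "CA", "42.10")],
)

# --- Transform: done *after* loading, using the target's own SQL engine.
wh.execute("""
    CREATE TABLE orders_clean AS
    SELECT
        order_id,
        CASE WHEN state IN ('California', 'Calif.') THEN 'CA'
             ELSE state END AS state,
        CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
print(wh.execute("SELECT * FROM orders_clean").fetchall())
```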
## 4. Slowly Changing Dimensions (SCD)

* **Concept:** Techniques used in data warehousing to manage changes in dimension attributes over time while maintaining historical accuracy for analysis. Dimensions (e.g., Customer, Product, Geography) contain descriptive data that can change, although less frequently than facts.
* **Common SCD Types (Focus on Type 1 & 2):**
    * **Type 0 (Retain Original):**
        * **Mechanism:** No changes are tracked or stored. The original value is kept permanently.
        * **Use Case:** Attributes that should never change or where historical tracking is irrelevant (e.g., Date of Birth, original Order Date).
    * **Type 1 (Overwrite):**
        * **Mechanism:** The old value in the dimension table is simply overwritten with the new value.
        * **Effect:** History is *not* preserved. The dimension record always shows the most current information. Facts linked to this dimension row will *retroactively* appear associated with the new value, even for historical events.
        * **Use Case:** Correcting errors, minor changes where historical analysis on that specific attribute is not needed (e.g., correcting a typo in a customer's name if the history of the typo isn't important).
    * **Type 2 (Add New Row):**
        * **Mechanism:** When a change occurs, the existing dimension row is marked as inactive (e.g., by setting an `EndDate` or a `Current_Flag = 'N'`), and a new row is inserted into the dimension table with the updated information. This new row is marked as active (e.g., using a `StartDate` and `Current_Flag = 'Y'`). A unique **Surrogate Key** is crucial for each row to distinguish between different versions of the same business entity (see the sketch at the end of this section).
        * **Effect:** Full history is preserved. Facts link to the specific *version* of the dimension row that was active at the time of the event. This allows accurate historical reporting (e.g., seeing sales grouped by the customer's city *at the time of the sale*).
        * **Use Case:** Tracking significant changes where historical context is essential for analysis (e.g., Customer Address, Employee Department, Product Price changes). This is the most common SCD type for tracking history.
    * **Type 3 (Add New Attribute/Column):**
        * **Mechanism:** The dimension table adds a new column (or columns) to store the previous value of an attribute alongside the current value (e.g., `Current_Address`, `Previous_Address`).
        * **Effect:** Tracks a limited history (usually just the previous state). Does not require adding new rows.
        * **Use Case:** When you only need to track the *immediately* preceding value, not the full history of all changes.
    * **Type 4 (Add History Table):**
        * **Mechanism:** The main dimension table only stores the current version of the dimension members (like Type 1). A separate history table stores all previous versions of the dimension records with effective dates.
        * **Effect:** Keeps the main dimension table smaller and faster for queries needing only current data, while preserving full history in a separate table. Requires joining to the history table for historical analysis.
    * **Type 6 (Combined Approach):**
        * **Mechanism:** Combines techniques from Type 1 (overwriting certain attributes), Type 2 (adding a new row for major changes), and Type 3 (storing a previous value in a column) within a single dimension table structure.
        * **Effect:** Offers flexibility in handling different attributes within the same dimension. Can become complex to manage.
* **Interview Focus:** Define SCDs and why they are necessary. Be able to explain Type 1 and Type 2 in detail, providing clear examples of how data changes and what the resulting dimension table looks like. Discuss the trade-offs between Type 1 and Type 2. Understand the role of Surrogate Keys in Type 2. Briefly describe other types.
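Here is a minimal Type 2 sketch in pandas: the active row for a customer is expired (`end_date`, `current_flag = 'N'`) and a new version is inserted with a fresh surrogate key. Column names like `customer_sk` are illustrative; real implementations would typically do this in warehouse SQL (e.g., with `MERGE`).

```python
from datetime import date
import pandas as pd

# Current state of a Type 2 customer dimension: one active row per customer,
# identified by a surrogate key (customer_sk) separate from the business key.
dim = pd.DataFrame([
    {"customer_sk": 1, "customer_id": 42, "city": "Austin",
     "start_date": date(2020, 1, 1), "end_date": None, "current_flag": "Y"},
])

def apply_scd2(dim: pd.DataFrame, customer_id: int, new_city: str,
               change_date: date) -> pd.DataFrame:
    """Expire the active row for this customer and insert a new version."""
    active = (dim["customer_id"] == customer_id) & (dim["current_flag"] == "Y")
    dim.loc[active, ["end_date", "current_flag"]] = [change_date, "N"]
    new_row = {
        "customer_sk": dim["customer_sk"].max() + 1,  # next surrogate key
        "customer_id": customer_id,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,
        "current_flag": "Y",
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

# Customer 42 moves city: history is preserved, so facts can join to the
# dimension version that was active at the time of each event.
dim = apply_scd2(dim, customer_id=42, new_city="Denver",
                 change_date=date(2024, 6, 1))
print(dim)
```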
## 5. Data Modelling for Data Warehousing

* **Concept:** The process of creating a structured representation of data and relationships. In DW, the focus is on organizing data for analytical querying efficiency and understandability, differing from OLTP modelling, which prioritizes data integrity and minimizing redundancy.
* **Key Techniques for Data Warehousing:**
    * **Dimensional Modeling (Most Common):**
        * Developed by Ralph Kimball. Organizes data around business processes.
        * **Core Components:**
            * **Facts:** Tables that store quantitative measurements or metrics related to a business process (e.g., Sales Amount, Quantity Sold, Profit, Transaction Count). They typically contain foreign keys linking to dimension tables and the numerical measures.
                * *Types of Facts:* Additive (sum across any dimension, e.g., Sales Amount), Semi-Additive (sum across some dimensions but not all, e.g., Inventory Balance), Non-Additive (cannot be summed, e.g., Unit Price, Ratios).
            * **Dimensions:** Tables that store descriptive attributes related to the facts. They provide context for the measurements (e.g., Customer details, Product attributes, Time periods, Geographic locations). Dimensions are typically wide tables containing many attributes.
        * **Popular Schemas:**
            * **Star Schema:**
                * **Structure:** A central fact table directly connected to multiple surrounding dimension tables. Dimension tables are typically denormalized (contain all attributes in one table). A minimal example appears at the end of this section.
                * **Pros:** Simple to understand and implement. Optimized for query performance due to fewer joins required (usually one join per dimension). Faster aggregation queries.
                * **Cons:** Data redundancy in denormalized dimension tables.
            * **Snowflake Schema:**
                * **Structure:** A central fact table connected to dimension tables, and those dimension tables are further normalized into sub-dimensions (forming a snowflake shape).
                * **Pros:** Reduces data redundancy. Easier to maintain dimensions.
                * **Cons:** More complex structure. Requires more joins for queries, potentially impacting performance compared to a Star schema.
            * **Fact Constellation (Galaxy Schema):** Multiple fact tables that share one or more common dimension tables. Represents multiple related business processes.
    * **Entity-Relationship (ER) Modeling (Typically for OLTP Sources):**
        * Focuses on identifying entities (objects) and the relationships between them.
        * Primary goal is data integrity and minimizing redundancy through normalization.
        * Used to design transactional databases. While not the final model *of* the DW, understanding source system ER models is crucial for data extraction.
    * **Normalization vs. Denormalization:**
        * **Normalization:** The process of organizing data in a database to reduce redundancy and improve data integrity. Involves dividing larger tables into smaller, related tables. Typically used for OLTP systems (e.g., up to 3NF).
        * **Denormalization:** The process of intentionally adding redundant data into a database, typically by combining data from multiple tables into one. Used in Data Warehousing (especially dimensional modeling) to reduce the number of joins needed for read-heavy analytical queries, thus improving query performance.
* **Interview Focus:** Define data modelling in the DW context. Explain Dimensional Modeling, facts, and dimensions with examples. Be able to compare and contrast Star and Snowflake schemas and discuss their trade-offs. Explain the difference between normalization and denormalization and why denormalization is common in DWs.
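The sketch below builds a tiny star schema in SQLite and runs a typical analytical query against it: one join per dimension, then aggregation of the fact measures. The schema (a sales fact with product and date dimensions) is an illustrative assumption, not a prescribed model.

```python
import sqlite3

# A tiny star schema: one fact table with foreign keys to two
# denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE fact_sales  (product_sk INTEGER, date_sk INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date    VALUES (1, '2024-03-01', '2024-03'), (2, '2024-04-01', '2024-04');
    INSERT INTO fact_sales  VALUES (1, 1, 1200.0), (2, 1, 300.0), (1, 2, 1100.0);
""")

# Typical star-schema query: one join per dimension, then aggregate.
query = """
    SELECT d.month, p.category, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_sk = p.product_sk
    JOIN dim_date d    ON f.date_sk    = d.date_sk
    GROUP BY d.month, p.category
    ORDER BY d.month
"""
for row in conn.execute(query):
    print(row)
```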
## 6. Data Orchestration

* **Concept:** The automation, scheduling, monitoring, and management of complex data workflows or pipelines. It ensures tasks within a data pipeline run in the correct order, handle dependencies, and are executed reliably.
* **Key Principles/Components:**
    * **Workflow Definition:** Defining the sequence of tasks that need to be executed and their dependencies. Often represented as a Directed Acyclic Graph (DAG), where nodes are tasks and directed edges represent dependencies (see the sketch at the end of this section).
    * **Scheduling:** Triggering workflows based on time (e.g., daily at midnight using cron-like schedules), events (e.g., file arrival in storage), or external triggers.
    * **Execution:** Managing the running of individual tasks, potentially across distributed systems or different compute environments (e.g., running a Python script, executing a SQL query, launching a Spark job).
    * **Monitoring:** Tracking the status of workflows and individual tasks (running, successful, failed, skipped). Provides visibility into pipeline health.
    * **Error Handling & Retries:** Defining how to react when a task fails (e.g., automatically retrying a few times, sending alerts, stopping the workflow).
    * **Logging:** Capturing detailed logs for each task execution for debugging and auditing.
    * **Parameterization:** Designing workflows to accept parameters (e.g., a date, a configuration file) to make them reusable for different inputs or environments.
    * **Dependency Management:** Ensuring tasks only run after the tasks they depend on have successfully completed.
* **Popular Orchestration Tools:**
    * **Open Source:**
        * **Apache Airflow:** Very popular, Python-based; workflows are defined as DAGs in Python code. Strong community and extensibility.
        * **Prefect:** Modern tool, Python-focused, aims for easier development and dynamic workflows.
        * **Dagster:** Focuses on data assets and testing, Python-based.
        * **Luigi:** Developed at Spotify, Python-based; simpler than Airflow but less feature-rich.
    * **Cloud-Native:**
        * **AWS Step Functions:** State machine service for coordinating distributed applications and microservices, often used for data workflows.
        * **Azure Data Factory:** Cloud-based ETL and ELT service with strong orchestration capabilities for various data sources and destinations.
        * **Google Cloud Composer:** Managed Apache Airflow service on GCP.
        * **Google Cloud Functions / AWS Lambda:** Serverless compute; can be used for simple, event-driven orchestration tasks.
    * **Enterprise:** Control-M, Astronomer (managed Airflow platform).
* **Interview Focus:** Explain *why* orchestration is critical for production data pipelines. Describe the key features of an orchestration tool (scheduling, dependencies, monitoring, error handling). Explain what a DAG is. Name one or two popular tools you are familiar with and perhaps briefly discuss their approach (e.g., Airflow's Python DAGs).
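Here is a minimal Airflow-style DAG sketch (assuming Airflow 2.x): three placeholder Python tasks wired extract -> transform -> load, with a daily schedule and no backfill. The DAG id and task bodies are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; real tasks would call extract/transform/load logic.
def extract():   print("pull data from source")
def transform(): print("clean and reshape data")
def load():      print("write data to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # cron-like schedule (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,       # don't backfill past runs on first deploy
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies form the DAG edges: transform waits for extract, and so on.
    t_extract >> t_transform >> t_load
```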
## 7. Expanding Beyond the Core

These concepts frequently appear in Data Engineering roles and interviews. While not covered in the initial list, understanding them is vital.

* **7.1 Databases and SQL:**
    * **Relational Databases:** Understanding concepts like tables, schemas, keys (primary, foreign, surrogate), and indexing.
    * **SQL Proficiency:** This is non-negotiable. You must be comfortable writing complex queries involving:
        * JOINs (INNER, LEFT, RIGHT, FULL, CROSS).
        * Aggregation functions (COUNT, SUM, AVG, MIN, MAX).
        * GROUP BY and HAVING clauses.
        * Window functions (RANK, ROW_NUMBER, LEAD, LAG, aggregate functions over partitions).
        * Common Table Expressions (CTEs).
        * Subqueries.
        * UNION and UNION ALL.
        * INSERT, UPDATE, DELETE statements.
    * **NoSQL Databases:** Awareness of different types (Key-Value, Document, Column-Family, Graph) and their typical use cases. You might interact with NoSQL databases as source systems.
* **7.2 Big Data Technologies:**
    * Understanding the challenges of processing data exceeding the capacity of a single machine.
    * **Apache Spark:** A powerful, open-source unified analytics engine for large-scale data processing.
        * Concepts: RDDs (Resilient Distributed Datasets), DataFrames/Datasets (structured APIs), Spark SQL, Spark Streaming.
        * Programming Languages: PySpark (Python API) is very common for DE. Scala and Java are also used.
        * Use Cases: Large-scale ETL/ELT, data transformation, streaming processing, machine learning preparation.
    * **Hadoop (Historical Context):** Understand HDFS (distributed file system) and MapReduce (the original processing model), as they laid the groundwork for Spark and the modern data ecosystem; MapReduce is less commonly used directly now.
* **7.3 Cloud Platforms (AWS, GCP, Azure):**
    * Data Engineering roles are heavily cloud-based. Be familiar with the data-related services in at least one major cloud provider.
    * **Common Service Categories across Clouds:**
        * **Object Storage:** Scalable, durable, low-cost storage for raw data (S3 - AWS, GCS - GCP, Azure Blob Storage). Often serves as a data lake foundation.
        * **Managed Databases:** Relational (RDS, Cloud SQL, Azure SQL Database) and NoSQL (DynamoDB, Cloud Datastore/Firestore, Azure Cosmos DB).
        * **Data Warehousing:** Covered in Section 1 (Redshift, BigQuery, Synapse Analytics/Azure SQL DW).
        * **ETL/ELT & Data Integration:** Managed services for building data pipelines (Glue - AWS, Dataflow/Dataproc - GCP, Data Factory - Azure).
        * **Streaming Ingestion & Processing:** Services for handling real-time data (Kinesis - AWS, Pub/Sub / Dataflow - GCP, Event Hubs / Stream Analytics - Azure).
        * **Compute:** Virtual machines (EC2, GCE, Azure VMs) and serverless (Lambda, Cloud Functions) for running code and processing tasks.
        * **Orchestration:** Covered in Section 6 (Step Functions, Cloud Composer, Data Factory).
* **7.4 Data Streaming Concepts:**
    * **Real-time vs. Batch Processing:** Understand the difference and use cases for each.
    * **Messaging Queues/Brokers:** Systems that enable publishing and subscribing to streams of data (Kafka, Kinesis, Pub/Sub). Essential for decoupling systems and building streaming pipelines.
    * **Stream Processing Frameworks:** Tools for processing data streams in real-time (Spark Streaming, Apache Flink, Kinesis Data Analytics).
* **7.5 Data Quality and Governance:**
    * **Data Quality:** The accuracy, completeness, consistency, validity, and timeliness of data.
    * **Data Quality Activities:** Data profiling (understanding data characteristics), data validation (checking data against rules), data cleansing (fixing quality issues), data monitoring (tracking quality over time).
    * **Data Governance:** The overall management of the availability, usability, integrity, and security of data. Includes defining standards, policies, and processes.
    * **Data Lineage:** Understanding where data originated, how it was transformed, and where it moved over time. Important for debugging, auditing, and compliance.
    * **Data Catalogs:** Tools that provide a searchable inventory of an organization's data assets, including metadata, lineage, and data quality information.
* **7.6 Testing Data Pipelines:**
    * Testing DE pipelines is challenging but critical for reliability.
    * **Types of Tests:**
        * **Unit Tests:** Testing individual functions or components (e.g., a single transformation script).
        * **Integration Tests:** Testing how different components of the pipeline interact (e.g., extractor connecting to source, loader writing to target).
        * **Data Validation Tests:** Checking the *output* data against expected quality rules, counts, schemas, or values. This is often the most important type of test in DE (see the sketch after this list).
        * **End-to-End Tests:** Testing the entire pipeline from source to target.
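Below is a minimal data-validation sketch in pandas; the rules, column names, and `validate_output` helper are illustrative assumptions. Production pipelines often run such checks through dedicated frameworks, but the underlying idea is the same: assert properties of the *output* data, not just the code.

```python
import pandas as pd

def validate_output(df: pd.DataFrame) -> list[str]:
    """Check pipeline output against simple quality rules; return failures."""
    failures = []
    if df.empty:
        failures.append("output is empty")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if df["amount"].lt(0).any():
        failures.append("negative amounts present")
    if df["state"].isna().any():
        failures.append("missing state values")
    return failures

# Example run against a small pipeline output.
output = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [100.0, 80.0, 42.1],
    "state": ["CA", "CA", "TX"],
})
problems = validate_output(output)
assert not problems, f"data validation failed: {problems}"
print("all data quality checks passed")
```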
* **7.7 Programming Languages:**
    * **Python:** The dominant language in DE. Used for scripting, automation, interacting with APIs and cloud services (e.g., Boto3 for AWS), building data transformation logic (e.g., with Pandas or PySpark), and defining orchestration workflows (Airflow).
    * **SQL:** The language of databases. Absolutely essential for interacting with relational databases and data warehouses; often used directly for ELT transformations.
    * **Scala or Java:** Used for Spark development, especially for building libraries or highly optimized jobs, though PySpark is increasingly common for application logic.
* **7.8 System Design:**
    * Be prepared to discuss how you would design a data pipeline to solve a specific business problem.
    * Think about: source systems, data volume and velocity, required latency (batch vs. streaming), data storage choices, processing engine (ETL/ELT tool, Spark), orchestration, monitoring, error handling, scalability, and cost considerations.
    * Practice drawing simple architecture diagrams.

## 8. Interview Practice Strategy

* **Explain Concepts:** Practice explaining each section in your own words, simply and clearly. Imagine explaining it to someone relatively technical but not necessarily a DE expert.
* **Scenario Questions:** Think about how these concepts apply to real-world scenarios: "How would you build a pipeline to ingest clickstream data?" "How would you handle late-arriving data?"
* **System Design Practice:** Start with simple problems (e.g., "Design a system to count website visits per day") and gradually add complexity (user sessions, different data sources, real-time requirements). Focus on trade-offs.
* **SQL Practice:** Use platforms like LeetCode, HackerRank, or DataCamp to practice SQL queries, especially window functions and complex joins.
* **Coding Practice (Python):** Practice data manipulation with Pandas, interacting with APIs, and potentially basic PySpark if relevant to the roles you're applying for.

## 9. Further Study

* **Hands-on Practice:** The best way to learn is by doing. Set up free-tier accounts on cloud providers (AWS, GCP, Azure) and build small example pipelines. Work with sample datasets.
* **Official Documentation:** Read the documentation for specific tools (Airflow, Spark) and cloud services (S3, BigQuery, Glue, Kinesis, etc.) relevant to your interests or job requirements.
* **Online Courses:** Platforms like Coursera, edX, Udemy, DataCamp, and Udacity offer comprehensive courses on Data Engineering fundamentals and specific technologies.
* **Books:** "The Data Warehouse Toolkit" by Ralph Kimball is a classic for dimensional modeling. "Designing Data-Intensive Applications" by Martin Kleppmann provides deep insights into data systems.
* **Blogs and Community:** Follow data engineering blogs, attend webinars, and engage with the community to stay updated on best practices and new technologies.

---