# Data Engineer Interview Q&A - My Notes
| # | 1 |
|:---:|:-------- |
| Que | **How Does a Data Warehouse Differ from an Operational Database?** |
| Ans |<li>Operational databases are optimized for fast, efficient Insert and Update operations (OLTP). As a result, analyzing data directly on them can be more complicated.</li> <li>With a data warehouse, aggregations, calculations, and select statements are the primary focus. This makes a data warehouse an ideal choice for data analysis (OLAP).</li> |
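
A minimal sketch of the contrast, using Python's built-in `sqlite3` as a stand-in for both systems (the `sales` table and its columns are hypothetical):

```python
import sqlite3

# In-memory database as a stand-in; "sales" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP workload: many small, fast inserts/updates of individual rows.
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("EU", 120.0))
conn.execute("UPDATE sales SET amount = 130.0 WHERE id = 1")

# Warehouse/OLAP workload: aggregations and scans over large result sets.
for region, total in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```
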
| # | 2 |
|:---:|:-------- |
| Que |**What Do "args" and "kwargs" Mean?** |
| Ans |<li>Both are used when we are not sure about the number of arguments that will be passed to a function.</li> <li> *args => non-keyword (positional) arguments => e.g. function(3, 5)</li> <li> **kwargs => keyword arguments => e.g. function(Name="Jaffar", Age=26)</li>|
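
A small runnable illustration (the `describe` function is made up for the example):

```python
def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments.
    for value in args:
        print("positional:", value)
    for name, value in kwargs.items():
        print(f"keyword: {name}={value}")

describe(3, 5)                   # *args captures (3, 5)
describe(Name="Jaffar", Age=26)  # **kwargs captures {"Name": "Jaffar", "Age": 26}
```
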
| # | 3 |
|:---:|:-------- |
| Que |**As a Data Engineer, How Have You Handled a Job-Related Crisis?**|
| Ans |<li>Depending on the situation, I apply problem-solving skills to overcome the issue.</li> <li>For example, if data were to get lost or corrupted, I would work with IT to make sure data backups were ready to be loaded, and that other team members had access to what they need.</li>|
| # | 4 |
|:---:|:-------- |
| Que |**Do You Have Any Experience with Data Modeling?**|
| Ans |<li>Yes, from my recent POC I have basic experience in data modeling.</li> <li>Data modeling is the process of creating a visual representation of either a whole information system or parts of it, to communicate the connections between data points and structures.</li> <li>In simple words, a data model is like an architect's building plan: it emphasizes what data is needed and how it should be organized.</li>|
| # | 5 |
|:---:|:-------- |
| Que |**What are the essential skills required to be a data engineer?**|
| Ans |<li>Comprehensive knowledge of data modelling.</li><li>Understanding of database design and architecture (SQL).</li><li>Working experience with data stores and distributed systems like Hadoop (HDFS).</li><li>Data visualization skills.</li><li>Experience with data warehousing and ETL tools.</li>|
| # | 6 |
|:---:|:-------- |
| Que |**Can you name the essential frameworks and applications for data engineers?**|
| Ans |<li>SQL, Hadoop, Spark, Oozie, Python, Bash scripting, and a visualization tool.</li>|
| # | 7 |
|:---:|:-------- |
| Que |**Can you differentiate between a Data Engineer and Data Scientist?**|
| Ans |<li>Data engineers build and maintain the systems that allow data scientists to access and interpret data. The role generally involves creating data models, building data pipelines and overseeing ETL (extract, transform, load).</li><li>Data scientists build and train predictive models using data after it’s been cleaned. They then communicate their analysis to managers and executives.</li>|
| # | 8 |
|:---:|:-------- |
| Que |**What, according to you, are the daily responsibilities of a data engineer?**|
| Ans |<li>Development, testing, and maintenance of architectures.</li><li>Data acquisition and development of data set processes.</li><li>Developing pipelines for various ETL operations and data transformations.</li><li>Simplifying data cleansing and improving de-duplication.</li><li>Identifying ways to improve data reliability, flexibility, accuracy, and quality.</li>|
| # | 9 |
|:---:|:-------- |
| Que |**Can you list and explain the design schemas in Data Modelling?**|
| Ans |<li>***Star Schema:*** The center of the star is one fact table, surrounded by a number of associated dimension tables. It is optimized for querying large data sets. The fact table contains the measures and a foreign key to every dimension table.</li> <li>**Characteristics of Star Schema:**</li><li>Every dimension is represented by exactly one dimension table.</li><li>Each dimension table is joined to the fact table using a foreign key.</li><li>The dimension tables are not joined to each other.</li><li>The dimension tables are not normalized.</li>--------------------------------------------------<li>***Snowflake Schema:*** A Snowflake Schema is an extension of a Star Schema in which the dimension tables are normalized, splitting their data into additional tables.</li><li>**Characteristics of Snowflake Schema:**</li><li>It uses less disk space.</li><li>Query performance is reduced because of the joins across multiple tables.</li><li>It requires more maintenance effort because of the additional lookup tables.</li>|
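
A minimal star-schema sketch in Python's `sqlite3` (the `fact_sales` and `dim_product` tables are hypothetical): the fact table's foreign key joins it to the dimension table, and the measure is aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table, one dimension table.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),  -- foreign key
    amount REAL                                             -- measure
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales VALUES (10, 1, 9.99), (11, 1, 14.50), (12, 2, 59.00);
""")

# Typical star-schema query: join fact to dimension, aggregate the measure.
for row in conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
"""):
    print(row)
```
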
| # | 10 |
|:---:|:-------- |
| Que |**How would you validate a data migration from one database to another?**|
| Ans |<li>Schema Validation.</li><li>Cell-by-Cell Comparison.</li><li>Reconciliation Checks: comparing row counts and aggregates between source and target ensures that the data is not corrupted, date formats are maintained, and the data is completely loaded.</li><li>NULL Validation.</li><li>Security Validation.</li>|
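
A minimal sketch of a reconciliation check, assuming small tables and using `sqlite3` stand-ins for the source and target databases (the `reconcile` helper and table names are hypothetical):

```python
import sqlite3

def reconcile(source, target, table):
    # Row-count check: both databases must contain the same number of rows.
    src_count = source.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    tgt_count = target.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert src_count == tgt_count, f"row counts differ: {src_count} vs {tgt_count}"

    # Cell-by-cell comparison on sorted rows (feasible only for small tables;
    # large tables would be compared via per-column checksums or sampling).
    src_rows = source.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    tgt_rows = target.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    assert src_rows == tgt_rows, "cell-by-cell comparison failed"

src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for db in (src, tgt):
    db.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    db.execute("INSERT INTO t VALUES (1, 'a')")
reconcile(src, tgt, "t")
print("migration validated")
```
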
| # | 11 |
|:---:|:-------- |
| Que |**Have you worked with ETL? If yes, please state, which one do you prefer the most and why?**|
| Ans ||
| # | 12 |
|:---:|:-------- |
| Que |**What is Hadoop? How is it related to Big data? Can you describe its different components?**|
| Ans |<li>Hadoop is an open-source software framework and the most common tool for storing and processing Big Data.</li><li>***Hadoop components:***</li><li>**HDFS:** stands for Hadoop Distributed File System and stores all of Hadoop's data. Being a distributed file system, it has high bandwidth and preserves the quality of data.</li><li>**MapReduce:** a programming model for processing large volumes of data in parallel.</li><li>**YARN:** (Yet Another Resource Negotiator) deals with the allocation and management of resources in Hadoop.</li><li>**Hadoop Common:** provides common utilities that are used across all modules.</li>|
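
A minimal word-count sketch of the MapReduce idea in plain Python; on a real cluster the mapper and reducer would be separate scripts run through Hadoop Streaming, and Hadoop itself would perform the shuffle/sort between them:

```python
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce: pairs arrive grouped by key; Hadoop's shuffle/sort guarantees
    # this on a cluster, here we sort explicitly to simulate it.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["big data big pipelines", "data engineering"]
for word, total in reducer(mapper(lines)):
    print(word, total)
```
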
| # | 13 |
|:---:|:-------- |
| Que |**Do you have any experience in building data systems using the Hadoop framework?**|
| Ans ||
| # | 14 |
|:---:|:-------- |
| Que |**Can you tell me about NameNode? What happens if NameNode crashes or comes to an end?**|
| Ans |<li>It is the central node of the Hadoop Distributed File System (HDFS). It does not store actual data; it stores metadata, for example, on which rack and which DataNode each block is stored. It tracks the different files present in the cluster.</li><li>Generally there is a single NameNode, so when it crashes the file system becomes unavailable, but there is no data loss, since the data itself still resides on the DataNodes; a secondary or standby NameNode can be used to recover.</li>|
| # | 15 |
|:---:|:-------- |
| Que |**Are you familiar with the concepts of Block and Block Scanner in HDFS?**|
| Ans |<li>Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 128MB.</li><li>Block Scanner – tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors.</li>|
| # | 16 |
|:---:|:-------- |
| Que |**What happens when Block Scanner detects a corrupted data block?**|
| Ans |<li>First, the DataNode reports the corrupted block to the NameNode. Then, the NameNode starts the process of creating a new replica, using a correct replica of the block present on other DataNodes.</li><li>The corrupted data block will not be deleted until the replication count of the correct replicas matches the replication factor (3 by default).</li>|
| # | 17 |
|:---:|:-------- |
| Que |**What are the two messages that NameNode gets from DataNode?**|
| Ans |<li>**Heartbeat:** This message signals that the DataNode is still alive. Periodic receipt of Heartbeats is very important for the NameNode to decide whether to keep using a DataNode or not.</li><li>**Block Report:** This is a list of all the data blocks hosted on a DataNode. With this report, the NameNode knows what data is stored on a specific DataNode.</li>|
| # | 18 |
|:---:|:-------- |
| Que |**Can you elaborate on Reducer in Hadoop MapReduce? Explain the core methods of Reducer?**|
| Ans |<li>The Reducer is the second stage of data processing in the Hadoop framework. It processes the output of the mappers and produces the final output.</li><li>***The Reducer has 3 phases:***</li><li>**Shuffle:** The output from the mappers is shuffled and acts as the input to the Reducer.</li><li>**Sort:** Sorting is done simultaneously with shuffling, so the output from the different mappers arrives sorted by key.</li><li>**Reduce:** The Reducer aggregates the values for each key and emits the required output.</li><li>***There are 3 core methods in the Reducer:***</li><li>**setup():** Configures various parameters, such as input data size, at the start of the task.</li><li>**reduce():** Defines the task to perform for each key and its associated values.</li><li>**cleanup():** Cleans up temporary files at the end of the task.</li>|
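
The real Reducer API is Java (`org.apache.hadoop.mapreduce.Reducer`); the sketch below mirrors its lifecycle in Python purely for illustration:

```python
class WordCountReducer:
    """Illustration only: method names mirror the core methods of
    Hadoop's Java Reducer class."""

    def setup(self):
        # Called once before any reduce() call: configure parameters, open resources.
        self.separator = "\t"

    def reduce(self, key, values):
        # Called once per key, with all values for that key (after shuffle/sort).
        print(f"{key}{self.separator}{sum(values)}")

    def cleanup(self):
        # Called once at the end of the task: close resources, remove temp files.
        pass

reducer = WordCountReducer()
reducer.setup()
reducer.reduce("data", [1, 1, 1])  # shuffled/sorted input: all values for "data"
reducer.cleanup()
```
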
| # | 19 |
|:---:|:-------- |
| Que |**How can you deploy a big data solution?**|
| Ans |<li>**Data Ingestion:** Extracting data from sources such as an RDBMS (e.g., MySQL) or Salesforce.</li><li>**Data Storage:** Storing the extracted data in HDFS or a NoSQL database.</li><li>**Data Processing:** Deploying the solution using processing frameworks like MapReduce and Spark.</li>|
| # | 20 |
|:---:|:-------- |
| Que |**Which Python libraries would you utilize for proficient data processing?**|
| Ans |<li>**NumPy**, as it is built for efficient processing of numerical arrays.</li><li>**Pandas**, which is great for statistics and data preparation for machine learning work.</li>|
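
A small illustration of where each library shines (the data is made up):

```python
import numpy as np
import pandas as pd

# NumPy: vectorized arithmetic over whole arrays, no Python-level loops.
amounts = np.array([120.0, 80.0, 200.0])
print(amounts.mean(), amounts * 1.1)

# Pandas: tabular data preparation (handling missing values, grouping).
df = pd.DataFrame({"region": ["EU", "EU", "US"], "amount": [120.0, None, 200.0]})
df["amount"] = df["amount"].fillna(df["amount"].mean())  # impute the missing value
print(df.groupby("region")["amount"].sum())
```
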
| # | 21 |
|:---:|:-------- |
| Que |**Can you differentiate between list and tuples?**|
| Ans |<li>Lists are mutable and can be edited, but Tuples are immutable and cannot be modified.</li>|
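
A quick demonstration:

```python
nums_list = [1, 2, 3]
nums_list[0] = 99        # fine: lists are mutable
nums_list.append(4)

nums_tuple = (1, 2, 3)
try:
    nums_tuple[0] = 99   # raises TypeError: tuples are immutable
except TypeError as exc:
    print(exc)
```
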
| # | 22 |
|:---:|:-------- |
| Que |**How can you deal with duplicate data points in an SQL query?**|
| Ans |<li>Use SQL keywords such as DISTINCT, or GROUP BY with HAVING, to detect and reduce duplicate data points; a UNIQUE constraint prevents duplicates from being inserted in the first place.</li>|
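
A minimal sketch using `sqlite3` (the `emails` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (addr TEXT);
INSERT INTO emails VALUES ('a@x.com'), ('a@x.com'), ('b@x.com');
""")

# DISTINCT: return each value only once.
print(conn.execute("SELECT DISTINCT addr FROM emails").fetchall())

# GROUP BY ... HAVING: find the values that are duplicated.
print(conn.execute(
    "SELECT addr, COUNT(*) FROM emails GROUP BY addr HAVING COUNT(*) > 1"
).fetchall())
```
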
| # | 23 |
|:---:|:-------- |
| Que |**Did you ever work with big data in a cloud computing environment?**|
| Ans ||
| # | 24 |
|:---:|:-------- |
| Que |**How can data analytics help the business grow and boost revenue?**|
| Ans |<li>Data analytics helps a business boost revenue, improve customer satisfaction, and increase profit.</li><li>It also helps in setting realistic goals and supports decision making.</li>|
| # | 25 |
|:---:|:-------- |
| Que |**Relational vs Non-Relational Databases**|
| Ans |<li>**Relational databases** use tables that can all be connected to each other through keys.</li><li>**Non-relational databases** are often document-oriented: instead of fixed tables, they store information as flexible documents, each of which can have its own structure.</li><em>[Reference](https://jelvix.com/blog/relational-vs-non-relational-database)</em>|
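
A minimal sketch of the same record in both worlds, using `sqlite3` for the relational side and a plain JSON document as a stand-in for a document store such as MongoDB (all names are hypothetical):

```python
import json
import sqlite3

# Relational: fixed schema, every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Jaffar', 'Chennai')")

# Non-relational (document-oriented): flexible schema, documents may differ.
document = {"_id": 1, "name": "Jaffar", "city": "Chennai",
            "orders": [{"item": "book", "qty": 2}]}  # nested data, no join needed
print(json.dumps(document))
```
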
| # | 26 |
|:---:|:-------- |
| Que |**SQL Aggregation Functions**|
| Ans |<li>Aggregation functions perform a mathematical operation over a result set and return a single value. Examples: AVG, COUNT, MIN, MAX, and SUM. Often, you’ll need GROUP BY and HAVING clauses to complement these aggregations.</li>|
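
A minimal example via `sqlite3` (the `orders` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (region TEXT, amount REAL);
INSERT INTO orders VALUES ('EU', 100), ('EU', 300), ('US', 50);
""")

# Aggregates per group; HAVING keeps only groups whose total exceeds 100.
query = """
    SELECT region, COUNT(*), AVG(amount), MIN(amount), MAX(amount), SUM(amount)
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 100
"""
for row in conn.execute(query):
    print(row)
```
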
| # | 27 |
|:---:|:-------- |
| Que |**Cache Databases**|
| Ans |<li>Cache databases hold frequently accessed data. They live alongside the main SQL and NoSQL databases, and their aim is to alleviate load and serve requests faster.</li><li>A cache can be partitioned and scaled according to your needs, but it’s typically much smaller in size than your main database.</li><li>***How It Works:*** When a request comes in, the application first checks the cache database, then the main database. This way, you prevent unnecessary, repetitive requests from reaching the main database’s server.</li><li>*Example:* Redis</li>|
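
A minimal cache-aside sketch, using a plain Python dict as a stand-in for a cache database such as Redis (the `users` table and `get_user` helper are hypothetical):

```python
import sqlite3

cache = {}  # in-memory stand-in for a cache database such as Redis

def get_user(conn, user_id):
    # Cache-aside: check the cache first, fall back to the main database.
    if user_id in cache:
        return cache[user_id]   # cache hit: no round trip to the main database
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    cache[user_id] = row        # populate the cache for the next request
    return row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Jaffar')")
print(get_user(conn, 1))  # miss: reads the main database
print(get_user(conn, 1))  # hit: served from the cache
```
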
| # | 28 |
|:---:|:-------- |
| Que |**ETL Challenges**|
| Ans |<li>Heavy data loads</li><li>Long-running, inefficient queries</li><li>Poorly coded mappings</li><li>Incorrect design of source and target systems</li><em>[Reference](https://www.datavail.com/blog/4-issues-that-can-negatively-affect-your-etl-processes/)</em>|
| # | 29 |
|:---:|:-------- |
| Que |**Big Data Design Patterns**|
| Ans ||