Jaffar
# Data Engineer Interview Q&A - My Notes

| # | 1 |
|:---:|:-------- |
| Que | **How Does a Data Warehouse Differ from an Operational Database?** |
| Ans |<li>Standard operational databases are built around INSERT and UPDATE operations and focus on speed and efficiency. As a result, analyzing data in them can be more complicated.</li> <li>In a data warehouse, aggregations, calculations, and SELECT statements are the primary focus. This makes a data warehouse an ideal choice for data analysis.</li> |

| # | 2 |
|:---:|:-------- |
| Que | **What Do "args" and "kwargs" Mean?** |
| Ans |<li>Both are used when we are not sure how many arguments will be passed to a function.</li> <li>`*args` => non-keyword (positional) arguments, e.g. `function(3, 5)`</li> <li>`**kwargs` => keyword arguments, e.g. `function(Name="Jaffar", Age=26)`</li> |

| # | 3 |
|:---:|:-------- |
| Que | **As a Data Engineer, How Have You Handled a Job-Related Crisis?** |
| Ans |<li>Depending on the situation, I apply my problem-solving abilities to overcome the issue.</li> <li>For example, if data were lost or corrupted, I would work with IT to make sure data backups were ready to be loaded and that other team members had access to what they needed.</li> |

| # | 4 |
|:---:|:-------- |
| Que | **Do You Have Any Experience with Data Modeling?** |
| Ans |<li>Yes, from my recent POC I have basic experience in data modeling.</li> <li>Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.</li> <li>In simple words, a data model is like an architect's building plan: it emphasizes what data is needed and how it should be organized.</li> |

| # | 5 |
|:---:|:-------- |
| Que | **What are the essential skills required to be a data engineer?** |
| Ans |<li>Comprehensive knowledge of data modelling.</li><li>Understanding of database design and database architecture (SQL).</li><li>Working experience with data stores and distributed systems like Hadoop (HDFS).</li><li>Data visualization skills.</li><li>Experience with data warehousing and ETL tools.</li> |

| # | 6 |
|:---:|:-------- |
| Que | **Can you name the essential frameworks and applications for data engineers?** |
| Ans |<li>SQL, Hadoop, Spark, Oozie, Python, Bash scripting, and a visualization tool.</li> |

| # | 7 |
|:---:|:-------- |
| Que | **Can you differentiate between a Data Engineer and Data Scientist?** |
| Ans |<li>Data engineers build and maintain the systems that allow data scientists to access and interpret data. The role generally involves creating data models, building data pipelines, and overseeing ETL (extract, transform, load).</li><li>Data scientists build and train predictive models using data after it has been cleaned. They then communicate their analysis to managers and executives.</li> |

| # | 8 |
|:---:|:-------- |
| Que | **What, according to you, are the daily responsibilities of a data engineer?** |
| Ans |<li>Development, testing, and maintenance of architectures.</li><li>Data acquisition and development of data set processes.</li><li>Developing pipelines for various ETL operations and data transformation.</li><li>Simplifying data cleansing and improving the de-duplication and building of data.</li><li>Identifying ways to improve data reliability, flexibility, accuracy, and quality.</li> |

| # | 9 |
|:---:|:-------- |
| Que | **Can you list and explain the design schemas in Data Modelling?** |
| Ans |<li>***Star Schema:*** The center of the star has one fact table, surrounded by a number of associated dimension tables. It is optimized for querying large data sets. The fact table contains the measures and a key to every dimension table.</li> <li>**Characteristics of Star Schema:**</li><li>Every dimension in a star schema is represented by only one dimension table.</li><li>Each dimension table is joined to the fact table using a foreign key.</li><li>The dimension tables are not joined to each other.</li><li>The dimension tables are not normalized.</li>--------------------------------------------------<li>***Snowflake Schema:*** A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalized, which splits data into additional tables.</li><li>**Characteristics of Snowflake Schema:**</li><li>It uses less disk space.</li><li>Because of the multiple tables, query performance is reduced.</li><li>It requires more maintenance effort because of the additional lookup tables.</li> |

| # | 10 |
|:---:|:-------- |
| Que | **How would you validate a data migration from one database to another?** |
| Ans |<li>Schema validation.</li><li>Cell-by-cell comparison.</li><li>Reconciliation checks: these ensure that the data is not corrupted, date formats are maintained, and the data is completely loaded.</li><li>NULL validation.</li><li>Security validation.</li> |

| # | 11 |
|:---:|:-------- |
| Que | **Have you worked with ETL? If yes, please state, which one do you prefer the most and why?** |
| Ans | |

| # | 12 |
|:---:|:-------- |
| Que | **What is Hadoop? How is it related to Big data? Can you describe its different components?** |
| Ans |<li>Hadoop is an open-source software framework and one of the most widely used tools for processing big data.</li><li>***Hadoop components:***</li><li>**HDFS:** stands for Hadoop Distributed File System and stores all of the data of Hadoop. Being a distributed file system, it has a high bandwidth and preserves the quality of data.</li><li>**MapReduce:** a processing technique for large volumes of data.</li><li>**YARN:** (Yet Another Resource Negotiator) deals with the allocation and management of resources in Hadoop.</li><li>**Hadoop Common:** provides common utilities that can be used across all modules.</li> |

| # | 13 |
|:---:|:-------- |
| Que | **Do you have any experience in building data systems using the Hadoop framework?** |
| Ans | |

| # | 14 |
|:---:|:-------- |
| Que | **Can you tell me about NameNode? What happens if NameNode crashes or comes to an end?** |
| Ans |<li>It is the central node of the Hadoop Distributed File System (HDFS). It does not store the actual data; it stores metadata. For example, it records on which rack and on which DataNode each piece of data is stored. It tracks the different files present in the cluster.</li><li>Generally, there is one NameNode, so when it crashes the system may become unavailable, but there will not be any data loss.</li> |

| # | 15 |
|:---:|:-------- |
| Que | **Are you familiar with the concepts of Block and Block Scanner in HDFS?** |
| Ans |<li>**Block:** The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 128 MB.</li><li>**Block Scanner:** tracks the list of blocks present on a DataNode and verifies them to find any checksum errors.</li> |

| # | 16 |
|:---:|:-------- |
| Que | **What happens when Block Scanner detects a corrupted data block?** |
| Ans |<li>First, the DataNode reports the corrupted block to the NameNode. Then, the NameNode starts creating a new replica from a correct replica of the corrupted block present on another DataNode.</li><li>The corrupted data block will not be deleted until the count of correct replicas matches the replication factor (3 by default).</li> |

| # | 17 |
|:---:|:-------- |
| Que | **What are the two messages that NameNode gets from DataNode?** |
| Ans |<li>**Heartbeat:** This message signals that the DataNode is still alive. Periodic receipt of the Heartbeat is very important for the NameNode to decide whether to use a DataNode or not.</li><li>**Block Report:** This is a list of all the data blocks hosted on a DataNode. With this report, the NameNode gets information about what data is stored on a specific DataNode.</li> |

| # | 18 |
|:---:|:-------- |
| Que | **Can you elaborate on Reducer in Hadoop MapReduce? Explain the core methods of Reducer?** |
| Ans |<li>Reducer is the second stage of data processing in the Hadoop framework. The Reducer processes the output of the mapper and produces the final output.</li><li>***The Reducer has 3 phases:***</li><li>**Shuffle:** The output from the mappers is shuffled and acts as the input for the Reducer.</li><li>**Sort:** Sorting is done simultaneously with shuffling, and the output from different mappers is sorted.</li><li>**Reduce:** This phase aggregates the key-value pairs and gives the required output.</li><li>***There are 3 core methods in Reducer:***</li><li>**Setup:** It configures various parameters like input data size.</li><li>**Reduce:** In this method, a task is defined for the associated key.</li><li>**Cleanup:** This method cleans temporary files at the end of the task.</li> |

| # | 19 |
|:---:|:-------- |
| Que | **How can you deploy a big data solution?** |
| Ans |<li>**Data ingestion:** Extraction of data from sources like an RDBMS, Salesforce, or MySQL.</li><li>**Data storage:** The extracted data is stored in HDFS or a NoSQL database.</li><li>**Data processing:** Deploying the solution using processing frameworks like MapReduce and Spark.</li> |

| # | 20 |
|:---:|:-------- |
| Que | **Which Python libraries would you utilize for proficient data processing?** |
| Ans |<li>**NumPy**, as it is used for efficient processing of arrays of numbers.</li><li>**Pandas**, which is great for statistics and data preparation for machine learning work.</li> |

| # | 21 |
|:---:|:-------- |
| Que | **Can you differentiate between list and tuples?** |
| Ans |<li>Lists are mutable and can be edited, but tuples are immutable and cannot be modified.</li> |

| # | 22 |
|:---:|:-------- |
| Que | **How can you deal with duplicate data points in an SQL query?** |
| Ans |<li>Use SQL keywords such as DISTINCT, UNIQUE, and GROUP BY with HAVING to reduce duplicate data points.</li> |

| # | 23 |
|:---:|:-------- |
| Que | **Did you ever work with big data in a cloud computing environment?** |
| Ans | |

| # | 24 |
|:---:|:-------- |
| Que | **How can data analytics help the business grow and boost revenue?** |
| Ans |<li>Data analytics can boost revenue, improve customer satisfaction, and increase profit.</li><li>Data analytics helps in setting realistic goals and supports decision making.</li> |

| # | 25 |
|:---:|:-------- |
| Que | **Relational vs Non-Relational Databases** |
| Ans |<li>**Relational databases** use tables that are connected to each other.</li><li>**Non-relational databases** are often document-oriented. They can store different categories of information without requiring tables that are connected to each other.</li>**Relational_DB_Example:**![](https://i.imgur.com/811noKa.jpg)**Non-Relational_DB_Example:**![](https://i.imgur.com/hpXb2ZK.jpg)<em>[Reference](https://jelvix.com/blog/relational-vs-non-relational-database)</em> |

| # | 26 |
|:---:|:-------- |
| Que | **SQL Aggregation Functions** |
| Ans |<li>They perform a mathematical operation on a result set. Examples include AVG, COUNT, MIN, MAX, and SUM. Often, you'll need GROUP BY and HAVING clauses to complement these aggregations.</li> |

| # | 27 |
|:---:|:-------- |
| Que | **Cache Databases** |
| Ans |<li>Cache databases hold frequently accessed data. They live alongside the main SQL and NoSQL databases. Their aim is to alleviate load and serve requests faster.</li><li>A cache can be partitioned and scaled according to your needs, but it is typically much smaller than your main database.</li><li>***How It Works:*** When a request comes in, it first checks the cache database, then the main database. This way, you can prevent unnecessary and repetitive requests from reaching the main database's server.</li><li>*Example:* Redis</li> |

| # | 28 |
|:---:|:-------- |
| Que | **ETL Challenges** |
| Ans |<li>Heavy data loads</li><li>Long-running, inefficient queries</li><li>Poorly coded mappings</li><li>Incorrect design of source and target systems</li><em>[Reference](https://www.datavail.com/blog/4-issues-that-can-negatively-affect-your-etl-processes/)</em> |

| # | 29 |
|:---:|:-------- |
| Que | **Big Data Design Patterns** |
| Ans | |
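
As a small illustration of the `args`/`kwargs` answer in question 2, here is a minimal Python sketch. The function name `describe` is hypothetical and chosen only for this example; the argument values are the ones used in the note.

```python
def describe(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    # Inside the function, args is a tuple and kwargs is a dict.
    return args, kwargs


# Positional values land in args, named values land in kwargs.
positional, keyword = describe(3, 5, Name="Jaffar", Age=26)
print(positional)  # (3, 5)
print(keyword)     # {'Name': 'Jaffar', 'Age': 26}
```

Note that keyword values such as `"Jaffar"` need quotes. Without them, Python would look for a variable called `Jaffar`.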