# Key Components of Modern Data Stacks and Infrastructure

![](https://lh5.googleusercontent.com/vpgNIezn1Tq6-MDjZ_cnCeDSPf4QRKs7rvgaipPuDdWpVxKspWigryrhGIrYRcy4p1NevQ9gB9NBWUrBCe-GzrATW3trO4UJoRIokQxAy_gxNRj9XmQ6QiSkvcfjhJwoMUlZMh6zD9w)

Image by Author

Before the advent of big data, data was mostly handled on premises, and most applications used a single tool for all their data needs. The quantity of data was not huge, nor was the number of requests per second alarmingly high. That situation has changed completely: many applications now require a multitude of tools to deliver value to users. In this article, we will discuss the need for a modern data stack and its key components.

Why Do We Require a Modern Data Stack?
--------------------------------------

To answer this question, let's think about the [data architecture of Uber](https://eng.uber.com/uber-big-data-platform/). Being the huge data company it is, Uber handles hundreds of thousands of requests for food delivery and cab rides per second. Uber runs offline models, online models with either streaming or batch processing capabilities, databases with driver and customer location information, and more. Whenever we request a cab ride, Uber searches millions of driver records to find the drivers nearest to us, and it does so extremely efficiently.

It goes without saying that all these workloads can't be handled by a single tool. For offline models, Uber uses HDFS with Spark, and for online models it uses Cassandra, Kafka, and HDFS, because the requirements for the two differ. For offline models, the incoming data is huge but predictions don't have to be served in real time. For online models, the data is comparatively smaller but predictions must happen in real time. This is why we require a [modern data stack](https://www.fivetran.com/blog/what-is-the-modern-data-stack) that handles different use cases in the most efficient manner.

![](https://lh4.googleusercontent.com/fVRYbNFRm2f6J-KynP-y8nRDRW7E9oJjLcJQ-6l7Ji2n6-LCmK2BZQnhydT0APjnr-N--WK2DpZh5gVwKWBCycSXmnF6LEBcSatGivcF7o1mJqWhWoaN7_OsHvJUh2qQkUSATXlVP7g)

[Source](https://allbigdatathings.blogspot.com/2019/04/uber-data-architecture.html)

Components of Modern Data Stacks
--------------------------------

The main objective of a modern data stack is to enable analysts and engineers to comb through piles of data and identify areas of business opportunity. The advent of cloud-based data solutions has made this objective easily attainable. Cloud-based solutions are cheaper and very easy to configure, as most of the underlying complexity is abstracted away. The following are the key components that constitute the modern data stack.

### Data Pipeline

The data pipeline is responsible for the transformation and movement of data from one source to another. The source can be a database, a data warehouse, an analytics application, a payment processing platform, or a machine learning model. The output of one transformation step is given as input to the next. Usually, the pipeline runs as a series of sequential steps, but parallel execution can also be configured based on one's needs. A data pipeline should offer many inbuilt connectors as well as the ability to create custom connectors, since the reach of the pipeline depends on the number of tools it can connect to. A robust data pipeline is a key component of the modern data stack, as it saves us from a lot of manual work; a minimal sketch of such a pipeline follows.
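As a rough, tool-agnostic illustration (not any particular vendor's API), the sketch below chains an extract, transform, and load step in plain Python. The `orders.csv` source file, the `orders` table, and the SQLite "warehouse" are hypothetical stand-ins for real connectors, and the malformed-record check stands in for the inbuilt checks discussed next.

```python
# A minimal ETL-style pipeline sketch in plain Python.
# The CSV source, the transformation rules, and the SQLite destination
# are hypothetical placeholders for real connectors in a pipeline tool.
import csv
import sqlite3


def extract(path: str) -> list[dict]:
    """Read raw rows from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[dict]:
    """Clean and reshape rows; drop records that fail a basic check."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"order_id": row["order_id"],
                            "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue  # a simple inbuilt check: skip malformed records
    return cleaned


def load(rows: list[dict], db_path: str = "warehouse.db") -> None:
    """Write transformed rows into a toy warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (:order_id, :amount)", rows)
    con.commit()
    con.close()


def run_pipeline(source: str) -> None:
    """Each step's output feeds the next, as in a real pipeline."""
    load(transform(extract(source)))


if __name__ == "__main__":
    run_pipeline("orders.csv")  # hypothetical source file
```

In a real stack, each step would typically be a managed connector or a task in an orchestrator rather than a plain function, but the chaining pattern is the same.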
The inbuilt checks in the data pipeline make sure that each transformation succeeds, and on failure the pipeline can be configured to retry a number of times. ETL, ELT, and [reverse ETL](https://www.firebolt.io/glossary-items/reverse-etl) are all types of data pipelines. The orchestration capabilities of a data pipeline provide a centralized view of the entire movement of data, which makes administration and monitoring easy. Data drift can be detected by integrating statistical tests into the pipeline; this is an excellent use case, as we can investigate the reason for the drift and make timely business decisions.

![](https://lh3.googleusercontent.com/hBs8z6HQJIPYKDeDzbloOi_aP4tmQJShhQAGCojjoQA7HbRlXlOSRvaKWM7ZYb1jJm4GJKCD9PNpTO9yXqdk2gmz93qv6_zP_YEHXbHWEeia-1L3qGNwuWksC0hPfNcSO1RB6BHKBjw)

[Source](https://www.stonebranch.com/blog/automate-big-data-pipelines-centralized-orchestration)

### Cloud Data Warehouse and Data Lake

Data warehouses and data lakes are used for storing big data. A cloud data lake holds raw data in an unorganized format. Cloud data lakes suit "good to have" data: data we might use in the future or data that has no structure at all. Because the data is raw, working with a lake can be costly, and it is best accessed by data scientists or engineers who can transform the data into a structured format. On the other hand, a [cloud data warehouse](https://www.firebolt.io/blog/cloud-data-warehouse) stores information in a structured format, so it can be accessed by non-technical users as well, and its data can be fed directly into business intelligence and visualization tools. Cloud data warehouses are economical to work with because they hold only transformed data. Based on your needs, you can choose either of them or both. The data lakehouse is a newer concept that combines the flexibility of data lakes with the structural integrity of data warehouses.

### Data Versioning Tool

[Data versioning](https://towardsdev.com/combining-data-versioning-with-big-data-storage-14f25c3f9b12) is the process of creating a unique reference for a subset of incoming data. The data is stored along with metadata, which includes a timestamp, the source, and last-edit information. Data versioning tools are a recent addition to the modern data stack, driven mainly by the paradigm shift toward treating data as a first-class asset to be versioned, much like code. Reproducibility is an extremely important aspect of machine learning: by versioning data along with its metadata, we can reproduce the results of experiments. Versioning data also improves collaboration between teams, thereby breaking silos, and helps us meet audit and compliance requirements efficiently.

### Data Visualization Platform

We ingest data from multiple sources, transform billions of records, and store them in specialized data infrastructure for the sole purpose of extracting business insights from the data. The data has to be presented in a way that lets users gain those insights as easily as possible. Visualization is the preferred method, since humans understand images far better than raw numbers. The plots should not be complex, as non-technical users will work with them. Modern data visualization platforms solve this by offering drag-and-drop interfaces with drill-down and group-by functionality.

### Real-Time and Batch Ingestion

Real-time or streaming data ingestion means capturing data as soon as it is created or arrives at a source.
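As a rough sketch of the difference, the code below contrasts processing each event the moment it arrives with collecting events into small batches and flushing them together; the in-memory event source, the batch size, and the sleep interval are all hypothetical placeholders for a real message broker.

```python
# A toy contrast between event-at-a-time (streaming) ingestion and
# micro-batch ingestion. The in-memory event source, batch size, and
# sleep interval are hypothetical placeholders for a real broker.
import time
from typing import Iterable, Iterator


def event_source(n: int = 10) -> Iterator[dict]:
    """Simulate events arriving one at a time."""
    for i in range(n):
        time.sleep(0.05)  # pretend events trickle in
        yield {"event_id": i, "value": i * 10}


def stream_ingest(events: Iterable[dict]) -> None:
    """Real-time style: handle every event as soon as it arrives."""
    for event in events:
        print("processed immediately:", event)


def micro_batch_ingest(events: Iterable[dict], batch_size: int = 4) -> None:
    """Micro-batch style: accumulate a few events, then flush them together."""
    batch: list[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            print("flushing batch:", batch)
            batch.clear()
    if batch:  # flush whatever is left at the end
        print("flushing final batch:", batch)


if __name__ == "__main__":
    stream_ingest(event_source())
    micro_batch_ingest(event_source())
```

Real micro-batch systems typically flush on a time interval as well as a size threshold; the size-only trigger here just keeps the sketch short.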
Streaming tools must be able to read and write events concurrently and scale on demand. In some cases, they may even have to function as short-lived databases. They must be fault tolerant and recover gracefully from failures. Batch ingestion is the most common form of data ingestion: data is ingested in batches at periodic intervals, and the batches are usually large. Micro-batch ingestion is often mentioned interchangeably with real-time ingestion, but it differs in that it takes data in small batches at very frequent intervals. Because of this frequency, the ingestion appears to happen in real time.

Conclusion
----------

Storage, orchestration, transformation, monitoring, business intelligence, and streaming and batch processing are the common operations in a data-intensive application. The right tool has to be chosen for the use case at hand. Each tool in the stack has a well-defined scope and performs a specific function, and these tools integrated together constitute the modern data stack.