Data Modelling is the process of creating data models for storing data in a database. It represents the rules and relationships between various data objects, and it helps in efficiently creating, organizing, maintaining, retrieving and visualizing data while enforcing business rules, policies and compliance requirements. In this article, we will see the most commonly asked interview questions about data modelling for both freshers and experienced professionals.
Data modelling is defined as the process of creating a model that represents the data and the relationships between different data objects so that they can be stored in a database. It is also called database modelling, and the skill is useful across domains like data engineering, data science and software development for preparing, analysing and processing data by continuously reorganizing, restructuring and optimizing it to fit the needs of the company or organization.
Data modelling helps professionals from different domains like software development, business analytics and data science to view the data and understand the relationships between the data objects in a data warehouse or a database. Following are the main benefits of this process:
A data model is an information schema that sorts and normalizes different information components and establishes relations between them. These models subsequently become tables in the database which can be retrieved and processed based on the organization's requirements. There are three types of data models:

- Conceptual data model: a high-level, business-oriented view that defines which entities the system contains and how they relate to each other.
- Logical data model: defines the structure of the data - entities, attributes, keys and relationships - independent of any specific DBMS.
- Physical data model: describes how the model is implemented in a specific database, with tables, columns, data types and indexes.
Dimensions represent the set of unique values that are useful for identifying and categorizing the data from the data warehouse.
Granularity represents the level of detail stored in a database table. Granularity can be high or low: a high-granularity table holds fine, transaction-level records, while a low-granularity table holds aggregated data, such as summary fact tables. Granularity is also a measure of the smallest set of data that can be independently processed by any component, task or application.
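As a minimal sketch (the table and column names here are hypothetical), a high-granularity table keeps one row per transaction, while a low-granularity table keeps only aggregated rows:

```sql
-- Hypothetical high-granularity table: one row per individual transaction.
CREATE TABLE sales_transactions (
    transaction_id INT PRIMARY KEY,
    product_id     INT NOT NULL,
    sold_at        TIMESTAMP NOT NULL,
    amount         DECIMAL(10, 2) NOT NULL
);

-- Hypothetical low-granularity table: one row per product per day,
-- produced by aggregating (summarizing) the detailed rows above.
CREATE TABLE daily_sales_summary (
    product_id  INT NOT NULL,
    sale_date   DATE NOT NULL,
    total_sales DECIMAL(12, 2) NOT NULL,
    PRIMARY KEY (product_id, sale_date)
);
```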
Data sparsity refers to the proportion of empty cells in the database, i.e., how much data actually exists for a particular dimension in a data model. When little of the possible data is present, aggregations built over those dimensions can still consume a large amount of storage space.
Hashing is a technique used to retrieve data by computing its location directly: a hash function applied to an index or key value yields the address where the corresponding record is stored, avoiding a search through the index values.
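As an illustrative sketch, the `CREATE INDEX ... USING HASH` statement below uses PostgreSQL-specific syntax on a hypothetical `orders` table; other databases expose hashing differently:

```sql
-- Hypothetical table of orders.
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL
);

-- PostgreSQL syntax: build a hash index so that equality lookups on
-- customer_id are resolved by hashing the key to a bucket address
-- instead of walking a B-tree.
CREATE INDEX idx_orders_customer_hash ON orders USING HASH (customer_id);

-- Equality predicates like this one can use the hash index.
SELECT * FROM orders WHERE customer_id = 42;
```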
A Database Management System (DBMS) is software consisting of a group of programs that manipulate the database to store and retrieve user data.
OLTP stands for Online Transaction Processing, an operational system that supports transaction-oriented applications in a 3-tier architecture. It focuses on query processing, maintaining data integrity and sustaining transaction throughput (transactions per second) in multi-access environments.
Some examples of OLTP systems are:

- ATM and online banking transactions
- Online flight or railway ticket booking
- Adding items to a cart in e-commerce applications
- Sending a text message
OLAP stands for Online Analytical Processing, a type of technology that enables managers, executives and analysts to gain insights in a fast, secure, consistent and interactive manner. OLAP is used in Business Intelligence solutions involving planning, budgeting, analysis, forecasting, simulation models etc. OLAP lets clients perform analysis across multiple dimensions, providing insights that lead to better decision making.
A data mart is a condensed version of a data warehouse (a subset of it) designed for a specific unit or department such as Marketing, Sales, HR or Finance. An organization can have several data marts, each specific to an individual department, interacting with the data warehouse for information processing as shown in the image below.
Metadata is information about data: it tells what kind of data is stored in the system, what its purpose is and for whom it is intended. There are different types of metadata depending on their purpose:
A relational database is a digital database system based on the relational data model that stores data in tables. Oracle, Teradata, MySQL, PostgreSQL, SQLite etc. are examples of relational database systems. The data is organized into standardized rows and columns and can be queried with SQL. An RDBMS allows tables and column data to be added or altered systematically while maintaining data integrity.
Cardinality represents the level of uniqueness of the data values in a column. A high cardinality means the column has a large percentage of unique values; a low cardinality indicates that values are repeated (duplicated) in the column.
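A quick way to gauge cardinality (assuming a hypothetical `customers` table) is to count distinct values per column:

```sql
-- Counting distinct values per column gives a rough cardinality measure.
SELECT
    COUNT(*)                AS total_rows,
    COUNT(DISTINCT email)   AS email_cardinality,   -- likely high: near one value per row
    COUNT(DISTINCT country) AS country_cardinality  -- likely low: many repeated values
FROM customers;
```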
The process of structuring a relational database according to a series of normal forms, to reduce data redundancy and increase data integrity, is called normalization. It divides larger tables into smaller ones and links them using references/relationships, which reduces data repetition and ensures the data is stored logically. The following image represents the process of normalization.
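Below is a minimal SQL sketch of the idea, using hypothetical order/customer tables: the repeated customer details are split out into their own table and linked back by a foreign key.

```sql
-- Unnormalized: customer details are repeated on every order row.
CREATE TABLE sales_orders_unnormalized (
    order_id       INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_email VARCHAR(100),
    order_total    DECIMAL(10, 2)
);

-- Normalized: customer data lives once in its own table, and orders
-- reference it through a foreign key.
CREATE TABLE customers (
    customer_id    INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_email VARCHAR(100)
);

CREATE TABLE sales_orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    order_total DECIMAL(10, 2)
);
```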
Denormalization is the opposite of normalization: it improves read performance at the expense of write performance by adding redundant copies of data. Duplicate data entries are kept in the database to speed up read-heavy workloads.
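Continuing the hypothetical example above, a denormalized variant copies the customer's name onto each order row so that common reads skip the join:

```sql
-- Denormalized read-optimized copy: the customer's name is duplicated
-- onto each order row so common reads avoid a join, at the cost of
-- extra writes to keep the copies in sync.
CREATE TABLE sales_orders_denormalized (
    order_id      INT PRIMARY KEY,
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100),  -- redundant copy of customers.customer_name
    order_total   DECIMAL(10, 2)
);
```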
A NoSQL database is a type of database that does not require a fixed schema. It is non-relational and usually avoids joins, allowing schema variation to achieve better performance. It is also very easy to scale based on the application's needs, and it is used for storing huge amounts of data in a distributed manner. NoSQL databases are of four types:

- Document databases
- Key-value stores
- Column-family (wide-column) stores
- Graph databases
| OLTP | OLAP |
| --- | --- |
| OLTP stands for Online Transaction Processing. | OLAP stands for Online Analytical Processing. |
| Deals with a huge number of short online transactions. | Deals with large volumes of data. |
| Makes use of a traditional DBMS. | Makes use of huge data warehouses. |
| Tables in the database are normalized. | Tables are not normalized. |
| Response time is in the range of milliseconds. | Responses usually take seconds to minutes. |
| Designed for real-time business applications. | Designed for data analysis measured using categories and attributes. |
| Category | SQL | NoSQL |
| --- | --- | --- |
| Model | Follows the relational model. | Follows a non-relational model. |
| Data | Deals with structured data. | Deals with semi-structured data. |
| Flexibility | Follows a strict schema. | Uses a dynamic schema and is very flexible. |
| Transactions | Follows ACID (Atomicity, Consistency, Isolation, Durability) properties. | Follows BASE (Basically Available, Soft state, Eventually consistent) properties. |
Since one null value is not considered equal to another null value, inserting two null values into a column with a unique constraint does not throw an error in most databases: the rows are simply inserted.
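A small sketch of this behaviour (as found in PostgreSQL, MySQL and Oracle; SQL Server is a notable exception, treating NULLs as equal in unique indexes):

```sql
CREATE TABLE members (
    member_id INT PRIMARY KEY,
    email     VARCHAR(100) UNIQUE  -- unique constraint, but NULLs allowed
);

INSERT INTO members VALUES (1, NULL);
INSERT INTO members VALUES (2, NULL);             -- succeeds: NULL is not equal to NULL
INSERT INTO members VALUES (3, 'a@example.com');
INSERT INTO members VALUES (4, 'a@example.com');  -- fails: duplicate non-null value
```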
There are 5 phases in data modelling:
A schema represents the structure of the data and the relationships between different structures. A star schema consists of a central fact table and the dimension tables connected to it. The primary key of each dimension table is a foreign key in the central fact table. The schema is called a star schema because its entity-relationship diagram can be visualized as a star whose points diverge from the central fact table, as shown below:
The central fact table has two sets of columns: one set holds the facts (measures) and the other holds foreign keys to the dimension tables. Dimensions represent one or more attributes or hierarchies that categorize data; if there are no hierarchies, they are called lists or flat dimensions. The dimension tables are smaller in size than the fact table.
For example, the fact table can store the entire sales information, and the dimension tables store data about clients, products, geographic regions, channels etc., which the sales fact table links to through foreign-key references.
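A minimal star-schema sketch in SQL, with hypothetical table and column names:

```sql
-- Dimension tables: descriptive attributes, hierarchies kept flat.
CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)   -- hierarchy flattened into one table
);

CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

-- Central fact table: measures plus foreign keys into the dimensions.
CREATE TABLE fact_sales (
    sale_id     INT PRIMARY KEY,
    product_id  INT NOT NULL REFERENCES dim_product(product_id),
    customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
    quantity    INT,
    revenue     DECIMAL(12, 2)
);
```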
A snowflake schema is a schema in which one or more dimension tables do not connect directly to the fact table but reach it through other dimension tables. The process of normalizing the dimension tables of a star schema into a snowflake-like structure, with the fact table at the centre, is called snowflaking. It is represented in the image below:
The tables in a snowflake schema are normalized to 3NF, and each dimension level in the schema represents one level of the hierarchy. The central fact table is linked to multiple dimension tables, which in turn can link to other dimension tables through many-to-many or many-to-one relationships. The schema keeps disk requirements to a minimum and joins smaller lookup tables, which can help query performance.
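Continuing the hypothetical product dimension from the star-schema sketch, snowflaking normalizes the category hierarchy out of `dim_product` into its own table:

```sql
-- Snowflaked version of the product dimension: the category hierarchy is
-- normalized out into its own table instead of being flattened into
-- dim_product, so dim_category reaches the fact table only through
-- dim_product.
CREATE TABLE dim_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id  INT NOT NULL REFERENCES dim_category(category_id)
);
```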
A relationship normally connects parent and child tables. There are three types of critical relationships in a data model:

- Identifying relationship: the parent table's primary key is part of the child table's primary key.
- Non-identifying relationship: the parent table's primary key appears in the child table only as a regular (non-key) foreign-key column.
- Self-recursive relationship: a column in a table references the primary key of the same table.
There are two types of visual techniques in data modelling:

- The Entity-Relationship (ER) model
- UML (Unified Modelling Language)
The CAP (Consistency-Availability-Partition Tolerance) theorem says that a distributed system cannot guarantee C, A and P simultaneously; it can provide at most two of the three guarantees.
The following image shows which databases guarantee which aspects of the CAP theorem simultaneously: RDBMS databases guarantee Consistency and Availability; Redis, MongoDB and HBase guarantee Consistency and Partition Tolerance; and Cassandra and CouchDB guarantee Availability and Partition Tolerance.
Recursive relationships occur when an entity has a relationship with itself. These relationships are complex and require more careful mapping of the data to a schema. Consider a doctor marked as a care provider in a health-care database: if the doctor falls sick, he must visit another doctor as a patient, which results in a recursive relationship. To accommodate this, a foreign key referencing the care provider is added to every patient's record, and care must be taken to ensure the recursion has an exit path in such entity relations.
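A minimal sketch of such a self-referencing (recursive) relationship, with hypothetical names:

```sql
-- Every person in the health-care database, doctor or patient, is a row
-- in one table, and each patient's row points back into the same table
-- at the doctor treating them: a recursive (self-referencing) relationship.
CREATE TABLE person (
    person_id        INT PRIMARY KEY,
    full_name        VARCHAR(100) NOT NULL,
    care_provider_id INT REFERENCES person(person_id)  -- nullable: the exit path for the recursion
);
```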
Dimensional modelling is a data-structuring technique used to optimize data storage and enable faster data retrieval in a data warehouse. It consists of two types of tables: fact tables and dimension tables. Fact tables store transactional data along with foreign keys from the dimension tables that qualify the data.
The purpose of Dimensional modelling is to achieve faster data retrieval and not the highest degree of normalization.
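Assuming the hypothetical star-schema tables sketched earlier, a typical dimensional query aggregates fact measures sliced by dimension attributes:

```sql
-- Aggregate the fact measures, sliced by attributes of two dimensions.
SELECT
    p.category,
    c.region,
    SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY p.category, c.region;
```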
There are five types of dimensions:
Reverse engineering is the process of building a data model from an existing database. It is most beneficial when an organization has an existing, outdated database solution: during tech modernization, instead of developing a solution from scratch, it is more efficient to reconstruct the data model from the existing system, tweak it to match the vision, and apply the changes back to the current system. The steps in reverse engineering are summarised as follows:
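At a high level, the schema is extracted from the existing database, a data model is generated from it, and the model is then refined and documented. As a minimal sketch of the extraction step, the standard `information_schema` catalog (available in PostgreSQL and MySQL, among others) can list every table and column:

```sql
-- Listing tables and columns from the catalog is the raw input for
-- rebuilding a data model from an existing database.
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'  -- 'public' is PostgreSQL's default schema
ORDER BY table_name, ordinal_position;
```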
The third normal form (3NF) is used to prevent data duplication and data anomalies. A relation in second normal form is in 3NF when no non-prime attribute depends transitively on the key. Equivalently, a relation achieves 3NF if at least one of the conditions below is met for every non-trivial functional dependency A -> B:

- A is a super key of the relation, or
- B is a prime attribute, i.e., every attribute in B belongs to some candidate key.
For a table to achieve 3NF, it should first be in 1NF (First Normal Form) and 2NF (Second Normal Form). The rows in the table should depend only on the keys; if the contents of a field apply to more than one primary key, they should be put in another table.
For instance, if a patient's record stores the doctor's phone number alongside the patient's details, it does not meet the 3NF standard: since a doctor can have more than one patient, the doctor's number belongs in the doctor table to avoid duplicating it in the patient table.
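A SQL sketch of this example, with hypothetical table names:

```sql
-- Violates 3NF: doctor_phone depends on doctor_id rather than directly on
-- the patient key, and it is duplicated for every patient of the same doctor.
CREATE TABLE patient_bad (
    patient_id   INT PRIMARY KEY,
    patient_name VARCHAR(100),
    doctor_id    INT,
    doctor_phone VARCHAR(20)  -- transitive dependency: patient_id -> doctor_id -> doctor_phone
);

-- 3NF fix: move the doctor's details into their own table.
CREATE TABLE doctor (
    doctor_id    INT PRIMARY KEY,
    doctor_phone VARCHAR(20)
);

CREATE TABLE patient (
    patient_id   INT PRIMARY KEY,
    patient_name VARCHAR(100),
    doctor_id    INT REFERENCES doctor(doctor_id)
);
```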
Some of the mistakes encountered in data modelling are:
Data modelling is the process of developing a data model for storage in a database as per the business requirements. It ensures consistency, enforces naming conventions, semantics and security, and safeguards data quality. In the current era of technology, data has become the new oil: it is very important that data is structured, developed, utilised and presented effectively for the growth of an organization. Because of this, data modelling has gained immense popularity and importance among database architects, software developers and business analysts. In this article, we have seen the most commonly asked interview questions in the field of data modelling for both freshers and experienced professionals.