ATNBD - sample questions in a nutshell

1. Basic DWH definitions.

  • data warehouse - a data management system that is designed to enable and support business intelligence activities within an organization
  • dimension - a single piece of information about a given event
  • fact - a single event described by dimensions
  • dimension table - a table containing dimensions
  • fact table - a table containing facts
  • measure - a numerical attribute in a fact table that can be aggregated
  • OLAP cube - a multidimensional data structure optimized for very quick data analysis
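
To make the definitions concrete, a minimal sketch in HiveQL (all names are illustrative): each row of sales_fact is a fact, quantity and amount are measures, and product_dim is a dimension table whose columns hold the dimensions describing a sale.

-- Dimension table: descriptive attributes (dimensions) of a product
CREATE TABLE product_dim (
  product_id   INT,
  product_name STRING,
  category     STRING
);

-- Fact table: one row per sale (one fact), with keys pointing to the
-- dimension tables and numeric measures that can be aggregated
CREATE TABLE sales_fact (
  date_id    INT,
  product_id INT,
  store_id   INT,
  quantity   INT,            -- measure
  amount     DECIMAL(10,2)   -- measure
);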

2. Comparison of star schema vs snowflake schema.

In a star schema each logical dimension is denormalized into one table,
while in a snowflake, at least some of the dimensions are normalized.

| Star Schema | Snowflake Schema |
| --- | --- |
| denormalized structure | normalized structure |
| high redundancy | low redundancy |
| uses more space | uses less space |
| simple design | complex design |
| easy to understand | hard to understand |
| simple joins | complex joins |
| faster query execution | slower query execution |
| faster cube processing | slower cube processing |
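
As an illustrative sketch (all names are hypothetical), the same product dimension can be modeled both ways: in a star schema it is a single denormalized table, while in a snowflake schema the category attributes are normalized out into their own table.

-- Star schema: one denormalized product dimension (category data repeated per product)
CREATE TABLE product_dim_star (
  product_id    INT,
  product_name  STRING,
  category_name STRING,
  category_dept STRING
);

-- Snowflake schema: the category attributes are normalized into their own table
CREATE TABLE category_dim (
  category_id   INT,
  category_name STRING,
  category_dept STRING
);

CREATE TABLE product_dim_snowflake (
  product_id   INT,
  product_name STRING,
  category_id  INT    -- references category_dim
);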

3. Characteristics of a multidimensional model.

The multidimensional data model is an integral part of On-Line Analytical Processing. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not during batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time.

It is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes.

The simplicity of the model is inherent because it defines objects that represent real-world business entities. Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.
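
In SQL terms, a rough sketch of this idea is an aggregation over several dimensions at once; Hive, for example, supports GROUP BY ... WITH CUBE, which computes a measure over every combination of the listed dimension columns (the sales_fact table is the illustrative one from question 1).

-- Total sales over every combination of the two dimensions:
-- (product), (store), (product, store), and the grand total
SELECT product_id, store_id, SUM(amount) AS total_amount
FROM sales_fact
GROUP BY product_id, store_id WITH CUBE;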

4. Hive - DWH on Hadoop.

The Apache Hive™ data warehouse software is a project built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Hive is NOT:
  • a relational database,
  • a design for OnLine Transaction Processing (OLTP),
  • a tool for real-time queries and row-level updates.
Features of Hive:
  • It stores schema in a database and processed data in HDFS.
  • It is designed for OLAP.
  • It provides an SQL-like query language called HiveQL (see the sketch below).
  • It is familiar, fast, scalable, and extensible.
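
A minimal HiveQL sketch (table name, columns, and file layout are illustrative) of the kind of SQL-like workflow Hive provides over data stored in HDFS:

-- The schema goes into the metastore; the data itself lives as files in HDFS
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- A familiar SQL-style query, compiled by Hive into distributed jobs
SELECT name, salary
FROM employees
WHERE salary > 5000;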

5. Hive transactions.

Hive Transactions - documentation

TL;DR

With the addition of transactions in Hive 0.13 it is now possible to provide full ACID semantics at the row level, so that one application can add rows while another reads from the same partition without interfering with each other. If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional=true" must be set on that table. Note, once a table has been defined as an ACID table via TBLPROPERTIES ("transactional"="true"), it cannot be converted back to a non-ACID table, i.e., changing TBLPROPERTIES ("transactional"="false") is not allowed.

Key characteristics:
  • Only ORC file format is supported.
  • By default transactions are configured to be off.
  • Tables must be bucketed to make use of these features.
  • External tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor.
  • Reading/writing to an ACID table from a non-ACID session is not allowed.
  • LOAD DATA statement is not supported with transactional tables.
  • BEGIN, COMMIT, and ROLLBACK are not yet supported.
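
As a hedged sketch of how this looks in practice (table name and columns are illustrative; the SET commands show typical transaction settings, but the exact configuration depends on the Hive version):

-- Typical transaction settings (assumption: defaults vary by Hive version)
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ORC + bucketing + "transactional"="true" makes the table an ACID table
CREATE TABLE employees_acid (
  id     INT,
  name   STRING,
  salary DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- Row-level writes are now allowed
INSERT INTO TABLE employees_acid VALUES (1, 'Ann', 4500.00);
UPDATE employees_acid SET salary = 4800.00 WHERE id = 1;
DELETE FROM employees_acid WHERE id = 1;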

6. Hive Metastore and its architecture.

[Figure: Hive architecture]

The figure above shows the major components of Hive and its interactions with Hadoop.
The five main components of Hive are:

UI – The user interface for users to submit queries and other operations to the system.

Driver – The component which receives the queries.

Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the Metastore.

Metastore – The component that stores all the information about the structure of various tables like their schema (column types), location on HDFS, and partitions in a relational database.

Execution Engine – The component which executes the execution plan created by the Compiler on the appropriate system components.

The figure also shows how a typical query flows through the system:
  1. The Hive UI sends the query to the Driver to execute.
  2. The Driver passes the query to the Compiler, which parses it, checks the syntax, and works out the query plan and its requirements.
  3. The Compiler sends metadata request to the Metastore.
  4. The Metastore sends metadata as a response to the Compiler.
  5. The Compiler checks the requirements and sends the generated plan back to the Driver. At this point, the parsing and compiling of the query is complete. The plan generated by the Compiler is a DAG of stages, with each stage being either a MapReduce job, a metadata operation, or an operation on HDFS.
  6. The Driver sends the execution plan to the Execution Engine.
  7. The Execution Engine submits these stages as MapReduce jobs to appropriate Hadoop components.
  8. The Execution Engine receives the results from Data Nodes.
  9. The Execution Engine sends those results to the Driver.
  10. The Driver sends the results to the Hive UI.

Metastore

The Metastore provides two important but often overlooked features of a data warehouse:
  1. Without the data abstraction provided in Hive, a user has to provide information about data formats, extractors and loaders along with the query. In Hive, this information is given during table creation and reused every time the table is referenced. This is very similar to the traditional warehousing systems.
  2. The second functionality, data discovery, enables users to discover and explore relevant and specific data in the warehouse. Other tools can be built using this metadata to expose and possibly enhance the information about the data and its availability. Hive accomplishes both of these features by providing a metadata repository that is tightly integrated with the Hive query processing system so that data and metadata are in sync.
The Metastore can be configured to be used in a couple of ways:
  1. In embedded mode, the Hive client directly connects to an underlying Metastore using JDBC. By default a Derby database is used which supports only one user. Embedded mode is not really a practical solution but works well for unit tests.
  2. In local mode, the Hive Metastore service and the Metastore database run in separate processes. It supports many Hive sessions, i.e. many users can use the Metastore at the same time. We can achieve this by using any JDBC-compliant database, such as MySQL, running in a separate JVM or on a different machine.
  3. In remote mode, the Metastore is a Thrift service. The Hive Metastore service runs in its own separate JVM process.
The Hive Metastore supports five databases:
  • Derby
  • MSSQL
  • MySQL
  • Oracle
  • Postgres

7. Hive tables.

Key difference

When you drop a table:

  • if it is managed, Hive deletes both data and metadata,
  • if it is external, Hive deletes only metadata.
Comparison
| Managed Tables | External Tables |
| --- | --- |
| on drop - Hive deletes both data and metadata | on drop - Hive deletes only metadata |
| the default table type | require the 'external' keyword |
| data can be deleted accidentally | data is protected against accidental drops |
| on load - Hive moves data into the warehouse | on load - Hive uses data present in remote locations |
| Hive manages and controls the data | Hive does not manage or control the data |
| for temporary data | for long-term data |
| truncate, ACID, caching, and materialized views work | truncate, ACID, caching, and materialized views do not work |
Conclusion
  • If you want to check whether an existing table is managed or external, you can use the 'describe formatted <table_name>;' command.
  • Managed tables are like normal database tables in which we can store data.
  • External tables describe the metadata (schema) of external files. Their files can be accessed and managed by processes outside of Hive. They can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. Hive does not manage this data.
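
A minimal sketch of both table types (names and the HDFS location are hypothetical); the syntactic differences are the EXTERNAL keyword and, usually, a LOCATION clause:

-- Managed table: Hive owns the files under its warehouse directory;
-- DROP TABLE removes both data and metadata
CREATE TABLE logs_managed (
  ts  TIMESTAMP,
  msg STRING
);

-- External table: Hive only records the schema and the location;
-- DROP TABLE removes the metadata, the files stay in place
CREATE EXTERNAL TABLE logs_external (
  ts  TIMESTAMP,
  msg STRING
)
LOCATION '/data/raw/logs';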

8. Hive partitioning and bucketing.

Data in Apache Hive can be categorized into tables, partitions, and buckets. Both partitioning and bucketing distribute data. Hive organizes tables into partitions and subdivides partitions into buckets.

Partitioning

Description

Apache Hive organizes tables into partitions, grouping the same type of data together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Partitions make queries on slices of the data faster.

A partition in Hive is nothing but a subdirectory in the table directory. For example, suppose we have data for three departments in our table - Technical, Marketing and Sales. We will then have one partition per department, three in total, with every department in a separate subdirectory under the table directory. Queries about Technical employees only have to look through the data present in the Technical partition. This reduces query latency by scanning only the relevant partitioned data instead of the whole data set.

Partitioning data is often used for distributing load horizontally. This has a performance benefit and helps organize data in a logical fashion. For example, if we are dealing with a large employee table and often run queries restricted to a particular country or department, we can partition the Hive table for a faster response.

However, a design that creates too many partitions may optimize some queries but be detrimental to other important queries. Another drawback of having too many partitions is the large number of Hadoop files and directories created unnecessarily, and the overhead on the NameNode, which must keep all file system metadata in memory.

Query add-on

PARTITIONED BY (partition1 data_type, partition2 data_type);

Bucketing

Description

In Hive, table partitions may be subdivided into many buckets based on the hash of a column in the table, giving the data extra structure that can be used for more efficient queries. The hash function depends on the type of the bucketing column. We can use bucketing when the implementation of partitioning becomes difficult. It is a technique for decomposing data sets into more manageable parts that provide faster query response. We can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be taken care of manually.

For example, suppose that using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the table and use employee_id as the bucketing column, the values of this column will be hashed into a user-defined number of buckets.

Records with the same employee_id will always be stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket will contain many employee_id values.

The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins.

Bucketing needs to be enabled:
set hive.enforce.bucketing = true;

For dynamic partition inserts, the settings below also need to be enabled:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Query add-on

[PARTITIONED BY (partition1 data_type, partition2 data_type, ...)]
CLUSTERED BY (column_name1, column_name2, ...)
[SORTED BY (column_name [ASC|DESC], ...)]
INTO <num_buckets> BUCKETS;
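
Putting the two add-ons together, a hedged end-to-end sketch (table, columns, and values are illustrative):

CREATE TABLE employees_part (
  id     INT,
  name   STRING,
  salary DECIMAL(10,2)
)
PARTITIONED BY (department STRING)    -- one subdirectory per department
CLUSTERED BY (id) INTO 8 BUCKETS      -- rows hashed on id into 8 buckets per partition
STORED AS ORC;

-- Static partition insert: the partition value is given explicitly
INSERT INTO TABLE employees_part PARTITION (department = 'Technical')
VALUES (1, 'Ann', 4500.00);

-- A filter on the partition column scans only the matching subdirectory
SELECT name, salary
FROM employees_part
WHERE department = 'Technical';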

9. Apache Kylin.

Apache Kylin™ is an open source distributed analytical engine designed to provide OLAP (Online Analytical Processing) capability in the big data era.

By renovating the multi-dimensional cube and precalculation technology on Hadoop and Spark, Kylin is able to achieve near constant query speed regardless of the ever-growing data volume. Reducing query latency from minutes to sub-second, Kylin brings online analytics back to big data.

Apache Kylin™ lets you query billions of rows at sub-second latency in 3 steps:
  1. Identify a Star/Snowflake Schema on Hadoop.
  2. Build Cube from the identified tables.
  3. Query using ANSI-SQL and get results in sub-second, via ODBC, JDBC or RESTful API.

Apache Kylin™ can also integrate with your favorite BI tools like Tableau and PowerBI etc., to enable BI on Hadoop.
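
The queries themselves are plain ANSI SQL over the star/snowflake schema the cube was built from; a hypothetical example of the kind of aggregation Kylin can answer from its precalculated cube (using the illustrative tables from question 1):

-- Served from the precomputed cube instead of scanning the raw fact data
SELECT d.category, SUM(f.amount) AS total_amount, COUNT(*) AS sales_cnt
FROM sales_fact f
JOIN product_dim d ON f.product_id = d.product_id
GROUP BY d.category;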

10. Kafka and Hive integration.

Description

The Hive Kafka Storage Handler allows users to connect to, analyze, and transform data in Kafka topics via Hive. There are two table properties that must be set:

  1. kafka.topic - the Kafka topic to connect to
  2. kafka.bootstrap.servers - the Kafka broker connection string

The workflow is as follows:

  1. The user will create an external table that is a view over one Kafka topic.
  2. The user will be able to run any SQL query, including writing back to the same table or to a different Kafka-backed table.
Query add-on

STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
"kafka.topic" = "test-topic",
"kafka.bootstrap.servers" = "localhost:9092");

Both properties are mandatory. On the write path, if the topic does not exist, it will be created, provided that the Kafka broker admin policy allows auto topic creation.

By default the serializer and deserializer is JSON.
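
A fuller, hedged sketch of the workflow (topic name, columns, and broker address are illustrative; recent versions of the handler also expose Kafka metadata columns such as __partition, __offset, and __timestamp):

-- External table backed by a Kafka topic; records are JSON by default
CREATE EXTERNAL TABLE kafka_clicks (
  `user` STRING,
  page   STRING,
  ts     TIMESTAMP
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "test-topic",
  "kafka.bootstrap.servers" = "localhost:9092"
);

-- Plain SQL over the live topic
SELECT page, COUNT(*) AS views
FROM kafka_clicks
GROUP BY page;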

11./12. Apache Kafka.

Apache Kafka® is a distributed streaming platform that:
  1. Publishes and subscribes to streams of records, similar to a message queue or enterprise messaging system.
  2. Stores streams of records in a fault-tolerant durable way.
  3. Processes streams of records as they occur.
Kafka is used for these broad classes of applications:
  1. Building real-time streaming data pipelines that reliably get data between systems or applications.
  2. Building real-time streaming applications that transform or react to the streams of data.
Kafka architecture

Elements of Kafka architecture:
  • producer - an application that sends messages to the topic
  • consumer - an application that receives messages from the topic
  • broker - a node/server in a Kafka cluster, containing topics
  • topic - a category name to which messages are stored and published
  • message - any information (data) that is sent from the producer to the consumer through Apache Kafka
  • partition - a division that parallelizes a topic by splitting its data across multiple brokers
  • log - a queue inside the partition in which messages are stored
  • consumer group - a group of consumers reading from a specific topic
Consumer group explanation

Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic. If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from. If you have more partitions than consumers then consumers will receive messages from multiple partitions. If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.

The following picture from the Kafka documentation describes the situation with multiple partitions of a single topic. Server 1 holds partitions 0 and 3 and Server 2 holds partitions 1 and 2. We have two consumer groups, A and B. A is made up of two consumers and B is made up of four consumers. Consumer Group A has two consumers of four partitions — each consumer reads from two partitions. Consumer Group B, on the other hand, has the same number of consumers as partitions and each consumer reads from exactly one partition.

13. Parts of MR - mapper, combiner, reducer.

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.

  1. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
  2. Reduce takes the output from the Mapper as an input and combines those data tuples into a smaller set of tuples.

The reduce tasks are always performed after the map jobs.

Mapper

The Mapper task is the first phase of processing; it processes each input record and generates an intermediate key-value pair. The Hadoop Mapper stores this intermediate output on the local disk. The output key-value pairs can be completely different from the input pair. Before the output of each Mapper task is written, it is partitioned on the basis of the key and then sorted. This partitioning ensures that all the values for each key are grouped together.

Mapper only understands key-value pairs of data, so before passing data to it, data should be first converted into key-value pairs.

Combiner

Combiner, also known as a semi-reducer, is an optional class that is used in between the Mapper and the Reducer to reduce the volume of data transfer between them. It operates by accepting the inputs from the Mapper class and thereafter passing the output key-value pairs to the Reducer class.

The main function of the Combiner is to summarize the Mapper output records with the same key as the output of the Mapper is usually large. The output (key-value collection) of the Combiner will be sent over the network to the actual Reducer task as an input.

Reducer

Reducer takes the output of the Mapper (or Combiner; any intermediate key-value pair) and processes each of them to generate the final output of the MapReduce task which is then stored in HDFS. It usually does the final aggregation and summation.

The Hadoop Reducer takes the set of intermediate key-value pairs produced by the Mapper as input and runs a Reducer function on each of them. One can aggregate, filter, and combine this data (key, value) in a number of ways for a wide range of processing. The Reducer first processes the intermediate values for a particular key generated by the map function and then generates the output.

14. Apache Pig.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
Relational operators
Which you used:
DISTINCT, FILTER, FOREACH, GROUP, JOIN, LIMIT, LOAD, ORDER BY, STORE, UNION
Which you did not:
ASSERT, COGROUP, CROSS, CUBE, DEFINE, IMPORT, NATIVE, RANK, SAMPLE, SPLIT, STREAM

15./16./17. Wide column databases (Apache HBase).

Column-Oriented Databases (NoSQL)

Wide column stores, also called extensible record stores, store data in records with an ability to hold very large numbers of dynamic columns. Their architecture uses persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale). Column Family stores do not follow the relational model, and they aren't optimized for joins. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.

They share with document stores the characteristic of being schema-free; however, the implementation is very different. Several columns make up a column family with multiple rows, and the rows may not have the same number of columns. In simple terms, the information stored in several rows of an ordinary relational database can fit in one column of a columnar database. The columns can also have different names and datatypes.

Wide column stores must not be confused with the column oriented storage in some relational systems. This is an internal concept for improving the performance of an RDBMS for OLAP workloads and stores the data of a table not record after record but column by column.

The most popular examples of wide column stores are Apache Cassandra, Apache HBase, Microsoft Azure Cosmos DB, Cloud Bigtable.

Column-oriented databases are used for:
  • sensor logs (IOT)
  • user preferences
  • geographic information
  • reporting systems
  • time series data
  • logging and other write heavy applications

Wide Column databases are not the preferred choice for applications with ad-hoc query patterns, high level aggregations and changing database requirements. This type of data store does not keep good data lineage.

Data model of wide column stores

In Column Family databases, data is stored in cells of columns, grouped into column families. These databases are implemented as a multidimensional, nested, sorted map of maps. The innermost map constitutes a version of the data identified by a timestamp and stored in a cell. A cell is mapped to a column, which in turn is mapped to a column family. A set of column families is identified using a row key.
Reads and writes are performed using the row key on sets of columns. These columns are stored as a continuous entry on disk, which enhances performance.

Column store databases use a concept called a keyspace. A keyspace is kind of like a schema in the relational model. The keyspace contains all the column families (kind of like tables in the relational model), which contain rows, which contain columns.

  • keyspace
    • column family
      • row key
        • column
Some key benefits of columnar databases include:
  • Compression - Column stores are very efficient at data compression and/or partitioning.
  • Aggregation queries - Due to their structure, columnar databases perform particularly well with aggregation queries (such as SUM, COUNT, AVG, etc).
  • Scalability - Columnar databases are very scalable. They are well suited to massively parallel processing (MPP), which involves having data spread across a large cluster of machines – often thousands of machines.
  • Fast to load and query - Columnar stores can be loaded extremely fast. A billion row table could be loaded within a few seconds. You can start querying and analysing almost immediately.

Key benefits of column store databases include faster performance in load, search, and aggregate functions. They are scalable and can read billions of records in seconds. They are also efficient in data compression and partitioning.

Moreover, each column does not span beyond its row. Wide columnar databases are mainly used in highly analytical and query-intensive environments. This includes areas where large volumes of data items require aggregate computing. For example, wide columnar databases are suitable for data mining, business intelligence (BI), data warehouses, and decision support.

Apache HBase

Apache HBase™ is an open-source, distributed, scalable NoSQL big data store that allows billions of rows of big data access in seconds.

HBase is a column-oriented NoSQL database. Although it looks similar to a relational database which contains rows and columns, it is not a relational database. Relational databases are row-oriented while HBase is column-oriented. It means that entries in a column are stored in contiguous locations on disks.

When we need to process and analyze large sets of semi-structured or unstructured data, as in Online Analytical Processing (data mining, data warehousing, etc.), we use the column-oriented approach. Online Transaction Processing domains such as banking and finance, which handle structured data and require transactional (ACID) properties, use the row-oriented approach.

HBase data model

  • tables - Data is stored in a table format that consists of multiple rows. But here tables are in the column-oriented format.
  • row keys - Row keys are used to search records which make searches fast. Rows are sorted alphabetically by the row key as they are stored. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other.
  • column families - Various columns are combined in a column family. These column families are stored together which makes the searching process faster because data belonging to same column family can be accessed together in a single seek.
  • columns - Consists of a column family and a column qualifier. Each column’s name is known as its column qualifier.
  • column qualifiers - A column qualifier is added to a column family to provide the index for a given piece of data.
  • cells - The data is dumped into cells which are specifically identified by a row key and column qualifiers. A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value’s version.
  • timestamps - A combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
Put more simply, we can say that:
  • HBase is a set of tables, where
  • each table has column families and rows and
  • row key acts as a primary key.
  • Any access to HBase tables uses this primary key and
  • each column qualifier denotes the attribute corresponding to the object which resides in the cell.
Key features
  • Linear and modular scalability
  • Strictly consistent reads and writes
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers
  • Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
  • Easy to use Java API for client access
  • Extensible jruby-based (JIRB) shell
HBase commands
| Command | Description |
| --- | --- |
| create <'tablename'>, <'columnfamilyname'> | create a table in HBase with the given name and column family |
| list | display all the tables that are present or created in HBase |
| describe <'tablename'> | give more information about the column families present in the mentioned table |
| disable <'tablename'> | start disabling the named table |
| enable <'tablename'> | start enabling the named table |
| drop <'tablename'> | drop the table present in HBase |
| alter <'tablename'>, NAME => <'columnfamilyname'>, VERSIONS => <'versions'> | alter the column family schema |
| put <'tablename'>, <'rowname'>, <'columnname'>, <'value'> | put a cell 'value' at the specified table, row, and column |
| get <'tablename'>, <'rowname'> | get the contents of a row or cell from the table |
| delete <'tablename'>, <'rowname'>, <'columnname'> | delete the cell value at the specified table, row, and column |
| truncate <'tablename'> | truncate the table; the schema remains but not the records |
| scan <'tablename'> | scan the entire table and display its contents |

We can pass several optional specifications to the scan command to get more information about the tables present in the system. Scan specifications may include many attributes:

Scan attributes:
TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS, CACHE, STARTROW, ENDROW

scan 'mytable', {STARTROW => 'abc', ENDROW => 'abd'}

18. Query optimizer.

The query optimizer (called simply the optimizer) is built-in database software that determines the most efficient method for a SQL statement to access requested data.

The optimizer attempts to generate the optimal execution plan for a SQL statement.

It chooses the plan with the lowest cost among all considered candidate plans. The optimizer uses available statistics to calculate the cost. For a specific query in a given environment, the cost computation accounts for factors of query execution such as I/O, CPU, and communication.

For example, a query might request information about employees who are managers. If the optimizer statistics indicate that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be more efficient than a full table scan.
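
A small sketch of how to inspect the optimizer's decision (the employees table and its columns are hypothetical); most engines, including Hive, MySQL, and PostgreSQL, expose the chosen plan through an EXPLAIN statement, while Oracle uses EXPLAIN PLAN FOR:

-- Shows the plan the optimizer chose (e.g. full table scan vs. index or
-- partition access) without actually executing the query
EXPLAIN
SELECT id, name
FROM employees
WHERE is_manager = true;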

Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a better position than the user to determine the optimal method of statement execution.

For this reason, all SQL statements use the optimizer.