In a star schema each logical dimension is denormalized into one table,
while in a snowflake, at least some of the dimensions are normalized.
| Star Schema | Snowflake Schema |
|---|---|
| denormalized structure | normalized structure |
| high redundancy | low redundancy |
| uses more space | uses less space |
| simple design | complex design |
| easy to understand | hard to understand |
| simple joins | complex joins |
| faster query execution | slower query execution |
| faster cube processing | slower cube processing |
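As a minimal sketch (table and column names are made up for illustration), a product dimension might look like this in each design:

-- Star schema: one denormalized product dimension
-- (category and brand attributes are repeated on every product row)
CREATE TABLE dim_product (
  product_id    INT,
  product_name  STRING,
  category_name STRING,
  brand_name    STRING
);

-- Snowflake schema: the same dimension normalized into related tables,
-- with category_id and brand_id acting as foreign keys
CREATE TABLE dim_product_sf (
  product_id   INT,
  product_name STRING,
  category_id  INT,
  brand_id     INT
);
CREATE TABLE dim_category (category_id INT, category_name STRING);
CREATE TABLE dim_brand (brand_id INT, brand_name STRING);

A query against the star version joins the fact table to dim_product only, while the snowflake version also needs joins to dim_category and dim_brand, which is the "simple joins" versus "complex joins" trade-off in the table above.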
The multidimensional data model is an integral part of On-Line Analytical Processing. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not during batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time.
It is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes.
The simplicity of the model is inherent because it defines objects that represent real-world business entities. Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.
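For instance, over a hypothetical sales table with country and department dimensions and a sales_amount measure, HiveQL's grouping extensions can aggregate across every combination of the dimensions, including subtotals and the grand total:

-- Aggregate the measure for every combination of the two dimensions
SELECT country, department, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY country, department WITH CUBE;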
The Apache Hive™ data warehouse software is a project built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Hive Transactions - documentation
With the addition of transactions in Hive 0.13 it is now possible to provide full ACID semantics at the row level, so that one application can add rows while another reads from the same partition without interfering with each other. If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional=true" must be set on that table. Note, once a table has been defined as an ACID table via TBLPROPERTIES ("transactional"="true"), it cannot be converted back to a non-ACID table, i.e., changing TBLPROPERTIES ("transactional"="false") is not allowed.
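A minimal sketch (table and column names are made up; it also assumes the ACID-related server settings, such as the DbTxnManager transaction manager, are already enabled; in the Hive versions this note refers to, ACID tables must also be bucketed and stored as ORC):

CREATE TABLE employee_acid (id INT, name STRING, salary DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- Row-level writes are now allowed on the table
INSERT INTO employee_acid VALUES (1, 'Alice', 50000.0);
UPDATE employee_acid SET salary = 55000.0 WHERE id = 1;
DELETE FROM employee_acid WHERE id = 1;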
UI – The user interface for users to submit queries and other operations to the system.
Driver – The component which receives the queries.
Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the Metastore.
Metastore – The component that stores all the information about the structure of various tables like their schema (column types), location on HDFS, and partitions in a relational database.
Execution Engine – The component which executes the execution plan created by the Compiler on the appropriate system components.
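To inspect the execution plan that the Compiler produces before the Execution Engine runs it, a query can be prefixed with EXPLAIN (the employee table is illustrative):

-- Shows the stage dependencies and the operator plan for each stage
EXPLAIN
SELECT department, COUNT(*) AS cnt
FROM employee
GROUP BY department;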
Managed and external tables differ in how Hive treats their data, most visibly when you drop a table:

| Managed Tables | External Tables |
|---|---|
| on drop, Hive deletes both data and metadata | on drop, Hive deletes only metadata |
| the default table type | require the EXTERNAL keyword |
| data can be deleted accidentally | data is protected against accidental drops |
| on load, Hive moves data into the warehouse directory | on load, Hive uses data present in remote locations |
| Hive manages and controls the data | Hive does not manage or control the data |
| suited for temporary data | suited for long-term data |
| TRUNCATE, ACID, caching, and materialized views work | these features do not work |
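A minimal sketch of both table types (table names and the HDFS path are illustrative):

-- Managed table: Hive owns both the metadata and the data files in its warehouse directory
CREATE TABLE managed_logs (ts STRING, msg STRING);

-- External table: Hive only records the metadata and reads files that already exist
-- at the given location; DROP TABLE removes the metadata but leaves the files in place
CREATE EXTERNAL TABLE external_logs (ts STRING, msg STRING)
LOCATION '/data/raw/logs';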
Data in Apache Hive can be categorized into tables, partitions, and buckets. Both partitioning and bucketing distribute data. Hive organizes tables into partitions and subdivides partitions into buckets.
Apache Hive organizes tables into partitions to group data of the same type together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Partitions make queries on slices of the data faster.
A partition in Hive is simply a subdirectory of the table directory. For example, say our table holds data for three departments - Technical, Marketing, and Sales. We will then have three partitions in total, one for each department, and each department's data lives in a separate subdirectory under the table directory. Queries about Technical employees only have to look through the data present in the Technical partition. Scanning only the relevant partitioned data instead of the whole data set reduces query latency.
Partitioning data is often used for distributing load horizontally. This has a performance benefit and helps organize data in a logical fashion. For example, if we are dealing with a large employee table and often run queries that restrict to a particular country or department, we can partition the Hive table for a faster response.
However, a design that creates too many partitions may optimize some queries but be detrimental to other important ones. Another drawback of too many partitions is the large number of Hadoop files and directories created unnecessarily, and the resulting overhead on the NameNode, which must keep all metadata for the file system in memory.
PARTITIONED BY (partition1 data_type, partition2 data_type);
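For example, the department scenario above could be modelled as follows (a sketch with illustrative names):

CREATE TABLE employee (
  id      INT,
  name    STRING,
  country STRING
)
PARTITIONED BY (department STRING);

-- Only the files under the department='Technical' subdirectory are scanned
SELECT id, name
FROM employee
WHERE department = 'Technical';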
In Hive, table partitions may be subdivided into buckets based on a hash of a column in the table, which gives the data extra structure that can be used for more efficient queries. The hash function depends on the type of the bucketing column. We can use bucketing when partitioning alone becomes impractical. It is a technique for decomposing data sets into more manageable parts that provide faster query responses. We can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be done manually.
For example, suppose that using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets.
Records with the same employee_id will always be stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket will hold many employee_id values.
The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sample. Bucketing also aids in doing efficient map-side joins.
Bucketing needs to be enabled:
set hive.enforce.bucketing = true;
For dynamically partitioned inserts into bucketed tables, the settings below need to be enabled as well:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
[PARTITIONED BY (partition1 data_type, partition2 data_type, …)]
CLUSTERED BY (column_name1, column_name2, …)
[SORTED BY (column_name [ASC|DESC], …)]
INTO <num_buckets> BUCKETS;
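Putting the pieces together, a hedged sketch of a partitioned and bucketed employee table, with a sampling query that reads a single bucket (names and the bucket count are illustrative):

CREATE TABLE employee_bucketed (
  employee_id INT,
  name        STRING
)
PARTITIONED BY (ingest_date STRING)
CLUSTERED BY (employee_id) INTO 32 BUCKETS;

-- Sample roughly 1/32 of the data by reading a single bucket
SELECT * FROM employee_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON employee_id) e;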
Apache Kylin™ is an open source distributed analytical engine designed to provide OLAP (Online Analytical Processing) capability in the big data era.
By renovating the multi-dimensional cube and precalculation technology on Hadoop and Spark, Kylin is able to achieve near constant query speed regardless of the ever-growing data volume. Reducing query latency from minutes to sub-second, Kylin brings online analytics back to big data.
Apache Kylin™ can also integrate with your favorite BI tools, such as Tableau and Power BI, to enable BI on Hadoop.
Hive Kafka Storage Handler allows users to connect to, analyze, and transform Kafka topics via Hive. There are two table properties that must be set: kafka.topic and kafka.bootstrap.servers.
A Hive table is mapped onto a Kafka topic as follows:
-- table and column names are illustrative; the columns must match the fields of the messages
CREATE EXTERNAL TABLE kafka_table (
  `timestamp` TIMESTAMP, `page` STRING, `user` STRING)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "test-topic",
  "kafka.bootstrap.servers" = "localhost:9092");
Both properties are mandatory. On the write path, if the topic does not exist, it will be created, provided the Kafka broker's admin policy allows automatic topic creation.
By default, the serializer and deserializer use JSON.
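Once the table is defined, the topic can be queried like any other Hive table. A small sketch, assuming the kafka_table above and the Kafka metadata columns (such as __partition, __offset, and __timestamp) that the storage handler exposes alongside the message fields:

-- Read a few records together with their position in the topic
SELECT `__partition`, `__offset`, `__timestamp`, `user`, `page`
FROM kafka_table
LIMIT 10;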
A consumer reads from a single partition, allowing you to scale throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic. If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from. If you have more partitions than consumers then consumers will receive messages from multiple partitions. If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
The following picture from the Kafka documentation describes the situation with multiple partitions of a single topic. Server 1 holds partitions 0 and 3 and Server 2 holds partitions 1 and 2. We have two consumer groups, A and B. A is made up of two consumers and B is made up of four consumers. Consumer Group A has two consumers of four partitions — each consumer reads from two partitions. Consumer Group B, on the other hand, has the same number of consumers as partitions and each consumer reads from exactly one partition.
MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The reduce tasks are always performed after the map tasks.
The Mapper task is the first phase of processing; it processes each input record and generates an intermediate key-value pair. The Hadoop Mapper stores this intermediate output on the local disk. The output key-value pairs can be completely different from the input pair. Before the output of each Mapper task is written, it is partitioned on the basis of the key and then sorted. This partitioning ensures that all the values for each key are grouped together.
The Mapper only understands key-value pairs of data, so before data is passed to it, it must first be converted into key-value pairs.
Combiner, also known as a semi-reducer, is an optional class that is used in between the Mapper and the Reducer to reduce the volume of data transfer between them. It operates by accepting the inputs from the Mapper class and thereafter passing the output key-value pairs to the Reducer class.
The main function of the Combiner is to summarize the Mapper output records with the same key as the output of the Mapper is usually large. The output (key-value collection) of the Combiner will be sent over the network to the actual Reducer task as an input.
Reducer takes the output of the Mapper (or Combiner; any intermediate key-value pair) and processes each of them to generate the final output of the MapReduce task which is then stored in HDFS. It usually does the final aggregation and summation.
The Hadoop Reducer takes the set of intermediate key-value pairs produced by the Mapper as input and runs a Reducer function on each of them. One can aggregate, filter, and combine this data (key, value) in a number of ways for a wide range of processing. The Reducer first processes the intermediate values for a particular key generated by the map function and then generates the output.
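Since Hive itself compiles queries into MapReduce jobs (when the MapReduce execution engine is used), a simple GROUP BY makes the phases concrete. This is a loose, illustrative sketch with a made-up employee table: the map phase emits department keys with partial counts, map-side aggregation plays the role of a Combiner, and the reduce phase produces the final counts per department:

-- enable map-side partial aggregation, which acts like a Combiner
set hive.map.aggr = true;

SELECT department, COUNT(*) AS employees
FROM employee
GROUP BY department;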
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Wide column stores, also called extensible record stores, store data in records with an ability to hold very large numbers of dynamic columns. Their architecture uses persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale). Column Family stores do not follow the relational model, and they aren’t optimized for joins. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores. They share with document stores the characteristic of being schema-free, however, the implementation is very different. Several columns make a column family with multiple rows and the rows may not have the same number of columns. In simple terms, the information stored in several rows in an ordinary relational database can fit in one column in a columnar database. The columns can also have different names and datatypes.
The most popular examples of wide column stores are Apache Cassandra, Apache HBase, Microsoft Azure Cosmos DB, Cloud Bigtable.
Wide Column databases are not the preferred choice for applications with ad-hoc query patterns, high level aggregations and changing database requirements. This type of data store does not keep good data lineage.
In Column Family databases data is stored in cells of columns, grouped into column families. These databases are implemented as multidimensional nested sorted map of maps. The innermost map constitutes a version of the data identified by a timestamp and stored in a cell. A cell is mapped to a column which in turn is mapped to a column family. A set of column families are identified using a row key.
Read and write is performed using the row key on sets of columns. These columns are stored as a continuous entry on the disk enhancing performance.
Column store databases use a concept called a keyspace. A keyspace is kind of like a schema in the relational model. The keyspace contains all the column families (kind of like tables in the relational model), which contain rows, which contain columns.
Key benefits of column store databases include faster performance in load, search, and aggregate functions. They are scalable and can read billions of records in seconds. They are also efficient in data compression and partitioning.
Moreover, each column does not span beyond its row. Wide columnar databases are mainly used in highly analytical and query-intensive environments. This includes areas where large volumes of data items require aggregate computing. For example, wide columnar databases are suitable for data mining, business intelligence (BI), data warehouses, and decision support.
Apache HBase™ is an open-source, distributed, scalable NoSQL big data store that allows access to billions of rows of big data in seconds.
HBase is a column-oriented NoSQL database. Although it looks similar to a relational database which contains rows and columns, it is not a relational database. Relational databases are row-oriented while HBase is column-oriented. It means that entries in a column are stored in contiguous locations on disks.
We use the column-oriented approach when we need to process and analyze large sets of semi-structured or unstructured data, as in Online Analytical Processing (data mining, data warehousing, and so on). Online Transaction Processing, as in banking and finance domains that handle structured data and require transactional (ACID) properties, uses the row-oriented approach.
| Command | Description |
|---|---|
| create <'tablename'>, <'columnfamilyname'> | create a table in HBase with the specified name, according to the dictionary of column family specifications |
| list | display all the tables that are present or created in HBase |
| describe <'tablename'> | give more information about the column families present in the mentioned table |
| disable <'tablename'> | start disabling the named table |
| enable <'tablename'> | start enabling the named table |
| drop <'tablename'> | drop the table present in HBase |
| alter <'tablename'>, NAME=> <'columnfamilyname'>, VERSIONS=> <'versions'> | alter the column family schema |
| put <'tablename'>, <'rowname'>, <'columnname'>, <'value'> | put a cell 'value' at the specified table, row, and column |
| get <'tablename'>, <'rowname'> | get a row or cell contents present in the table |
| delete <'tablename'>, <'rowname'>, <'columnname'> | delete the cell value at the specified table, row, and column |
| truncate <'tablename'> | truncate the table; the schema will be kept but not the records |
| scan <'tablename'> | scan the entire table and display the table contents |
We can pass several optional specifications to the scan command to get more information about the tables present in the system. Scan specifications may include many attributes:
scan 'mytable', {STARTROW => 'abc', ENDROW => 'abd'}
The query optimizer (called simply the optimizer) is built-in database software that determines the most efficient method for a SQL statement to access requested data.
It chooses the plan with the lowest cost among all considered candidate plans. The optimizer uses available statistics to calculate the cost. For a specific query in a given environment, the cost computation accounts for factors of query execution such as I/O, CPU, and communication.
For example, a query might request information about employees who are managers. If the optimizer statistics indicate that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be more efficient than a full table scan.
Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a better position than the user to determine the optimal method of statement execution.
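As a hedged illustration in plain SQL (the employees table and the emp_job_idx index are made up), the manager example boils down to choosing a plan for a statement like this:

-- Candidate plans the optimizer weighs by cost:
--   1. a full table scan of employees
--   2. a range scan of the emp_job_idx index, then table access by rowid
-- If statistics say about 80% of employees are managers, plan 1 is usually cheaper;
-- if only a few employees are managers, plan 2 wins.
SELECT employee_id, last_name
FROM employees
WHERE job_title = 'MANAGER';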