How a database system schedules and executes queries has a profound impact on its performance. This article briefly walks through the execution part of the query process, in the hope of helping readers better understand how a query engine works.

Basic Process of Querying

First, let's review the basic process of a query.

[Figure: the basic flow of query processing]

As shown in the above figure, queries often need to go through several stages:

  • Parse SQL syntax to form an AST (Abstract Syntax Tree).
  • Perform semantic analysis on it through the Binder and generate an initial Logical Plan.
  • The Optimizer then rewrites and optimizes the Logical Plan, finally producing an executable Physical Plan.
  • The Physical Plan is translated into an executable Pipeline.
  • The Pipeline is then handed over to the Processors for computation, as sketched below.
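
The following Rust sketch shows how these stages fit together. All type and function names here are illustrative placeholders, not Databend's actual API.

```rust
// Illustrative placeholders for the artifact each stage produces.
struct Ast;          // output of the SQL parser
struct LogicalPlan;  // output of the Binder
struct PhysicalPlan; // output of the Optimizer
struct Pipeline;     // executable form, run by the Processors

fn parse(_sql: &str) -> Ast { Ast }                              // syntax analysis
fn bind(_ast: Ast) -> LogicalPlan { LogicalPlan }                // semantic analysis
fn optimize(_plan: LogicalPlan) -> PhysicalPlan { PhysicalPlan } // rewrite + optimize
fn build_pipeline(_plan: PhysicalPlan) -> Pipeline { Pipeline }  // translate to pipeline
fn execute(_pipeline: Pipeline) { /* hand over to processors */ }

fn main() {
    let ast = parse("SELECT 1");
    let logical = bind(ast);
    let physical = optimize(logical);
    let pipeline = build_pipeline(physical);
    execute(pipeline);
}
```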

Volcano Model

The Volcano Model was proposed around 1990 and became well known through the paper "Volcano, an Extensible and Parallel Query Evaluation System".

Volcano is a row-based stream iteration model that is simple yet elegant. The control commands for pulling data are passed sequentially from the top-level operator to the bottom of the execution tree, opposite to the direction of data flow.
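
A minimal Rust sketch of this iterator model follows. The `Operator` trait and the `Scan`/`Filter` operators are illustrative; real implementations typically also expose open() and close().

```rust
type Row = Vec<i64>;

/// The standard pull interface every Volcano operator implements.
trait Operator {
    /// Pull the next row from this operator; None means exhausted.
    fn next(&mut self) -> Option<Row>;
}

/// Leaf operator: scans an in-memory table.
struct Scan {
    rows: std::vec::IntoIter<Row>,
}

impl Operator for Scan {
    fn next(&mut self) -> Option<Row> {
        self.rows.next()
    }
}

/// Filter pulls rows from its child and keeps those matching a predicate:
/// control flows top-down via next(), while data flows bottom-up.
struct Filter<P: FnMut(&Row) -> bool> {
    child: Box<dyn Operator>,
    predicate: P,
}

impl<P: FnMut(&Row) -> bool> Operator for Filter<P> {
    fn next(&mut self) -> Option<Row> {
        while let Some(row) = self.child.next() {
            if (self.predicate)(&row) {
                return Some(row);
            }
        }
        None
    }
}

fn main() {
    let scan = Scan { rows: vec![vec![1], vec![2], vec![3]].into_iter() };
    let mut root = Filter { child: Box::new(scan), predicate: |r: &Row| r[0] % 2 == 1 };
    // The top-level consumer drives execution one row at a time.
    while let Some(row) = root.next() {
        println!("{:?}", row); // [1], then [3]
    }
}
```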

Advantages

  • It abstracts data streams behind a set of standard interfaces, fully decoupling operators from one another. The model is simple and easy to extend.
  • The framework handles operator composition and the overall data-processing flow, so each operator implementation only needs to focus on its own processing logic. This isolates operators from execution strategies and gives the model strong flexibility.

Disadvantages

  • Pulling data requires extra control operations to drive data flow between operators, resulting in a large number of redundant control instructions.
  • The iterator model means many next() calls between operators. These nested virtual function calls are unfriendly to branch prediction, which can disrupt the CPU pipeline and hurt cache and TLB efficiency.

Summary

Today, memory capacity has grown so much that data can often be held entirely in memory, and workloads have shifted from IO-bound to memory-bound. Meanwhile, single-core performance faces a bottleneck, so multi-core capability, and with it CPU execution efficiency, demands more attention. This is where vectorized execution and compiled execution begin to shine.

Although the Volcano Model was constrained by the hardware of its time (limited CPU parallelism, small memory, low IO efficiency), its design still deserves study: traces of it can be seen in some state-of-the-art executor designs today.

Morsel-Driven Parallelism

The executor of Databend mainly draws on the paper "Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age".

[Figure: the Morsel-Driven Parallelism execution framework]

Targeting Multi-core Architecture

"Morsel" means "small piece", which implies that tasks will be decomposed into a series of operators with appropriate size and dynamic adjustability, such as expression calculation and aggregation function calculation. Morsel-Driven Parallelism provides an adaptive scheduling and execution scheme to determine task parallelism at runtime, execute operations in a pipeline manner, and minimize cross-domain data access while implementing load balancing by scheduling strategy to ensure data localization.

Reasonable Task Distribution

Like an automobile assembly line, efficient query execution requires close collaboration among many different components.

Therefore, a dispatcher is introduced to allocate computing resources and control the parallel pipelines. The dispatcher maintains the Pipeline Jobs submitted by each query, where each task corresponds to a sub-plan of the query that needs processing. When distributing tasks to the thread pool, the dispatcher follows scheduling strategies based on factors such as task execution state and resource usage to decide when a CPU should execute which transform.
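
The sketch below illustrates this decision loop with a toy dispatcher (all names are hypothetical): it tracks per-task state and only hands out tasks whose inputs are ready. A real dispatcher would additionally weigh resource usage, fairness across queries, and data locality.

```rust
#[derive(PartialEq)]
enum TaskState {
    WaitingForData, // e.g. upstream IO not finished
    Ready,          // can be executed now
    Finished,
}

struct Task {
    query_id: u64,
    state: TaskState,
}

struct Dispatcher {
    tasks: Vec<Task>,
}

impl Dispatcher {
    /// Pick the next runnable task for the thread pool.
    fn next_runnable(&mut self) -> Option<&mut Task> {
        self.tasks.iter_mut().find(|t| t.state == TaskState::Ready)
    }
}

fn main() {
    let mut dispatcher = Dispatcher {
        tasks: vec![
            Task { query_id: 1, state: TaskState::WaitingForData },
            Task { query_id: 2, state: TaskState::Ready },
        ],
    };
    if let Some(task) = dispatcher.next_runnable() {
        println!("running a task of query {}", task.query_id);
        task.state = TaskState::Finished;
    }
}
```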

Morsel Wise

The research on Morsel-Driven Parallelism not only focuses on improving execution frameworks but also covers parallel optimization algorithms such as Hash Join, Grouping/Aggregation, and Sorting.

Guided by this morsel-wise thinking, the Morsel-Driven Parallelism execution framework addresses the load-balancing problems of the multi-core era, including thread synchronization during local memory access and elastic resource scheduling.
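
To give a flavor of these morsel-wise algorithms, here is a simplified grouping/aggregation sketch in the paper's spirit (not its exact algorithm, which uses partitioned hash tables): each worker pre-aggregates its own morsel into a thread-local hash table without synchronization, and the local tables are merged at the end.

```rust
use std::collections::HashMap;
use std::thread;

fn main() {
    // (group key, value) pairs split into two morsels, one per worker.
    let morsels: Vec<Vec<(u32, i64)>> = vec![
        vec![(1, 10), (2, 20), (1, 5)],
        vec![(2, 1), (3, 7)],
    ];

    // Phase 1: each worker aggregates its morsel into a thread-local
    // hash table, so no synchronization is needed while scanning.
    let handles: Vec<_> = morsels
        .into_iter()
        .map(|morsel| {
            thread::spawn(move || {
                let mut local: HashMap<u32, i64> = HashMap::new();
                for (key, value) in morsel {
                    *local.entry(key).or_insert(0) += value;
                }
                local
            })
        })
        .collect();

    // Phase 2: merge the thread-local tables into the final result.
    let mut merged: HashMap<u32, i64> = HashMap::new();
    for handle in handles {
        for (key, value) in handle.join().unwrap() {
            *merged.entry(key).or_insert(0) += value;
        }
    }
    println!("{:?}", merged); // {1: 15, 2: 21, 3: 7} in some order
}
```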

Column-Based Storage and Vectorized Execution

Vectorized execution has been popular since MonetDB/X100 (Vectorwise), and the paper "MonetDB/X100: Hyper-Pipelining Query Execution" has become a must-read. Since 2010, most OLAP systems have organized data with column-based storage.

[Figure: sample data layout; the left side shows column-based storage]

The left side of the figure can be regarded as a sample of column-based storage, where the data of a single column forms a contiguous region in memory.

Column-Based Storage

OLAP systems usually need to process queries involving large amounts of data. Adopting a column-based storage scheme has natural advantages in improving IO efficiency:

  • Only the required columns are read; the other columns incur no IO at all, avoiding unnecessary IO overhead.
  • A single column often contains many duplicate values, so compression ratios are high, further saving IO resources. (See the layout sketch below.)
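
A small Rust sketch of the two layouts (illustrative types, not a real storage format): in the columnar form, all values of one column sit contiguously in memory, so a query that touches only `price` never loads `name` bytes.

```rust
// Row-based layout: one record per row ("array of structs").
#[allow(dead_code)]
struct RowBased {
    rows: Vec<(String /* name */, f64 /* price */)>,
}

// Column-based layout: one contiguous vector per column ("struct of arrays").
struct ColumnBased {
    names: Vec<String>,
    prices: Vec<f64>,
}

fn main() {
    let table = ColumnBased {
        names: vec!["a".into(), "b".into(), "c".into()],
        prices: vec![1.5, 2.0, 4.5],
    };
    // Scanning one column is a tight loop over contiguous memory that
    // never touches the bytes of the `names` column.
    let total: f64 = table.prices.iter().sum();
    println!("sum(price) = {total} over {} rows", table.names.len());
}
```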

Vectorized Execution

The advantage of vectorized execution is that it makes full use of the CPU cache, enabling more efficient analytical query engines:

  • Data is contiguous in memory and read on demand, which also reduces unnecessary cache occupancy.
  • Passing data between operators in batches reduces the number of calls and amortizes per-call overhead across many rows, as shown below.
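
A minimal sketch of batch-at-a-time execution (the operator and batch size are illustrative): instead of one call per row, an operator consumes a whole batch per call, and the resulting tight, branch-light loop is also friendly to the compiler's auto-vectorizer.

```rust
const BATCH_SIZE: usize = 1024; // illustrative batch size

/// A vectorized operator: it consumes a whole batch per call, so the
/// per-call overhead is paid once per 1024 values instead of once per row.
fn sum_if_positive(batch: &[i64]) -> i64 {
    batch.iter().filter(|&&v| v > 0).sum()
}

fn main() {
    let column: Vec<i64> = (-512..1536).collect();
    let mut total = 0i64;
    for batch in column.chunks(BATCH_SIZE) {
        total += sum_if_positive(batch);
    }
    println!("total = {total}");
}
```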

SIMD Optimization

When talking about vectorized execution, SIMD (Single Instruction, Multiple Data) cannot be avoided. The traditional way is to detect the available instruction sets and hand-write intrinsics; Databend, however, uses the following methods:

  • Use the Rust standard library's std::simd, which provides an abstract encapsulation of SIMD instructions, to write easy-to-understand code.
  • Rely on automatic vectorization: optimize code logic by reducing branches inside loops, making full use of the compiler's capabilities. (A sketch of both approaches follows this list.)
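
Here is a sketch of both approaches, with one assumption worth flagging: as of this writing, std::simd (portable SIMD) is only available on a nightly toolchain behind #![feature(portable_simd)].

```rust
#![feature(portable_simd)] // std::simd is nightly-only at the time of writing
use std::simd::i32x8;

/// Explicit SIMD via std::simd: add two slices eight lanes at a time.
/// (For brevity this sketch assumes lengths are a multiple of 8.)
fn simd_add(a: &[i32], b: &[i32], out: &mut [i32]) {
    for ((a, b), out) in a
        .chunks_exact(8)
        .zip(b.chunks_exact(8))
        .zip(out.chunks_exact_mut(8))
    {
        let sum = i32x8::from_slice(a) + i32x8::from_slice(b);
        sum.copy_to_slice(out);
    }
}

/// Auto-vectorization: a simple, branch-free loop that an optimizing
/// compiler can typically turn into SIMD instructions on its own.
fn auto_add(a: &[i32], b: &[i32], out: &mut [i32]) {
    for i in 0..out.len() {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = vec![1; 16];
    let b = vec![2; 16];
    let mut out = vec![0; 16];
    simd_add(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3));
    auto_add(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3));
}
```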

FAQs about Databend Query Execution

The following content is compiled from a conversation between @fkuner and @zhang2014.

How does Databend ensure NUMA locality?

Answer: NUMA locality is achieved by making the aggregator processor core-exclusive in Databend. The 1:1 correspondence between pipeline size and executor worker size also serves NUMA locality. During scheduling, thread switching is avoided as much as possible: a task is scheduled from start to finish, and newly generated branch tasks are placed in the global scheduler.

How are pipelines scheduled if they need to wait for IO?

Answer: Databend schedules pipelines by observing data readiness: if the data is not ready, the pipeline is not scheduled. As for IO, it is dispatched to the global IO runtime and awaited through Rust async. A sketch of this idea follows.
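
As a rough illustration of that separation (using tokio here purely as an example runtime; the names are not Databend's), compute threads hand IO work to a dedicated IO runtime instead of blocking on it themselves:

```rust
use std::time::Duration;

fn main() {
    // A dedicated runtime reserved for IO tasks (hypothetical setup).
    let io_runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    // The "compute" side submits an IO task without blocking a compute thread.
    let handle = io_runtime.spawn(async {
        // Stand-in for asynchronously reading a block from storage.
        tokio::time::sleep(Duration::from_millis(10)).await;
        vec![1u8, 2, 3]
    });

    // A real executor would keep running other ready pipelines here and only
    // re-schedule this one when the data arrives; block_on keeps the sketch short.
    let block = io_runtime.block_on(handle).unwrap();
    println!("read {} bytes", block.len());
}
```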

What is the relationship between tasks, pipelines, and processors?

Answer: In the paper's model, one task processes one pipeline, and one pipeline can consist of multiple processors. In Databend, task stealing can happen at the processor level, so when a task can be split into processor-level segments, scheduling becomes more flexible. Although processors are what the scheduler schedules, at runtime a processor corresponds to a specific data block, similar to the job splitting within a pipeline described in the paper.

How does Databend handle NUMA-local biased scheduling?

Answer: Ideally, execution threads should not interfere with each other, but tasks may be skewed; when one thread finishes early, it may steal the remaining tasks to speed up overall execution. During scheduling, each executor has a local context that is not shared among threads.