How a database system schedules and executes queries has a profound impact on its performance. This article briefly outlines the execution stage of the query process, in the hope of helping readers better understand how a query engine works.
First, let's review the basic flow of a query.
As shown in the figure above, a query typically goes through several stages:
The Volcano model was proposed and became widely known with the 1990 publication of "Volcano, an Extensible and Parallel Query Evaluation System".
Volcano is a row-based, pull-style iteration model that is simple yet elegant. Control commands for pulling data are passed from the top-level operator down to the bottom of the execution tree, opposite to the direction of data flow.
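The pull model described above can be sketched in a few lines of Rust. This is a minimal illustration of the Volcano iterator interface, not Databend's actual code; the names `Operator`, `Scan`, and `Filter` are invented for the example. The root calls `next()` on its child, which calls `next()` on its own child, and rows flow back up one at a time.

```rust
// A minimal sketch of the Volcano pull model: every operator exposes
// next(), and control flows top-down while rows flow bottom-up.
// Names (Operator, Scan, Filter) are illustrative, not Databend's API.

type Row = Vec<i64>;

trait Operator {
    /// Pull the next row from this operator, or None when exhausted.
    fn next(&mut self) -> Option<Row>;
}

/// Leaf operator: yields rows from an in-memory table.
struct Scan {
    rows: std::vec::IntoIter<Row>,
}

impl Operator for Scan {
    fn next(&mut self) -> Option<Row> {
        self.rows.next()
    }
}

/// Filter operator: repeatedly pulls from its child until a row passes.
struct Filter<P: FnMut(&Row) -> bool> {
    child: Box<dyn Operator>,
    predicate: P,
}

impl<P: FnMut(&Row) -> bool> Operator for Filter<P> {
    fn next(&mut self) -> Option<Row> {
        while let Some(row) = self.child.next() {
            if (self.predicate)(&row) {
                return Some(row);
            }
        }
        None
    }
}

/// Drive the tree from the root: pull until the source is exhausted.
fn run(mut root: Box<dyn Operator>) -> Vec<Row> {
    let mut out = Vec::new();
    while let Some(row) = root.next() {
        out.push(row);
    }
    out
}

fn main() {
    let scan = Box::new(Scan {
        rows: vec![vec![1], vec![2], vec![3], vec![4]].into_iter(),
    });
    let filter = Box::new(Filter {
        child: scan,
        predicate: |row: &Row| row[0] % 2 == 0,
    });
    let result = run(filter);
    assert_eq!(result, vec![vec![2], vec![4]]);
}
```

Note the per-row virtual call to `next()`: elegant, but exactly the interpretation overhead that vectorized and compiled execution later set out to eliminate.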
Although it was limited by the hardware of its era (little CPU parallelism, small memory capacity, low IO efficiency), the Volcano model's design still deserves study, and its shadow can be seen in some state-of-the-art executor designs today.
Since then, memory capacity has grown so rapidly that data can often reside entirely in memory, and workloads have shifted from IO-bound to memory-bound. Meanwhile, with single-core performance hitting a bottleneck, multi-core capability has become increasingly important, demanding closer attention to CPU execution efficiency. Against this background, vectorized execution and compiled execution have begun to shine.
The executor of Databend mainly draws on the paper "Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age".
"Morsel" means "small piece", which implies that tasks will be decomposed into a series of operators with appropriate size and dynamic adjustability, such as expression calculation and aggregation function calculation. Morsel-Driven Parallelism provides an adaptive scheduling and execution scheme to determine task parallelism at runtime, execute operations in a pipeline manner, and minimize cross-domain data access while implementing load balancing by scheduling strategy to ensure data localization.
Like an automobile assembly line, efficient query execution requires close collaboration among many components.
Therefore, it is necessary to introduce a dispatcher to allocate computing resources and control the parallel pipelines. The dispatcher maintains the pipeline jobs submitted by each query; each task corresponds to a sub-plan of the query that needs to be processed. When distributing tasks to the thread pool, the dispatcher follows scheduling strategies based on factors such as task execution status and resource usage to decide when each CPU should execute which transform.
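A toy version of such a dispatcher can be sketched under the simplifying assumption of a single global task queue drained by a pool of worker threads. The names `Task` and `dispatch` are invented for this illustration; a real dispatcher would also consult task state and resource usage before choosing what to run, which is omitted here.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Each "task" stands in for a transform over one sub-plan's data.
type Task = Box<dyn FnOnce() -> i64 + Send>;

/// Drain a global queue of tasks with a fixed pool of worker threads,
/// returning the combined result. This models only the simplest part
/// of a dispatcher: deciding which thread runs which task next.
fn dispatch(tasks: Vec<Task>, workers: usize) -> i64 {
    let queue = Arc::new(Mutex::new(tasks.into_iter().collect::<VecDeque<_>>()));
    let mut handles = Vec::new();
    for _ in 0..workers {
        let queue = Arc::clone(&queue);
        handles.push(thread::spawn(move || {
            let mut local = 0i64;
            loop {
                // Pop the next runnable task; the lock guard is dropped
                // before the task executes, so workers run in parallel.
                let task = queue.lock().unwrap().pop_front();
                match task {
                    Some(t) => local += t(),
                    None => break,
                }
            }
            local
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let tasks: Vec<Task> = (1..=10i64).map(|i| Box::new(move || i) as Task).collect();
    assert_eq!(dispatch(tasks, 4), 55);
}
```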
The research on morsel-driven parallelism focuses not only on improving the execution framework but also covers parallel algorithms for operators such as hash join, grouping/aggregation, and sorting.
Guided by this morsel-wise thinking, the morsel-driven execution framework addresses the load-balancing problems of the multi-core era, including thread synchronization for NUMA-local memory access and elastic resource scheduling.
Vectorized execution has been popular since MonetDB/X100 (Vectorwise), and the paper "MonetDB/X100: Hyper-Pipelining Query Execution" has become a must-read. Since 2010, most OLAP systems have organized data in column-based storage.
The figure on the left can be regarded as a sample of column-based storage, where the data in the same column forms a contiguous structure in memory.
OLAP systems usually process queries over large amounts of data, so a column-based storage scheme has natural advantages for IO efficiency:
The advantage of vectorized execution is that it makes full use of the CPU cache, enabling more efficient analytical query engines:
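To make the batch-at-a-time idea concrete, here is a generic sketch (not Databend's engine; `ColumnBatch`, `eval_gt`, and `sum_selected` are invented names). Instead of evaluating the predicate row by row through virtual calls, each step processes an entire column slice in a tight loop that stays in cache and that the CPU can pipeline.

```rust
/// A minimal columnar batch: each field is a contiguous column vector.
struct ColumnBatch {
    price: Vec<f64>,
    quantity: Vec<i64>,
}

/// Evaluate `price > threshold` over the whole batch at once,
/// producing a selection bitmap rather than branching per row.
fn eval_gt(batch: &ColumnBatch, threshold: f64) -> Vec<bool> {
    batch.price.iter().map(|&p| p > threshold).collect()
}

/// Sum `quantity` for the selected rows, again in one tight loop
/// over two contiguous arrays.
fn sum_selected(batch: &ColumnBatch, selection: &[bool]) -> i64 {
    batch
        .quantity
        .iter()
        .zip(selection)
        .filter(|(_, &keep)| keep)
        .map(|(&q, _)| q)
        .sum()
}

fn main() {
    let batch = ColumnBatch {
        price: vec![5.0, 15.0, 25.0, 8.0],
        quantity: vec![1, 2, 3, 4],
    };
    let sel = eval_gt(&batch, 10.0);
    assert_eq!(sel, vec![false, true, true, false]);
    assert_eq!(sum_selected(&batch, &sel), 5);
}
```

The contrast with the Volcano sketch earlier is the unit of work: one function call per batch of values rather than one virtual call per row.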
When talking about vectorized execution, SIMD (Single Instruction, Multiple Data) cannot be avoided. The traditional way is to detect the available instruction sets and then hand-write intrinsics; Databend, however, uses the following methods:
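For contrast with the hand-written intrinsics approach, here is a generic sketch of one common alternative in Rust: writing simple, branch-light loops over contiguous slices that LLVM can auto-vectorize. This is an illustration of the general technique, not necessarily Databend's actual code; `add_columns` is an invented name.

```rust
/// Element-wise addition of two columns. A plain loop over equal-length
/// contiguous slices, with no data-dependent branches, is a shape the
/// compiler can often turn into SIMD instructions automatically.
fn add_columns(a: &[i64], b: &[i64], out: &mut Vec<i64>) {
    assert_eq!(a.len(), b.len());
    out.clear();
    out.extend(a.iter().zip(b).map(|(&x, &y)| x + y));
}

fn main() {
    let a = vec![1, 2, 3, 4];
    let b = vec![10, 20, 30, 40];
    let mut out = Vec::new();
    add_columns(&a, &b, &mut out);
    assert_eq!(out, vec![11, 22, 33, 44]);
}
```

The appeal of this style is portability: the same source compiles to SSE, AVX, or NEON as appropriate, without per-platform intrinsic code.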
The following content is compiled from a conversation between @fkuner and @zhang2014.
Answer: In Databend, NUMA locality is achieved by making the aggregator processor core-exclusive. The 1:1 correspondence between pipeline size and executor worker size also serves NUMA locality. During scheduling, thread switching is avoided as much as possible: a task is scheduled from start to finish, and newly spawned branch tasks are placed into the global scheduler.
Answer: Databend schedules pipelines by observing data readiness: if the data is not ready, the pipeline is not scheduled. As for IO, it is scheduled onto the global IO runtime and awaited via Rust async.
Answer: In the paper's model, one task processes one pipeline, and a pipeline can consist of multiple processors. In Databend, task stealing can happen at the processor level, so when a task can be split into processor-level segments, scheduling becomes more flexible. Although processors are the unit handled by the scheduler, at runtime a processor corresponds to a specific data block, similar to the job splitting within a pipeline described in the paper.
Answer: Ideally, execution threads should not interfere with each other, but tasks may be skewed; when one thread completes early, it may steal the remaining tasks to speed up overall execution. During scheduling, each executor has a local context, and nothing is shared among threads.
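The stealing behavior described in this answer can be sketched as follows. This is a toy illustration (not Databend's scheduler; `run_with_stealing` is an invented name): each worker owns a local queue and pops from its front, and a worker whose queue is empty steals from the back of another worker's queue, so a skewed workload still finishes in balanced time.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Each worker drains its own queue first (local context, no sharing in
/// the common case); an idle worker steals from the back of a victim's
/// queue. Tasks are modeled as i64 values and the result is their sum.
fn run_with_stealing(queues: Vec<Vec<i64>>) -> i64 {
    let queues: Vec<Arc<Mutex<VecDeque<i64>>>> = queues
        .into_iter()
        .map(|q| Arc::new(Mutex::new(q.into_iter().collect())))
        .collect();
    let mut handles = Vec::new();
    for me in 0..queues.len() {
        let queues = queues.clone();
        handles.push(thread::spawn(move || {
            let mut local_sum = 0i64;
            loop {
                // Prefer our own queue.
                if let Some(task) = queues[me].lock().unwrap().pop_front() {
                    local_sum += task;
                    continue;
                }
                // Otherwise try to steal from the back of another queue.
                let stolen = queues
                    .iter()
                    .enumerate()
                    .filter(|(i, _)| *i != me)
                    .find_map(|(_, q)| q.lock().unwrap().pop_back());
                match stolen {
                    Some(task) => local_sum += task,
                    None => break, // every queue is empty
                }
            }
            local_sum
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Worker 0 gets a skewed load; worker 1 helps by stealing from it.
    let total = run_with_stealing(vec![(1..=100).collect(), vec![1000]]);
    assert_eq!(total, 5050 + 1000);
}
```

Stealing from the *back* of the victim's queue is the classic choice: it minimizes contention with the owner, who works from the front.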