# ORCA Correlated Statistics
## Motivation:
Accurate cardinality estimation helps to drive better plan selection (e.g. join ordering decisions). Basic statistics in Postgres describes a single column and does not account for a synergetic relationship between columns.
Let's consider the following example:
```sql=
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
```
The data consists of 100 duplicates of 100 unique values where a and b are always equal. Let's examine the cardinality accuracy for a scan. If the predicate is on a single column, our estimation of 100 rows gathered is perfect.
```
test=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..431.18 rows=100 width=8) (actual rows=100 loops=1)
-> Seq Scan on t (cost=0.00..431.18 rows=34 width=8) (actual rows=100 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 3700
Optimizer: Pivotal Optimizer (GPORCA)
```
However, if the predicate is on multiple columns, our estimation is off by two orders of magnitude.
```
test=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 and b = 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..431.29 rows=2 width=8) (actual rows=100 loops=1)
-> Seq Scan on t (cost=0.00..431.29 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 3700
Optimizer: Pivotal Optimizer (GPORCA)
```
The issue here is that basic statistics cannot capture relataionships between columns and thus it assumes independence. In this example where a=1 is true of 1% the rows and b=1 is also true of 1% of the rows, then calculations assuming independence would apply 1% twice and estimate 0.1% of 10,000 rows overall (~1 row, though in this case we estimated 2).
Postgres introduced extended statistics in version 10 to address this issue. The goal of this proposal is to update ORCA to leverage the same framework in order to gain the benefits of extended statistics through the same interface as PLANNER.
## Scope:
There are 3 kinds of extended statistics supported in Postgres 12 (GPDB 7 base version). We should design a format that can store each of these types and that is extensible for future potential types.
1. "Functional dependencies" describes the extent to which the value of column A can be used to predict the value of column B in any given row.
```sql=
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
```
2. "Multivariate N-distinct counts" describes the number of distinct values when combining more than 1 column.
```sql=
CREATE STATISTICS stts2 (ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
```
3. "Multivariate MCV lists" describes the most column values when combining more than 1 column.
```sql=
CREATE STATISTICS stts3 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
```
Notice that in this case _multivariate MCV lists_ handles the same query as _functional dependencies_. However, there are pros/cons to each. MCV lists can handle more operators (not just equality). However, underlying representation also requires more complexity.
There are disadvantages to enabling too many extended statistics, namely that it can lead to catalog bloat and increase the optimzation time with little or no benefit to execution time. This becomes a problem for the database administrator to manage.
## Component Design:
### Update gpdbwrappers
Parse pg_statistic_ext and pg_statistic_ext_data
```
test=# SELECT stxname, stxkeys, stxdndistinct, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts';
stxname | stxkeys | stxdndistinct | stxddependencies
---------+---------+---------------+------------------------------------------
stts | 1 2 | | {"1 => 2": 1.000000, "2 => 1": 1.000000}
(1 row)
test=# SELECT stxname, stxkeys, stxdndistinct, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
stxname | stxkeys | stxdndistinct | stxddependencies
---------+---------+---------------+------------------------------------------
stts2 | 1 2 | {"1, 2": 100} |
test=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
index | values | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
0 | {0, 0} | {f,f} | 0.01 | 0.0001
1 | {1, 1} | {f,f} | 0.01 | 0.0001
...
```
### Update to DXL format
Store correlated statistic information.
#### Current Format:
```xml=
<dxl:ColumnStatistics Mdid="1.57361.1.0.1" Name="b" Width="4.000000" NullFreq="0.000000" NdvRemain="0.000000" FreqRemain="0.000000" ColStatsMissing="false">
<dxl:StatsBucket Frequency="0.010000" DistinctValues="1.000000">
<dxl:LowerBound Closed="true" TypeMdid="0.23.1.0" Value="0"/>
<dxl:UpperBound Closed="true" TypeMdid="0.23.1.0" Value="0"/>
</dxl:StatsBucket>
...
</dxl:ColumnStatistics>
```
#### Proposed extensions
Extended Statistics Info Metadata (CDXLExtStatsInfo, CMDExtStatInfo):
```xml=
<dxl:RelationExtendedStatistics Mdid="10.434199.1.0" Name="t">
<dxl:ExtendedStatisticsInfo Oid="434202" Name="stts" Keys="1,2" Kind="FunctionalDependencies"/>
...
</dxl:RelationExtendedStatistics>
```
Functional Dependencies (CDXLExtStats, CMDDependency):
```xml=
<dxl:ExtendedStatistics Mdid="9.434202.1.0" Name="stts">
<dxl:MVDependencyList>
<dxl:MVDependency Degree="0.989901" From="1" To="2"/>
<dxl:MVDependency Degree="0.989901" From="2" To="1"/>
</dxl:MVDependencyList>
</dxl:ExtendedStatistics>
```
Multivariate N-Distinct Counts (CDXLExtStats, CMDNDistinct):
```xml=
<dxl:ExtendedStatistics Mdid="XXX" Name="stts2">
<dxl:StatsNDistincts>
<dxl:StatNDistinct Relationship="1,2" Count="100"/>
</dxl:StatsNDistincts>
</dxl:ExtendedStatistics>
```
Multivariate MCV Lists (CDXLExtStats, CMDMCV):
```xml=
<dxl:ExtendedStatistics Mdid="XXX" Name="stts3">
...
<dxl:StatsMCVLists>
<dxl:StatMCVList Index="0" Values="{0,0}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
<dxl:StatMCVList Index="1" Values="{1,1}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
<dxl:StatMCVList Index="2" Values="{2,2}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
...
</dxl:StatsMCVLists>
</dxl:ExtendedStatistics>
```
Note: It is possible for a EXTENDED STATISTIC object to represent multiple extended types. For example:
```sql
CREATE STATISTICS sttsZ (dependencies, ndistinct, mcv) ON a, b FROM t;
```
In this case, the DXL would combine the types and look like:
```xml=
<dxl:ExtendedStatistics Mdid="XXX" Name="sttsZ">
<dxl:MVDependencyList>
<dxl:MVDependency Degree="0.989901" From="1" To="2"/>
<dxl:MVDependency Degree="0.989901" From="2" To="1"/>
</dxl:MVDependencyList>
<dxl:StatsNDistincts>
<dxl:StatNDistinct Relationship="1,2" Count="100"/>
</dxl:StatsNDistincts>
<dxl:StatsMCVLists>
<dxl:StatMCVList Index="0" Values="{0,0}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
<dxl:StatMCVList Index="1" Values="{1,1}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
<dxl:StatMCVList Index="2" Values="{2,2}" Nulls="{f,f}" Frequency="0.01" BaseFrequency="0.0001"/>
...
</dxl:StatsMCVLists>
</dxl:ExtendedStatistics>
```
### Update CStatistic class
Update to store correlated statistic metadata information. See `CMDAccessor::Pstats()` collect and pass stats info into `CStatistics`. `CStatistics` also needs to store colid to attno mapping because the catalog stores everything as attno, so ORCA must translate between colid to attno.
### Update CFilterStatsProcessor
Use new information in CStatistic to calculate a more accurate scale factor. See following functions...
`CFilterStatsProcessor::MakeStatsFilter()`
`CFilterStatsProcessor::MakeHistHashMapConjFilter()`
`CScaleFactorUtils::CalcScaleFactorCumulativeConj()`
...and trace `scale_factor`. As a result we should produce more accurate `rows_filter` cardinality estimates.
`MakeHistHashMapConjFilter()` computes scale factors. Update to consider correlated statistics to produce scale factors (essentially a.k.a. selectivity) in `CExtendedStatsProcessor::ApplyExtendedStatistics` See clauselist_selectivity
#### Notes:
`ApplyExtendedStatistics()` is ORCA version of `dependencies_clauselist_selectivity()` and its helper functions in dependencies.c. ORCA cannot call the function directly because of DXL abstraction from core backend Postgres code.
## Testing:
There should be no regressions since extended statistics is an opt in feature.
Places to look for (performance) test cases:
* Search for customer use cases
* Check perf pipelines for scenarios (TPC-H and TPC-DS)
* Think of scenarios that highlight the feature
Tests will likely be performing joins/aggregates over highly correlated data sets using the correlated columns.
## Future Work:
It is highly conceivable that there exists MPP specific statistical data from which GPDB query optimizers would benefit. For example, data distribution - track the top N unevenly distributed values across segments to better handle data skew.
For partitioned tables, how should extended statistics be captured from leaf partitions and summarized in the root (e.g. sampling vs merging)? Thread [3] may be of interest.
## References:
[1] https://www.postgresql.org/docs/current/planner-stats.html
[2] https://www.postgresql.org/docs/current/multivariate-statistics-examples.html
[3] https://www.postgresql.org/message-id/flat/c3162511-0e80-039e-765c-1d02d0fd0bab%40enterprisedb.com