# SoK: Attestation aggregation
###### tags: `algorithms`
Author(s): Victor Farazdagi (Prysmatic Labs)
Date: Apr 12, 2020
:::danger
:exclamation: If you want to suggest sth, please, feel free to contribute to the [ethresear.ch topic](https://ethresear.ch/t/attestation-aggregation-heuristics/7265).
:::
[TOC]
## 1. Background
Let's start from some relevant pre-requisite information:
- [BLS signature scheme](https://tools.ietf.org/html/draft-irtf-cfrg-bls-signature-00) on which the whole
aggregation process is based: it is precisely because we can produce compressed BLS signatures, aggregation
of non-overlapping attestations is possible.
- There is an amazing [note on Ethereum 2.0 attestation aggregation strategies](https://notes.ethereum.org/@hww/aggregation)
which has the best coverage of current efforts on the problem of attestation aggregation. Strategies
listed there are not essentially algorithms to merge/pack attestations, but more of a higher level approaches
(which include changing of a way how validators' overlay network is constructed and used).
- There's also a 1-pager design of [Better Aggregation Inclusion](https://hackmd.io/@prestonvanloon/better_aggregation_inclusion)
by Preston and Terence.
## 2. Objective
In order to increase profitability, attestations must be aggregated in a way to cover as many individual
attestors as possible.
In terms of expected input/output here is what we have:
- Within the scope of this doc, an attestation is a bitlist of length equal to number of validators
in a given committee, where each bit represents a single validator and there may be one or more of bits set
per attestation.
- Array of attestations is represented by incidence matrix, where rows are attestation instances, and
each column represents a validator inclusion status across all known attestations.
- Given a list of attestations, we are expected to aggregate all the non-overlapping ones, ideally having an aggregation
which includes all the possible attesting participants.
```
Sample attestations:
A0 = [
[0 0 0 1 0], // A
[0 0 1 0 0], // B
[0 1 0 1 0] // C
]
List can be transformed into:
A1 = [
[0 0 1 1 0], // A + B
[0 1 0 1 0] // C
]
Or, even better:
A2 = [
[0 0 0 1 0], // A
[0 1 1 1 0] // B + C
]
```
At the moment, Prysm uses $O(mn^2)$ iterations to search for possible aggregation pairs: the $O(n^2)$ is for pairing
attestations and another $O(m)$ to check for overlaps within each pair ($m$ is number of
validators in a given committee).
It should be noted that, considering the fact that we are dealing with NP-Hard problem here, the runtime is
not that bad, at all!
When it comes to performance of such a traversal (that's how optimal the resultant aggregation is), then
it solely depends on order in which attestations are presented: once a pair of non-overlapping attestations
is located they are immediately merged, without any decision-making whether there's a more profitable
aggregation or not.
Here is a bit contrived example of such a suboptimal case:
```
Given the following attestations:
A0 = [
[0 0 1 0 0],
[0 0 0 1 0],
[0 0 0 0 1],
[1 1 0 0 1],
]
Optimal solution (merge: rows 1, 2 and 4; discard: row 3):
A_OPT = [
[1 1 1 1 1],
]
Solution produced by the current solver:
A1 = [
[0 0 1 1 1],
[1 1 0 0 1],
]
```
So, the main objective is to **find an approximation algorithm that will result in a solution
which is more close to the optimal one**.
:::info
Algorithms are analyzed using $\alpha$-approximation method, where $0 \le \alpha \le 1$ is the approximation ratio, and
an algorithm with a given value of $\alpha$ produces solution which is at least $\alpha$ times the optimal
value.
In our problem, solutions are scored by their cardinality: the more participants we have
within a single aggregated item the better, with the maximum possible size equal to all the validators in
a given committee.
:::
When it comes to run-time efficiency, the aim is to make sure that aggregation doesn't become a bottleneck
in node's processing (it should be able to process thousands of attestations in sub-second runs).
This will be tackled via benchmarking alternative approaches.
## 3. Formal Problem Statement
### 3.1. Definition
**Def (Attestation Aggregation):**
==$AA(U, S, k) \to S'$==
Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let
$S = \{S_1, ..., S_m | S_i \subseteq 2^U\}$ be a collection of its subsets, where
$\bigcup_{i = 1}^m S_i = U$ i.e. all $u \in U$ are present in one of elements of $S$.
Then, **Attestation Aggregation** (AA) is the problem of finding $S' \subseteq S$ that
covers at least $k \in [1..n]$ elements from $U$, and sets in $S'$ are disjoint:
$$
|\bigcup\limits_{T \in S'}T| \ge k
$$
and
$$
S'_i \cap S'_j = \emptyset, \forall i, j \in [1..m], i \ne j
$$
Ideally, we want $\bigcup\limits_{T \in S'}T$ to have maximum cardinality, that's $k = |U|$,
and all $u \in U$ are covered by $S'$:
$$
|\bigcup_{T \in S'}T| = |U|
$$
Since BLS doesn't allow merging overlapping signatures, there's that additional constraint of making sure
that all elements of $S'$ are pairwise disjoint.
To summarize: given a family of sets $S$, we need to find a subfamily of disjoint sets $S'$, which have the same
(or close to same) union as the original family.
The problem is NP-Complete and only allows for logarithmic-factor
polynomial-time approximation.
### 3.2. Comparison to Known Problems
#### 3.2.1. Attestation Aggregation (AA) vs Minimum Set Cover (SC)
In the MSC we have the very same input set system $(U, S)$, but our target $S'$ is a
bit different: we want to find a full cover of $U$ with minimal $|S'|$.
With AA, partial (if still maximal)
coverage is enough, there's no constrains on cardinality of $S'$, and all elements of $S'$ are pairwise
disjoint.
#### 3.2.2. Attestation Aggregation (AA) vs Exact Cover (EC)
Again, we start from the same set system $(U, S)$, and the EC matches the ideal case of our problem
when there exists an optimal solution within a given input $S$. So, if input list of attestations
form (by itself or as any combination of its subsets) a full partition of $U$, the resultant $S'$ for
both EC and AA coincide.
There is on important difference in AA: it allows for partial covers.
#### 3.2.3. Attestation Aggregation (AA) vs Maximum Coverage (MC)
In the [MC](https://en.wikipedia.org/wiki/Maximum_coverage_problem) problem, we want to
find up to $k$ subsets that cover $U$ maximally:
$$
|S'| \le k \land \mathop{argmax}\limits_{S'} |\bigcup\limits_{T \in S'}T|
$$
Important thing to note is that in its conventional form MC doesn't require elements of
$S'$ to be disjoint, which is a problem for our case -- as overlapping attestations cannot be
aggregated.
So, the important differences of AA include: no constraints on cardinality of $S'$, requirement of
pairwise disjoint elements in $S'$.
MC can still be utilized for our purposes: since there exists an approximation algorithm
with $\alpha \approx 0.6$ (pretty impressive) we can rely on it to build partial solution by gradually
increasing $k$ (see the Possible Solutions section below).
### 3.3. Sample problem
Since $S' \subseteq S$, when implementing the relation in code, the optimal solution will be represented
as a list of indexes corresponding to selected items from $S$ (to avoid unnecessary copying):
$$
I = \{i | i \in [1..m] \land S_i \in S'\}
$$
To better illustrate the formal definition, here is a sample problem:
```
// We're given list of attestations, in matrix incidence form.
// Those attestations are essentially instances of S_i:
S = [
[0 0 1 0 1 0], // participating bits are at columns: 3, 5
[0 1 0 0 0 0],
[0 0 0 0 1 0],
[1 0 0 0 0 0],
[0 0 1 0 0 1],
]
// Union of all *column indexes* where participation bit is set to 1,
// for at least one attestation, represents the universe set, U:
U = [1, 2, 3, 5] // w/i given list of attestations, there's no bit at column 4
// We want to come up with a subcollection of attestations,
// which are disjoint and represent maximum number of the validators
// specified in initial list of attestations, S':
I = [2, 3, 4, 5] // merge rows 2, 3, 4, 5
```
## 4. Possible Solutions
So, our problem is closely related to set cover kind of problems to which there exist several
possible approaches, none of which enjoys having a deterministically optimal solution.
Several closely related NP/NP-hard problems (and their variants) have been considered:
- [Set Cover Problem](https://en.wikipedia.org/wiki/Set_cover_problem)
- [Exact Cover](https://en.wikipedia.org/wiki/Exact_cover)
- [Maximum Coverage Problem](https://en.wikipedia.org/wiki/Maximum_coverage_problem)
- [Set Packing](https://en.wikipedia.org/wiki/Set_packing)
- [Max Disjoint Set](https://en.wikipedia.org/wiki/Maximum_disjoint_set)
The Exact Cover and Maximum Coverage problems seem to be the most relevant to what we have at hand,
and with some twisting can be utilized to solve AA.
### 4.1. Set Cover
The [Set Cover problem](https://en.wikipedia.org/wiki/Set_cover_problem) is one of [Karp's 21 NP-Complete problems](https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems).
It seems natural to start from the base covering problem because it serves as a foundation for other
problems, it has a greedy algorithm solver with $ln(n)$ approximation to optimal, and with some effort
we can even make that greedy solver run in a linear time!
**Def (Minimum Set Cover):**
==$MSC(U, S) \to S'$==
Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let
$S = \{S_1, ..., S_m | S_i \subseteq 2^U\}$ be a collection of its subsets, where
$\bigcup_{i = 1}^m S_i = U$. Then, **Minimum Set Cover** (MSC) is the problem of covering
$U$ with a subset $S' \subseteq S$ s.t $|S'|$ is minimal.
Framed like that, this problem doesn't abstract attestation aggregation completely. While MSC produces
a cover of $U$, $S'$ may contain subsets with overlapping elements from $U$, and as such can't be used as input to aggregation
function. So, we need to add an extra constraint -- making sure that all elements in $S'$ are pairwise
disjoint.
:::warning
**Relevant Works:**
- [Approximation Algorithms: Covering Problems](https://courses.engr.illinois.edu/cs583/sp2018/Notes/covering.pdf) by Chandra Chekuri
- [Greedy Approximation: Set Cover](https://www.cs.umd.edu/class/fall2017/cmsc451-0101/Lects/lect09-set-cover.pdf) short intro by Dave Mount
- [github.com/guangtunbenzhu/SetCoverPy](https://github.com/guangtunbenzhu/SetCoverPy) interesting implementation using a combination of greedy approach and Lagrangian relaxation.
:::
**Def (Minimum Membership Set Cover):**
==$MMSC(U, S, k) \to S'$==
The same set system as in MSC, with additional requirement on how many times each $u \in U$ can occur
in elements of $S'$ i.e. $\mathop{max}\limits_{u \in U} |\{T \in S'| u \in T\}| \le k$, for a
nonnegative $k \in [1..m]$.
:::warning
**Relevant Works:**
- [Minimum Membership Set Covering and the Consecutive Ones Property](http://www.akt.tu-berlin.de/fileadmin/fg34/publications-akt/red-blue-hitting-set-swat06.pdf) by Michael Dom, Jiong Guo, Rolf Niedermeier, Sebastian Wernicke
- [Interference in Cellular Networks: The Minimum Membership Set Cover Problem](https://tik-old.ee.ethz.ch/file/4670b05c97a9fd9be8ab793e7704d901/cocoon05.pdf) by Fabian Kuhn, Pascal von Rickenbach, Roger Wattenhofer, Emo Welzl, and Aaron Zollinger
- [CS261: Optimization Handout, A Linear Programming Relaxation of Set Cover](http://theory.stanford.edu/~trevisan/cs261/lecture08.pdf) by Luca Trevisan
:::
:star: Applicability of MMSC:
- Decision version of the problem (whether $S'$ exists) can be used to check for cover.
- When used as $MMSC(U, S, 1)$ i.e. limit number of occurrences of $u \in U$ to a single occurrence, we
effectively transform problem to Exact Cover variant (which matches our ideal case exactly).
Another variant worth mentioning is Partial Set Cover, where we again are looking for $S'$ of minimal
cardinality (just as we do in MSC) which covers at least $k$ elements from universe $U$.
**Def (Partial Set Cover):**
==$PSC(U, S, k) \to S'$==
Consider the same set system as in MSC, with additional parameter $k \in [1..m]$.
Then, **Partial Set Cover** (PSC) is the problem of finding $S' \subseteq S$ of minimal cardinality,
that covers at least $k$ elements of $U$.
:::info
**Partial Set Cover (PSC) vs Maximum Coverage (MC)**
PSC differs from Maximum Coverage problem in a subtle way: in the MC we limit number of subsets
$|S'| \le k$, for maximum covered elements in $U$; in PSC we limit upper bound on how many items are
covered $|\bigcup\limits_{T \in S'}T| \le k$ with $S'$ of minimal cardinality.
:::
:::warning
**Relevant Works:**
- [On Approximating Partial Set Cover and Generalizations](https://arxiv.org/pdf/1907.04413.pdf) by Chandra Chekuri, Kent Quanrud, Zhao Zhang
- [A Unified Approach to Approximating Partial Covering Problems](http://www.mathcs.emory.edu/~ojas/papers/UniPartialCov.pdf) by Jochen Konemann, Ojas Parekh, Danny Segev
:::
:star: Applicability of PSC:
- Again, decision version can be useful, to check the boundaries (gradually increasing $k$) of
$S'$ existence. With $k = |U|$ we effectively have MSC problem. In order for PSC be really useful,
we also need to constrain number of occurrences of $u \in U$ within $S'$ elements i.e. so that all subsets
in $S'$ are pairwise disjoint.
### 4.2. Exact Cover
The [Exact Cover problem](https://en.wikipedia.org/wiki/Exact_cover) is one of [Karp's 21 NP-Complete problems](https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems).
When exact cover exists within a given set system, the Exact Cover abstracts attestation
aggregation perfectly. The problem is that perfectly non-overlapping partitions of $U$ are not
naturally happening in our system
(so making them happen can be an attack vector when solving the problem :question:).
**Def (Exact Cover):**
==$EC(U, S) \to S'$==
Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let
$S = \{S_1, ..., S_m | S_i \subseteq 2^U\}$ be a collection of its subsets, where
$\bigcup_{i = 1}^m S_i = U$. Then, **Exact Cover** (EC) is the problem of covering
$U$ with a subset $S' \subseteq S$ s.t $S'_i \cap S'_j = \emptyset, \forall i, j \in [1..m], i \ne j$.
This NP-Hard problem has a nondeterministic backtrack solver algorithm
([Algorithm X](https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X) by D.Knuth).
The Algorithm X is capable of finding *all* the optimal solutions to the problem.
However, having such an $S$ that there exists a subcollection of pariwise disjoint subsets that
cover $U$ completely is a rare luck in our system (is it :question:). More than often $S$ will not contain the solution to EC.
In such cases, we still want some partial solution, even if only part of attesters can be collected
within a single aggregation.
So, adding constraint similar to MMSC (where we limited number of times $u \in U$ can occur in $S'$), we need to transform the problem into accepting another parameter $k \in [1..n]$, with the purpose of
finding the $S'$, where $|\bigcup\limits_{T \in S'}T| \ge k$ i.e. union of elements of found subsets
covers at least $k$ elements of $U$. Then by gradually increasing $k$ we want it to be as close
to $|U|$ as possible (max k-cover? :question:).
:::warning
**Relevant Works:**
- [Dancing Links](https://arxiv.org/pdf/cs/0011047v1.pdf) by D.Knuth
- [Analysing Algorithm X](https://11011110.github.io/blog/2008/01/10/analyzing-algorithm-x.html) Good overview of algorithm (with additional heuristic to improve performance if single solution is enough)
- [github.com/Kappeh/dlx](https://github.com/Kappeh/dlx) (not particularly optimized implementation)
:::
:star: Applicability of EC:
- If solution exists, then Algorithm X (effectively implemented using DLX) can find it. If full solution
is impossible, we need to explore possibility of finding partial cover.
### 4.3. Maximum Coverage
**Def (Maximum Coverage):**
==$MC(U, S, k) \to S'$==
Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let
$S = \{S_1, ..., S_m | S_i \subseteq 2^U\}$ be a collection of its subsets, where
$\bigcup_{i = 1}^m S_i = U$. Then, **Maximum Coverage** (MC) is the problem of
finding $S' \subseteq S, |S'| \le k$ covering $U$ with maximum cardinality, that's
$$
|S'| \le k \land \mathop{argmax}_{S'}|\bigcup_{T \in S'}T|
$$
:::warning
**Relevant Works:**
- [Analysis of the Greedy Approach in Problems of Maximum k-Coverage](https://hochbaum.ieor.berkeley.edu/html/pub/HPathria-max-k-coverage-greedy.pdf) by Dorit S. Hochbaum, Anu Pathria
- [Online maximum k-coverage](https://core.ac.uk/download/pdf/11427978.pdf) by Giorgio Ausiello, Nicolas Boria, Aristotelis Giannakos, Giorgio Lucarelli, Vangelis Th. Paschos
- [A Multi-objective Disjoint Set Covers for Reliable Lifetime Maximization of Wireless Sensor Networks](https://www.researchgate.net/publication/271136384_A_Multi-objective_Disjoint_Set_Covers_for_Reliable_Lifetime_Maximization_of_Wireless_Sensor_Networks) by Bara’a A. Attea, Enan A. Khalil, Suat Özdemir & Oktay Yıldız
- [Approximating Maximum Disjoint Coverage in Wireless Sensor Networks](https://www.academia.edu/4041695/Approximating_Maximum_Disjoint_Coverage_in_Wireless_Sensor_Networks) by Shagufta Henna and Thomas Erlebach
- [The Online Disjoint Set Cover Problem and its Applications](https://arxiv.org/pdf/1411.5739.pdf) by Ashwin Pananjady, Vivek Kumar Bagaria, Rahul Vaze
:::
:star: Applicability of MC:
- With additional requirement of $S_i \cap S_j, \forall i, j \in [1..m], i \ne j$
(pairwise disjoint sets in $S'$) we can have a very useful mechanism to build approximate solutions using
greedy approach.
## Summary and Further Work
So, possible solutions can be enumerated as following:
- **Exact Cover (EC)**
- Can be used to check for solutions if situations when perfect solution exist are not rare.
- If combined with Partial Set Cover (PSC) for partial cover solutions, can match Attestation Aggregation perfectly.
- **Maximum Coverage (MC)**
- Greedy algorithm + additional constraint of disjoint sets in $S'$
- Gradual increase of $k$ ($1 \to |S|$) to obtain maximal cover for a maximum number of available attestations.
Some items to consider:
- We, need a section with formal runtime analysis -- just to make sure upper bounds of what can be done is known.
- Like for many NP-Hard problems, one can get polynomial time algorithm, if the problem size is fixed
(parameterized complexity) .
Say, for $S'$ subsets, if we only look into solutions of size 4, we are dealing with $O(|S|^4)$ items,
even if we enumerate every possible 4-tuples. We might consider some optimization based on this.
- Since, we're dealing with NP-Hard problems, there's little chance we will not be bothered by runtime
complexity. Another possible attack vector is to rely on parallel computing.
- Consider combining algorithms or running individual algorithms in rounds.
- There are number of highly effective aggregation heuristics that rely on how data is transmitted i.e.
instead of concentrating on covering arbitrary set systems, we try to come up with heuristic that will
result in a preferable attestations propagating the network (
see [Heuristically Partitioned Attestation Aggregation](https://notes.ethereum.org/@protolambda/Hkjl_L6_H)
for a very interesting approach).