MvD Validation

## Algorithm Description ### Initialization Phase From the given records, we build a PLI storage (HashMap) that maps from a set of attributes to the corresponding PLI. We initially calculate the PLIs for each single attribute (to be exact, for each set containing just one attribute) and store it. When intersecting two PLIs later, we extend this storage with every new PLI for reuse. ### Validation Phase #### Trivial MvDs In the beginning of the validation phase, we always check if the current MvD is a trivial MvD (X ->> X or X u Y = R) and instantly return true if thats the case. #### PLI Intersection In the initializing step, we already generated a PLI for each individual attribute. But very often, X consists of multiple attributes (our algorithm does not need PLIs for Y and Z, thus no PLI will be generated for them). So before continuing, we might need to intersect existing PLIs to gain a PLI for X in the end. The intersection algorithm is implemented according to the slides. After intersecting two PLIs, we store the newly generated for later use and for later intersection. For complex attribute sets of X, one intersection of two existing PLIs might not be enough. In that case we need to perform multiple intersections to obtain a PLI for X. To optimally utilize the stored PLIs (which might be PLIs for multiple attributes), we have to consider all of our generated PLIs and choose them such that the amount of intersections to get to the desired PLI is minimal. As our PLI storage maps a set of attributes to a PLI, we search for the minimal amount of known attribute sets in the storage such that they add up to X. This is more commonly known as the "Set Cover Problem". ##### Set Cover Algorithm Because the computation of the minimal number of subsets for X is neither fast nor trivial, we use a greedy algorithm to get the subsets fast but not necessarily the optimal ones. The greedy algorithm works by choosing the greatest available subset (from our storage) of X in every step until the selected subsets add up to X. To improve the runtime of this greedy algorithm, we filter all of our attribute sets that we have a PLI for by selecting those that are a subset of X. Then, in every step after choosing the greatest subset, we remove every candidate that contains and attribute of the just removed set. With that, the remaining candidates will always be a subset of X minus the already selected sets. Ultimately we can omit the subset check completely. Additionally, the number of candidates shrinks with every step. A little test using the flights_1k table with 100 attributes showed a 5x speedup compared to iterating over every known set of attributes and selecting the greatest one until they add up to X. #### Checking if X ->> Y holds At this point, we have successfully generated the PLI for X. For every group in the PLI of X, we check whether the number of distinct values in Y for the rows in this group multiplied by the number of distinct values in Z (R/X U Y) is equal to the size of the group. If this holds for every group in the PLI for X, then X->>Y is true. If every value in Y for the specific rows are the same, we know that all the XY values for these rows are the same and therefore the statement X->> Y is true.