Filters

Open Questions

  • What is the narrative:
    • improve build time to speed up in-memory queries (push-down in join queries)
    • use filters for reducing I/O (computational storage)
    • comparison of all filter types that exist
  • What do we want to optimize?
    • Build time: relevant at all for the distributed use case?
    • Probe time: Harald did a lot of research regarding this point (why do we need another paper?)
    • Space consumption: does not matter in the non-distributed use case, transferring data is no problem (except for computational storage)
  • Why should we use filters for push-downs at all? One could use the original hash table instead (far more precise, no additional space needed) -> needs some kind of benchmark to underline the importance
  • Partitioning: improves performance greatly but is already done "accidentally" for work partitioning
  • What about updates and deletions (possible for some filter types using counters, but degrades performance)?
  • What about duplicates (counting them, bag semantics)?

  • Compare all filters
  • Optimal
  • Compare to state of the art
  • Optimization
  • Raw Runtime
  • Use cases: LSM-tree (low FPP, reconstruct), identify the most important factors

Adaptive Cuckoo Filters

References

[1] Performance-optimal filtering: Bloom overtakes Cuckoo at high throughput

  • Bloom- and Cuckoo-Filter Implementation
  • Magic Modulo Function (more cycles than other approaches, needs more hash bits)
  • Performance-optimal with respect to query-time
    • high work time per hit -> Cuckoo (lower false positive rate)
    • low work time per hit -> Bloom (faster access times)
  • In-depth analysis of access times depending on SIMD optimization and filter size (cache-resident or not)
  • SIMD lookup based on GATHER-AND-CMP, i.e. multiple keys are looked up simultaneously

[2] Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters

  • Introduction of the new filter
  • Build and lookup algorithm
  • Comparison to other filters (Bloom, Cuckoo, semi-sorted Cuckoo)
  • Fast access times and small filter sizes
  • Modulo is replaced with mulh (needs more hash bits)
  • How about SIMD?

[3] Succinct Data Structures for Retrieval and Approximate Membership

[4] All hash table sizes you will ever need

  • Constant table sizes that are prime (can deal with bad hash functions)
  • Modulo is replaced with Magic Numbers (needs more hash bits)

[5] Vacuum Filters: More Space-Efficient and Faster Replacement for Bloom and Cuckoo Filters

  • Claims to be smaller and faster while offering a good FPR; how does it compare to the Xor filter?
  • Allows deletions and resizing; how relevant is this in practice?
  • "Vacuum filters provide higher lookup and insertion throughput, due to better data locality."
  • "Compared to cuckoo filters, vacuum filters reduce the space cost by > 25% on average."
  • Improved eviction process; not entirely clear why this is so special, look into [16]
  • "Note even if IUPR is not used, vacuum filters provide better performance than Bloom or cuckoo filters in most concerned metrics including the space cost, throughput, and false positive rate." (quite strong wording and probably not true)
  • code: Vacuum Filter GitHub

[6] Stable Learned Bloom Filters for Data Streams

[7] Space/Time Trade-offs in Hash Coding with Allowable Errors

  • Introduces Bloom Filters, should be cited

[8] The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables

[9] Cache-, Hash- and Space-Efficient Bloom Filters

[10] A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

[11] Vectorized bloom filters for advanced SIMD processors

[12] Cuckoo Filter: Practically Better Than Bloom

[13] Morton Filters: Faster, Space-Efficient Cuckoo Filters via Biasing, Compression, and Decoupled Logical Sparsity

[14] Improving Database Performance on Simultaneous Multithreading Processors

[15] Improving Hash Join Performance through Prefetching

[16] Algorithmic improvements for fast concurrent cuckoo hashing

[17] Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network Processing

[18] Horton Tables: Fast Hash Tables for In-Memory Data-Intensive Computing

  • look into it as an alternative to the cuckoo hash table

[19] Optimal Bloom Filters and Adaptive Merging for LSM-Trees

[20] Cuckoo Index: A Lightweight Secondary Index Structure

[21] Less Hashing, Same Performance: Building a Better Bloom Filter

[22] Universal hashing and k-wise independent random variables via integer arithmetic without primes

Hardware

Parallelization

  • Number of CPU cores
  • Number of Threads (SMT)

Caches

  • L1 cache size
  • L2 cache size
  • L3 cache size

SIMD

  • AVX2 (legacy, relevant?)
  • AVX512

Register

  • 32bit (legacy, relevant?)
  • 64bit

Utilities

Hash Function

  • single MurmurHash/CityHash iteration
  • crc32
  • AES iteration (AquaHash)
  • xor/shift combination
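
A minimal sketch of what is meant by an xor/shift combination, here the 64-bit MurmurHash3 finalizer (the constants are the standard Murmur3 ones; the mixer actually used may differ):

    #include <cstdint>

    // Sketch of an xor/shift style hash: the 64-bit MurmurHash3 finalizer.
    inline uint64_t xorshift_hash(uint64_t key) {
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;
        key ^= key >> 33;
        key *= 0xc4ceb9fe1a85ec53ULL;
        key ^= key >> 33;
        return key;
    }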

Addresser

  • Power of Two
  • Prime: prime numbers from Neumann blog
  • Lemire: multiply-shift combination [2] (see sketch below)
    • given: hash(key) ∈ [0, 2^32 - 1]
    • find: index ∈ [0, s - 1] (s < 2^32)
    • index = hash(key) / 2^32 * s = hash(key) * s / 2^32 = mulh(hash(key), s)
  • Magic: fast magic number operation [1]
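
A minimal sketch of the Lemire addresser, assuming a 32-bit hash and a table size s < 2^32:

    #include <cstdint>

    // Lemire multiply-shift range reduction: maps a 32-bit hash to [0, s)
    // without a modulo by taking the high half of a 64-bit product,
    // i.e. index = floor(hash / 2^32 * s) = mulh(hash, s).
    inline uint32_t lemire_index(uint32_t hash, uint32_t s) {
        return static_cast<uint32_t>((static_cast<uint64_t>(hash) * s) >> 32);
    }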

Partitioning

  • software-managed buffers and streaming instructions
  • might require multilevel partitioning

Bloom Filter

  • Dimensions:
    • k = #hash_functions = #set_bits_per_key
    • m = filter_size
  • Metrics:
    • false positive rate
    • build time
    • lookup time (counting / building bitmap)

Optimizations

Loop Unrolling

  • Build: does it help at all (instruction cache misses)?
  • Lookup:
    • first compares are more likely to succeed than later ones
    • train CPU branch prediction
    • alternative: use likely/unlikely hints for code layout (see sketch below)
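
A minimal sketch of an unrolled k = 4 lookup with likely/unlikely-style hints (__builtin_expect is the GCC/Clang spelling); the power-of-two filter layout and the double-hashing probe derivation are illustrative assumptions, not the final design:

    #include <cstdint>

    struct BloomView {
        const uint64_t* words;
        uint64_t m_mask;  // filter size in bits minus 1, size a power of two
    };

    // i-th probe position via double hashing (illustrative choice).
    inline uint64_t probe_bit(uint64_t h1, uint64_t h2, int i, uint64_t m_mask) {
        return (h1 + static_cast<uint64_t>(i) * h2) & m_mask;
    }

    inline bool test_bit(const BloomView& f, uint64_t bit) {
        return (f.words[bit >> 6] >> (bit & 63)) & 1u;
    }

    // Unrolled lookup: each compare gets its own branch; misses are marked
    // unlikely so the all-hits (positive) path stays straight in the layout.
    inline bool lookup_k4(const BloomView& f, uint64_t h1, uint64_t h2) {
        if (__builtin_expect(!test_bit(f, probe_bit(h1, h2, 0, f.m_mask)), 0)) return false;
        if (__builtin_expect(!test_bit(f, probe_bit(h1, h2, 1, f.m_mask)), 0)) return false;
        if (__builtin_expect(!test_bit(f, probe_bit(h1, h2, 2, f.m_mask)), 0)) return false;
        return test_bit(f, probe_bit(h1, h2, 3, f.m_mask));
    }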

Partitioning

  • How large should the partitions be chosen (probably L1 cache size)?
  • Build: can improve build time drastically
  • Lookup: is partitioning amortized by increased locality?

Prefetching

  • helper thread [14]
  • group prefetching [15]
  • software-pipelined [15]
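
A rough sketch of the group-prefetching scheme [15] for a register-blocked Bloom filter; the 64-bit block layout, the k = 4 mask construction and the group size G = 16 are assumptions for illustration, and prefetching uses the GCC/Clang __builtin_prefetch intrinsic:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t G = 16;  // group size, to be tuned

    // Illustrative k = 4 mask within a 64-bit block (4 x 6 hash bits).
    inline uint64_t make_mask(uint64_t h) {
        uint64_t mask = 0;
        for (int i = 0; i < 4; ++i) mask |= uint64_t{1} << ((h >> (i * 6)) & 63);
        return mask;
    }

    // blocks: power-of-two number of 64-bit blocks, block_mask = #blocks - 1.
    std::size_t lookup_group_prefetch(const uint64_t* blocks, uint64_t block_mask,
                                      const uint64_t* hashes, std::size_t n) {
        std::size_t hits = 0;
        for (std::size_t base = 0; base < n; base += G) {
            std::size_t end = std::min(base + G, n);
            // Phase 1: issue prefetches for the blocks of all keys in the group.
            for (std::size_t i = base; i < end; ++i)
                __builtin_prefetch(&blocks[(hashes[i] >> 32) & block_mask], 0, 0);
            // Phase 2: probe; the memory latency from phase 1 is overlapped.
            for (std::size_t i = base; i < end; ++i) {
                uint64_t m = make_mask(hashes[i]);
                hits += (blocks[(hashes[i] >> 32) & block_mask] & m) == m;
            }
        }
        return hits;
    }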

SIMD

  • size of keys becomes relevant (in theory restricts filter size due to gather instruction, can be overcome using partitioning)
  • Lookup: should be easy to SIMDify using gather (see the sketch after this list)
  • Build:
    • horizontal: insert multiple keys at the same time
      • problem: writes to the same address
      • solution: each lanes processes keys from a different partition
      • will this approach scale (L1 cache misses will probably increase)?
    • vertical: insert one key using SIMD instructions (not applicable to naive and blocked bloom filters (distinct addresses))
      • problem: finding the right block cannot be parallelized
      • solution: precompute block addresses for multiple keys
      • sectorized (B = 512 bit): compute mask for entire line using SIMD and write it out
      • sectorized (B != 512 bit): compute masks for multiple keys in one SIMD register and write them out sequentially (does not require scatter)
      • grouped: compute mask for multiple keys in one SIMD register and write them out sequentially
        • mask belonging to the same block can be written out together
        • can be done using scatter but should also be possible by first distributing the mask to an entire SIMD register and then using a standard write
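
A sketch of the gather-based lookup (GATHER-AND-CMP as in [1]) for a register-blocked Bloom filter with 32-bit blocks, using AVX2; the caller is assumed to have derived the block indices and probe masks of 8 keys from their hashes beforehand:

    #include <immintrin.h>
    #include <cstdint>

    // Returns an 8-bit bitmap, one bit per key (1 = possibly contained),
    // which can feed the counting / bitmap-building lookup metric.
    inline int lookup8_avx2(const uint32_t* blocks, __m256i block_idx, __m256i mask) {
        // Gather the 32-bit block of each of the 8 keys.
        __m256i data = _mm256_i32gather_epi32(
            reinterpret_cast<const int*>(blocks), block_idx, /*scale=*/4);
        // A key may be contained iff all of its mask bits are set in its block.
        __m256i hit = _mm256_cmpeq_epi32(_mm256_and_si256(data, mask), mask);
        return _mm256_movemask_ps(_mm256_castsi256_ps(hit));
    }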

Addressing

  • use different addressing schemes than Power of Two
  • do we need this in combination with partitioning?
  • reduce space consumption
  • is space consumption relevant for the in-memory case (why optimize for it when we are mostly interested in runtime and FPR)?

Variants

Naive

  • set k arbitrary bits in the entire filter

Blocked (Cache- / Register-Blocked)

  • determine block of size B
  • set k arbitrary bits within this block
  • B ∈ {32 bit, 64 bit, 128 bit, 256 bit, 512 bit}
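
A minimal scalar sketch of the register-blocked variant with B = 64 bit and k = 4; the way block index and bit positions are sliced out of one 64-bit hash is an illustrative choice, not the final design:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct RegisterBlockedBloom {
        std::vector<uint64_t> blocks;  // number of blocks is a power of two
        uint64_t block_mask;           // #blocks - 1

        explicit RegisterBlockedBloom(std::size_t num_blocks)
            : blocks(num_blocks, 0), block_mask(num_blocks - 1) {}

        // k = 4 bits inside the 64-bit block, 6 hash bits per position.
        static uint64_t make_mask(uint64_t h) {
            uint64_t mask = 0;
            for (int i = 0; i < 4; ++i) mask |= uint64_t{1} << ((h >> (i * 6)) & 63);
            return mask;
        }

        void insert(uint64_t h) {
            blocks[(h >> 32) & block_mask] |= make_mask(h);
        }

        bool contains(uint64_t h) const {
            uint64_t mask = make_mask(h);
            return (blocks[(h >> 32) & block_mask] & mask) == mask;
        }
    };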

Sectorized

  • first determine block of size B
  • distribute the k bits uniformly over S sectors of size B / S in this block
  • B ∈ {32 bit, 64 bit, 128 bit, 256 bit, 512 bit}
  • S ∈ {1, 2, 4, 8, 16, 32, 64}
  • B / 8 and k multiple of S
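
A sketch of the sectorized mask construction for B = 256 bit, S = 8 sectors of 32 bit and k = 8 (one bit per sector, so k is a multiple of S); which hash bits feed which sector is an illustrative assumption:

    #include <array>
    #include <cstdint>

    // One 32-bit word per sector; with this layout each sector maps directly
    // onto one SIMD lane when the mask is applied to the block.
    inline std::array<uint32_t, 8> sectorized_mask(uint64_t h) {
        std::array<uint32_t, 8> mask;
        for (int s = 0; s < 8; ++s) {
            mask[s] = uint32_t{1} << ((h >> (5 * s)) & 31);  // 5 hash bits per sector
        }
        return mask;
    }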

Cache-Sectorized / Grouped

  • first determine block of size B
  • the block is subdivided into z groups
  • each group consists of S / z sectors
  • for each group determine one sector
  • set k / z arbitrary bits in this sector
  • B ∈ {32 bit, 64 bit, 128 bit, 256 bit, 512 bit}
  • z ∈ {1, 2, 4, 8}
  • B / 8 and k multiple of z
  • S ∈ {1, 2, 4, 8, 16, 32, 64}
  • B / 8 multiple of S

Xor Filter

  • fingerprint based, 3 arrays
  • lookup: fingerprint(x) == a1(h1(x)) xor a2(h2(x)) xor a3(h3(x)) (see the sketch after this list)
  • problem: constructing the 3 arrays such that the lookup property holds for all inserted keys
  • 4-stage build process
      1. Stage: for each index in the arrays, build the set of fingerprints that are mapped to this index by the corresponding hash function
      2. Stage: find the sets of size 1 in each array
      3. Stage: push fingerprints in the correct order onto a stack
      4. Stage: build the filter from the stack
  • could probably merge stages 3 and 4, no need for a stack
  • sets in stage 1 are optimized using xor and a counter
  • somehow connected to Bloomier filters, more research needed
  • what about tweaking the number of arrays (2 or 4)? probably not a good idea
  • Dimensions:
    • b = #bits_per_tag
    • m = filter_size
  • Metrics:
    • false positive rate
    • build time
    • lookup time (counting / building bitmap)
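
A sketch of the lookup property above with b = 8 bit fingerprints; the hash derivation (one 64-bit mixer, different bit ranges for the fingerprint and the three indices) is illustrative, and the build that fills a1, a2, a3 is omitted:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct XorFilter8 {
        std::vector<uint8_t> a1, a2, a3;   // all three arrays have the same length

        static uint64_t mix(uint64_t k) {  // Murmur3-style finalizer
            k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
            k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
            k ^= k >> 33; return k;
        }

        bool contains(uint64_t key) const {
            uint64_t h = mix(key);
            uint8_t fp = static_cast<uint8_t>(h);   // fingerprint(x)
            std::size_t n = a1.size();
            std::size_t i1 = (h >> 8)  % n;         // h1(x)
            std::size_t i2 = (h >> 24) % n;         // h2(x)
            std::size_t i3 = (h >> 40) % n;         // h3(x)
            return fp == static_cast<uint8_t>(a1[i1] ^ a2[i2] ^ a3[i3]);
        }
    };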

Optimizations

Software-Managed Buffers

  • Authors use SMB in stage 1 to reduce random memory accesses
  • all fingerprints mapping to the same area of the array are buffered until the buffer is full
  • probably useless in combination with partitioning

Partitioning

  • could improve build performance drastically
  • maybe also lookup, more experiments needed
  • probably could use Power of Two addressing without wasting space

Addresser

  • Authors looked into Lemire and Power of Two
  • Prime could be interesting
  • Magic Numbers are too expensive

SIMD

  • Stage 1 is easy to optimize
  • problem: remaining stages will suffer from races
  • solution: execute on each SIMD lane the same stage for a different partition

Prefetching

  • quite difficult for the build phase (helper thread probably not applicable)
  • lookup:
    • group prefetching: can be combined with SIMD
    • software pipeline: should work on SIMD register granularity
    • helper thread: probably no point in using it, most of the work is memory accesses

Cuckoo Filter

  • store fingerprints of each key in a hashtable
  • use the cuckoo hashing scheme (what about other schemes, e.g. linear probing, robin hood, etc.?)
  • if the destination bucket is full, use the alternative bucket (see the sketch after this list):
    • B1 = hash(x) mod m
    • B2 = (B1 xor hash(fingerprint(x))) mod m
    • self-inverse if m is a power of two
  • building the filter can fail in theory (restart with different hash function)
  • Dimensions:
    • l = #tags_per_bucket
    • b = #bits_per_tag
    • m = filter_size
  • Metrics:
    • false positive rate
    • build time
    • lookup time (counting / building bitmap)
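
A minimal sketch of the bucket computation above for m a power of two; tag_hash stands for hash(fingerprint(x)):

    #include <cstdint>

    // m_mask = m - 1 with m (number of buckets) a power of two.
    inline uint32_t bucket1(uint64_t key_hash, uint32_t m_mask) {
        return static_cast<uint32_t>(key_hash) & m_mask;
    }

    // Alternate bucket: B2 = B1 xor hash(fingerprint(x)). Applying the same
    // xor to B2 gives back B1, so either bucket can reach its partner without
    // knowing the original key (self-inverse).
    inline uint32_t alt_bucket(uint32_t bucket, uint32_t tag_hash, uint32_t m_mask) {
        return (bucket ^ tag_hash) & m_mask;
    }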

Optimizations

SIMD

  • horizontal: load both buckets of multiple keys
    • bucket size must be smaller than the SIMD register (by at least a factor of 4)
  • vertical: load only one bucket of a single key and search for the given tag in the entire bucket (see sketch below)
    • bucket must be the size of a SIMD register, no gather needed
    • low parallelism, but probably better space usage
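
A sketch of the vertical lookup where one bucket is exactly one 128-bit SIMD register holding 8 x 16-bit tags (SSE2 intrinsics; tag width and bucket size are assumptions):

    #include <immintrin.h>
    #include <cstdint>

    // Search a whole bucket for one tag with a single broadcast-and-compare;
    // no gather is needed because the bucket itself fills the register.
    inline bool bucket_contains(const uint16_t* bucket /* 8 tags */, uint16_t tag) {
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bucket));
        __m128i q = _mm_set1_epi16(static_cast<short>(tag));
        return _mm_movemask_epi8(_mm_cmpeq_epi16(b, q)) != 0;
    }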

Addressing

  • use the Magic addresser to reduce space overhead (see sketch below)
    • needs a different function to compute the alternative bucket (must stay self-inverse)
    • B1 = hash(x) mod m
    • B2 = -(B1 + hash(fingerprint(x))) mod m
  • should also work for Lemire and Prime
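
A sketch of the alternate-bucket function above for a non-power-of-two m; it stays self-inverse because -(B2 + h) = -(-(B1 + h) + h) = B1 (mod m):

    #include <cstdint>

    // B2 = -(B1 + hash(fingerprint(x))) mod m, self-inverse for any m.
    inline uint32_t alt_bucket_mod(uint32_t bucket, uint32_t tag_hash, uint32_t m) {
        uint64_t sum = (static_cast<uint64_t>(bucket) + tag_hash) % m;
        return static_cast<uint32_t>((m - sum) % m);
    }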

Partitioning

  • maybe always choose the filter size of each partition as a power of two (competes with the Vacuum filter)

Prefetching

  • software pipelining and group prefetching should work nicely
  • helper thread is more complicated but possible:
    • prefetch the first bucket
    • the main thread still has enough work finding an empty slot and handling full buckets (accessing alternate buckets)
    • problem: what if main thread is stuck in eviction chain?

Variants

Semi-Sorting

  • TODO

Vacuum Filter

  • Cuckoo filters suffer from a lot of random memory accesses due to the alternate bucket
  • idea: make sure that the alternate bucket is nearby (see the sketch after this list)
  • different alternate bucket function:
    • B2 = B1 xor (hash(fingerprint(x)) mod L)
    • L is power of two
    • distance between B1 and B2 is less than L
  • problem: for small L the probability of a successful build decreases
  • solution: use different L values for different keys (based on radix)
  • only works for a large number of keys (> 2^18)
  • author applied semi-sorting in padding for optimization
  • probably useless in combination with partitioning, hash table is already small enough
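
A sketch of the vacuum-style alternate-bucket function above; L is a power of two, and the number of buckets is assumed to be a multiple of L so the result stays in range:

    #include <cstdint>

    // B2 = B1 xor (hash(fingerprint(x)) mod L): the xor only touches the low
    // log2(L) bits, so B1 and B2 lie in the same aligned chunk of L buckets
    // and the mapping stays self-inverse.
    inline uint32_t vacuum_alt_bucket(uint32_t bucket, uint32_t tag_hash, uint32_t L) {
        return bucket ^ (tag_hash & (L - 1));
    }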

Morton Filter

Quotient Filter