# 5: Data Structures (Bloom Filter) and Strategies ###### tags: `Tag(sp22)` * If you are still working on Sprint 0: STOP. You won't be able to get more credit from a Sprint 0 submission now. Instead, recover lost points on Sprint 1. * Hopefully everyone has started Sprint 1 and knows what their component is. If you don't yet, decide with your team **today**. * Steady, consistent effort! * You should all have a mentor to go with your group. Don't think of mentor meetings as technical office hours; instead, view them as process-oriented. Demo functionality. By the end of the first project (4 weeks) you'll have built a recommender system. Sprint 1 is the first part: building the data structures and extensible framework. Last time we talked about kd-trees (and vaguely hinted about the REPL). This time, I want to talk about the other data structure (Bloom filters) and say a lot more on the strategy pattern. ## Tradeoffs In computer science, we're used to tradeoffs. Specifically, you've all learned about the classic _space-time tradeoff_ that caching gives. You might implement this by memoization, or dynamic programming, or (if you're Google, Netflix, or some other popular website) a Content Delivery Network. But what _else_ might we compromise on? Here's a thought: can we ever compriomise on _correctness_ in exchange for a time or space savings? Does that even make sense? Is it possible we've already done that somehow? <details> <summary>Think, then click!</summary> The basic idea underlying hash tables is a compromise on correctness. They're just built to resolve that compromise internally. Here's what I mean. If we assume that every record key hashes to a different table key, hash tables are a miracle (constant-time lookup). But this isn't true: collisions happen. So the table has a (worst-case) linear-time backup plan built in to recover from that compromise. </details> <br/> This idea of compromise on correctness is also valuable in algorithms. E.g., there exist fast algorithms for primality testing that just happen to be wrong sometimes. Is that enough to build something useful out of? Would you maybe want more? <details> Yeah, you'd maybe like something along the lines of: * only wrong in one direction; and * only wrong X% of the time. Note again that we're back to what _guarantees_ a data structure or algorithm provides. When you're building or selecting a data structure or algorithm, consider the needs of your caller and/or end-user. In the case of these prime-testing algorithms, they can very quickly rule out many non-prime numbers, and do so fast enough that it's worth using them as a pre-test before applying a slow, but sound, algorithm. **Technical aside:** I am oversimplifying the algorithm a bit, because there are a few other details. You might learn about these if you take a number theory class, or an algorithms class! </details> <br/> ## Bloom Filters Bloom filters are a data structure for implementing set-membership queries. They work a lot like a hash table. Indeed, the basic structure they build on **is** a hash table. They just happen to not return booleans, or at least not in the same sense as (say) a `Set` in Java. Instead, a Bloom filter returns one of: * the item is _possibly_ in the set; or * the item is _definitely_ not in the set. ### A Common Use Case They're frequently used as an efficient first-pass check to more expensive membership checks on very, very large datasets. (Think: "No, I don't have that YouTube video stored in this server." or "Yes, I might have that YouTube video---let's look for it.") ### Building The Idea Like with kd-trees, Bloom filters build on a simpler idea; in this case, it's hash tables. Start with an array of bits. We can use that to implement an approximate membership check: just hash the key you're looking for, and see if the bit in that cell of the array is `true`, rather than `false`. Here's an example. Suppose we're in charge of Amazon's Prime TV service, and we want to very quickly tell whether a certain CDN node has a show that a local customer requests. We could add a bit array to every node, and when a new piece of media arrived (say, the first chunk of a new series), we'd hash the name of the media and set the corresponding bit to `1`. Like this: ![](https://i.imgur.com/mfzLIUa.png) Note one crucial difference. Instead of saving a reference to the media in the table, we're just saving _one bit_. Very space efficient, especially if we've got a large array. Now what happens if there's a collision? <details> <summary>Think, then click!</summary> The key that the 2nd arrival hashes to is _already set to 1_. The array now can't tell the difference between these two media: if a customer requests the second arrival but the first arrival isn't there, the array will report a false positive. But, either way, a lookup is very, very fast. We can always do a _real_ search through the local media catalogue if the array says we _might_ have what the customer wants. And if the array says it's not here, it definitely isn't. </details> ### Scaling The Idea Bloom Filters are a generalization of this idea. More full details are in the assignment handout, but the basic idea is this: why not have _several_ hash functions, and use all the bits those hash functions produce for adding and looking up membership. That is, if we have 3 hash functions and they produce `{13, 2, 118}`, we'd set those 3 bits to true on arrival, and check those 3 bits on a membership check. The trick is in tuning the hash functions, and deciding how big the table needs to be. #### Is that really the _only_ problem? No. Consider deletion. But we won't make you do that. ## Strategy Pattern in Sprint 1 Our OO design goal for Sprint 1 is the strategy pattern. Thus, all 3 components need to use it, and we've told you one way we expect it to appear for each component. ### Bloom Filter For the Bloom filter, the opportunity appears when you're comparing two Bloom filters as part of the "nearest" operation on categorical data. You can think of this as roughly corresponding to the kd-tree using Euclidian distance (say), except it's a bit more subtle because we don't have coordinates---we have bit arrays. ### kd-Tree In contrast, the kd-tree manages numeric data. But it's still reasonable to imagine a caller who wants to control what comparison metric is used on the data. (See the handout for details.) ### REPL If you want me to be able to use your REPL for my own purposes, the strategy pattern is perfect. Here you'll need to define your own interface---a `Comparator` doesn't really fit. Consider what I might need to do in order to provide my own commands to your REPL, without modifying your REPL's code. (See the handout for details.) ## Code Review Exercise It's time for our code-review exercise! Look over this code and critique it. Remember to evaluate it along 4 dimensions: * its readiness for change; * its fitness for purpose; * its safety from bugs; and * whether it is easy to understand. If you have criticism: * What's the problem? * What's a scenario illustrating the problem? * What is your suggested change? * Is there any nuance, or tradeoffs involved in the change? ![](https://i.imgur.com/EFQ1bE5.png) Go to the [Code Review Exercise Form](https://forms.gle/efMzSoyiXYGigcu37) and provide your feedback! ## More On Comparators Let's go back to the world of comparators. There's one place they are immediately and obviously useful: sorting. Java's [`Collections.sort`](https://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#sort(java.util.List,%20java.util.Comparator)) library method takes in a collection and a `Comparator` that says how to compare objects in the collection. The method signature is: ```java public static <T> void sort(List<T> list, Comparator<? super T> c) ``` Why the `T`? Because the collection might contain anything! The `Comparator` needs to know how to compare objects of type `T`. Indeed, a `Comparator` that knows how to compare objects of some `super`class of `T` would work too. **Question:** What's the _point_ of all this generics stuff? It's a lot of work, and sometimes the definitions get tangled. What do we get out of it? <details> <summary>Think, then click!</summary> More safety from bugs. Back in the bad old days of the 90s, when Tim was learning Java, the language had no generics. So we'd just write `Comparator` without a generic type variable. This was easier at first glance, but it meant we had to do a lot of runtime checking and typecasting. (After all, if you want to iterate over elements of a `Collection`, the best you can do is call them `Object`s unless you're going to typecast.) Generics save a _lot_ of debugging time. They also provide documentation of your intentions and spec to your caller: without that `T` in the `Comparator`, that dependency on the type of objects in the collection would have been completely implicit. In this case, you might think it's obvious, but I hope not! It's very easy to miss things that another engineer thinks are "obvious". </details> (Do you see an opportunity to use a `Comparator` to improve the code you just reviewed?) ### How Do I Create A Comparator There are a few ways. One is to just create a class that implements `Comparator<T>` in the normal way. Just make sure you know what `T` represents! Many, many kd-tree bugs have come from mixing up whether a `Comparator` functions on the class of the _datapoints_, or the internal kd-tree node class that wraps them. Another is to use anonymous classes. Java will let us create a `Comparator` directly, in the place it's needed. Still another is to use _lambdas_. Here is some example code. In the future, I'll put lecture livecode up on a Git repository for you to use directly. ```java= package edu.brown.cs32.livecode.feb10; import java.util.*; public class Main { public static void main(String[] args) { List<Integer> myList = new ArrayList<>(Arrays.asList(17, 3, 42, 44, 8, 9, 1, 100)); System.out.println(myList); // unsorted Collections.sort(myList); // SORT THE LIST FOR ME, JAVA! System.out.println(myList); // hopefully sorted // Concise strategy: lambda Collections.sort(myList, (a, b) -> -a.compareTo(b)); // THAT WAY System.out.println(myList); // hopefully sorted // Less concise, but also in-place strategy: anonymous class Collections.sort(myList, new Comparator<>() { @Override public int compare(Integer o1, Integer o2) { return o2.compareTo(o1); } }); System.out.println(myList); // hopefully sorted // Most verbose, but extensible/externally-visible Collections.sort(myList, new MyComparator()); System.out.println(myList); // hopefully sorted } } class MyComparator implements Comparator<Integer> { @Override public int compare(Integer o1, Integer o2) { return o1.compareTo(o2); } } ``` ### Beyond Comparators All these approaches work for strategies more generally. If you haven't read the notes from last time, and you're doing the REPL component, definitely go back and find the place that suggests a starting point. We'll continue this theme next week, with some more involved livecoding (and review).