# 4: Data Structures (kd-tree) and Strategies ###### tags: `Tag(sp22)` * Sprint 1 is out today. I'll summarize it during class. * 0320 allows you to use external materials. If you find a YouTube video or Wiki page that helps you better than this lecture, go for it! But **don't** use anybody else's code without (really) thinking about it carefully and understanding it well enough that it could be your own. That includes testing it well. And always, _always_ cite your sources and give credit. ## Data Representations <a name="reps"></a> There's an adage you may have heard before: "the query influences the structure". This means that your choice of data structure shouldn't just be about the shape of the data you get, it should also be affected what you plan to _do_ with the data. This may seem obvious, but it's harder than it sounds. In 0320, we're going to make a very careful distinction between the _representation_ of data and the _interface_ it provides. Blurring this distinction leads to pain later in the development cycle. Here's an example. **QUESTION:** What's a list? <details> <summary><B>Think, then click!</B></summary> It depends. Do you mean the set of _operations_ one usually expects a list to be able to perform? Adding, removing, index-based accesses, etc.? If so, a list is an object providing that interface to a caller. Or do you mean a particular data structure? A _linked_ list? Maybe a _doubly linked_ list? A (dynamic) _array_ list? All of these can conform to the `List` interface, although some of them have to do more work than others for certain operations. Don't conflate these two meanings. One is about the functionality you must provide; the other is about how you concretely choose to represent your data. </details> <br/> Keep this in mind when you're splitting up work: decide together on the operations (interface) each component needs from the others early, and keep other criteria broad ("I need you to give me worst-case $O(log(N))$ runtime for the `find` operation", not "I need you to use a Java `TreeList`"). ## Sprint 1 Overview There are 3 components in Sprint 1: * a kd-tree based recommender system; * a Bloom-filter based recommender system; and * an extensible REPL and CSV parser. All of these built on ideas you've already started in Sprint 0. The difference is in depth: we'll ask you to use specific OO design ideas and meet other goals. Each of you will be **primarily responsible** for one of these 3 components, while also providing support and code review to your partners. Make steady, consistent progress! Each sprint is 2 weeks long, and each week you'll have a user story or stories to work on. **Every week**, you'll meet with your mentor and demo your functionality so far. **It is vital that you show something working, even if it's incomplete**: we give partial credit, but not for code that can't be run or features that can't be demoed. Our OO theme for Sprint 1 is the strategy pattern. Each of the 3 components uses it, and we'll start talking about it today. Likewise, we'll cover the two data structures this week: kd-trees today, and Bloom filters next time. ## The Mailbox Problem <a name="mailbox"></a> In the _one-dimensional mailbox problem_, you've got a world consisting of buildings positioned on a very long street. Every building has a numeric location on the street; we'll assume it's always an integer. If I told you that I wanted to give you the addresses of all houses on the street, and you needed to let me check to see whether there was a house at any address, you'd be quite justified in reaching first for the `Set` interface. You might even go for `Map`, anticipating I might later want to add more to the data for each house. These tend to be backed by efficient hash-table data structures. But now there's a twist. I don't just want to check membership (or get data about a specific house). Instead, I've got to route the mail. Why is this hard? Because I'm doing the routing at a very high level: I'm working for a gigantic or district post office! We can't hand-deliver every piece of mail ourselves. So I need to send letters to the right small, local post office for hand-delivery. ![](https://hackmd.io/_uploads/SyUQYG9mY.png) In short, I need to find the _nearest neighbor_ to the target house, from among a set of post office addresses. During Sprint 0, you essentially made a new interface that provided a "find nearest neighbor" functionality and implemented it using linear-time data structure. The new interface is "right" for the problem, but what about the data structure? A linear-time search won't scale as the dataset grows. Our first go-to data structure, the hash table, won't work for nearest-neighbor---hashing will lose information about locality and closeness that are so vital to solving this problem! What would you use instead? <details> <summary><B>Think, then click!</B></summary> This is one of several reasons why we still teach you binary search trees (BSTs). BSTs are absolutely perfect for this problem. Not only are they aware of relative positions (and have to be) but it turns out that the ordinary recursive descent you might perform to search for membership also suffices to find a nearest neighbor: you're guaranteed to always find the nearest neighbor somewhere on the descent! </details> <br/> Suppose we store a set of post offices in the following (balanced!) BST: ![](https://hackmd.io/_uploads/HJztFzcmY.png) If we're seeking a nearest-neighbor to `17`, I claim that it always works to do a search in the tree for `17`. If `17` is found, it is its own nearest neighbor. If `17` isn't found, then its nearest neighbor must be on the path taken during the search. ![](https://hackmd.io/_uploads/HJam9zq7K.png) And `20` is the nearest neighbor to `17`. It won't always be the last node visited, but it must be on the path. Why does this work? <details> <summary><B>Think, then click!</B></summary> Because every time we move down the tree, we move in the direction of the goal. It would be impossible to visit a node `X`, where `X < goal`, and move right when we should have moved left. We get this from one of the BST invariants: _every node is less than (or equal to) all its right descendants_. </details> <br/> ### Moving beyond 1 dimension This seems pretty useful, and 1 dimension can be enough for data organized by (say) price or time, where "nearest" still makes sense. But often you really do need more than one dimension. How might we extend a BST for multiple dimensions? It turns out that there are a few answers. There are solutions, like quad-trees and oct-trees, where nodes have a large number of children. These are often used in (e.g.) video game engines. Today we're going to explore an alternative solution that retains the binary-tree foundation, but still handles membership and nearest-neighbor queries on multiple dimensions: _k-d trees_. (I'm going to completely sidestep the question of balance; it's vital for performance but today we'll be talking about functionality.) K-d trees have nodes corresponding to k-dimensional points. So, in a tree with two dimensions, we might have points like `(0,5)` or `(6,2)`. Here's an example of what I mean: ![](https://hackmd.io/_uploads/rJZ6nGc7F.png) OK, sure: this is a binary tree with 2-dimensional node labels. But what makes it _usable_ in ways other binary trees aren't? A better way to phrase that question is to ask: _what are the invariants of a k-d tree_? The key is that every rank in a k-d tree is focused on a single dimension: starting at index `0`, then moving on to index `1` and so on, and rotating back to `0`. Given that added structure, we can say that for every node `P` at rank `X`: * all left-descendants of `P` have an `X` coordinate less than `P.X`; and * all right-descendants of `P` have an `X` coordinate greater than or equal to `P.X`. (I'm making an arbitrary design choice and saying that same-value entries go to the right. Notice that this design choice is more important here than in a BST: there, you can avoid the question by excluding duplicates; here, you'd have to exclude any point from sharing a single dimension from another.) Here's the tree, annotated more helpfully. (Check that it satisfies the invariants!) ![](https://hackmd.io/_uploads/SkCYRG9mY.png) Now how do we actually _use_ these invariants? Since every rank has a single dimension of interest, we might start by doing a normal recursive descent, like we would on a BST, but ignoring dimensions that the current node has no interest in. Let's try it! We'll search for the nearest neighbor of `(5,5)` in this kd-tree. ![](https://hackmd.io/_uploads/ByVP1mqXt.png) This looks promising; the nearest neighbor is `(6,5)` and we did visit that node in our descent. But will it always work? See if you can think of a target `(X,Y)` coordinate whose nearest neighbor is _not_ along the normal recursive descent. <details> <summary><B>Think, then click!</B></summary> `(5,4)` is one example of this; there are others. Sadly, this means that we can't always get away with a single descent. Which, yes, means that even a balanced kd-tree has a worst-case search time that's worse than logarithmic in the number of nodes. In fact, the worst-case complexity is linear. In the average case, it's still much better than a list, though. </details> <br/> #### Fixing the problem So we can't always get away with a single descent. Surely we aren't doomed to always exploring the _entire_ tree? Maybe we can still exploit the invariants in a useful way. To explore this idea, we'll draw those same 7 points on a 2-dimensional graph, and frame the same search for `(5,4)`: ![](https://hackmd.io/_uploads/r1ArWQcQK.png) Here is where I notice that my drawing app has gridlines, but the exported images don't. Sigh. (TODO: fix for next semester.) The purpose of the invariants is, like in a normal BST, splitting the search space between points "bigger" and "smaller" than the current point. We no longer have a single number line, but we _can_ still split the space. The root note, `(6,5)` is a dimension `0` (X) node, so let's draw a line partitioning the space accordingly: ![](https://hackmd.io/_uploads/rkDdf7q7Y.png) The invariants guarantee that any target node with an X coordinate smaller than 6 _must be located in the left subtree_ of `(6,5)`. Sure, but that's the same reasoning that got us into trouble in the first place. We know that the nearest neighbor might be, and indeed is, in the _right_ subtree of `(6,5)`. The trick is in the diagram. How far away from the goal `(5,4)` is the node we're currently visiting? `sqrt(2)`, assuming that we're using the usual Cartesian distance metric. So we know that any regions of the space further away from the goal than `sqrt(2)` aren't productive to visit. In fact, let's mentally draw a circle of radius `sqrt(2)` around the goal: ![](https://hackmd.io/_uploads/BJbyE7cQt.png) At least, we'll pretend that's a circle. As we follow the normal recursive descent, we'll draw 2 more dotted lines: one for `(0,5)` and another for `(4,1)`. But, sadly, our best estimate remains no closer than `sqrt(2)`. We'll pop back up from `(4,1)` to `(0,5)` and ask: _do we need to explore the right subtree?_ Put another way, is there potentially useful space (space inside the circle) on on the far side of the dotted line we drew for `(0,5)`? By the way, you'll sometimes hear these dotted lines referred to as _cutting planes_. Here's a possibly helpful picture: ![](https://hackmd.io/_uploads/ByJSvmcXF.png) Yes! there's productive space on the far side, and so we do need to recur on the right subtree. In general, we can check for productive space by caluclating: $(GOAL[d] - CURR[d]) \leq dist(GOAL, BEST)$ where `GOAL` is the goal point, `CURR` is the current node (where we arere trying to decide whether or not to recur again), `BEST` is our best candidate so far, and `dist` is the distance formula of your choice. ## Comparators and the Strategy Pattern Data structures like this rely heavily on _comparisons_, and there isn't one single comparison operation that works for every imaginable dataset. Another engineer who is using your data structure might want to provide their own, domain-appropriate comparisons. But if we code a particular comparison into the class, the engineer can't provide their own comparison metric. **Idea Check:** Does this make sense? Don't let me be an Architecture Astronaut! Can you think of ways a user of your kd-tree might want to influence what the $\leq$ operator is on data points? ### OO and FP Agree on the Interface If this reminds you of the _dependency injection_ idea from last week, good! Whether or not we ask the caller to provide the metric to the constructor or at the time we call another method, the idea is the same. Object-Oriented design calls this the _strategy pattern_; Functional Programming calls this a use of _first-class functions_---that is, functions or methods that are values in the language itself and thus can be passed as arguments. You may have heard that in Java, "everything is an object". This isn't strictly true (because primitives like `int` exist) but even if it were, like in (say) Scala, I see no reason why a function can't also be an object. After all, objects carry methods. ### Comparators At the core of the strategy pattern is _humility_: we don't pretend to know every possible scenario that our caller might think of, so we give them a way to customize the larger functionality we provide. Java provides another standard interface that's useful here: `Comparator`. But rather than being implemented by the class being compared, it's implemented by a standalone class whose implementation provides a new comparison method. E.g.: ```java class CartesianComparator implements Comparator<Cartesian> { @Override public int compare(Cartesian o1, Cartesian o2) { // e.g., an implementation of Euclidian distance // (throw an error if #dimensions is different) } } ``` Every instance of the CartesianComparator class contains a _strategy function_ that a kd-tree can use for comparisons. A caller could create an object of this type, and pass it to the tree as needed. And they could write more of these: `WeightedCartesianComparator`, `ManhattanCartesianComparator`, and on and on, limited only by *their* needs and *your* interface. (You'll provide them a suitable interface, won't you?) Finally, remember you're free to ask something of _them_. For instance, if I write a comparator method that always returns `-1`, I'm setting myself up for frustration and debugging in the future **through no fault of yours!** Why isn't it your fault? Can it be there's some obligation I have, when I use [the `Comparator` interface](https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html)? #### Exercise [Brief Note-Card Exercise](https://forms.gle/BLpaZdqcj5RFf7tK9) ### Strategies in REPL Could you have used the strategy pattern when designing your REPL in the Onboarding project? I don't mean the`Comparator` interface: that's just one example of the pattern! Rather, does it make sense to let someone extend the REPL with custom functionality? If so, can we do so in a way that doesn't make them modify the original REPL source at all, but rather lets them provide a function object, much like `Comparator` does? If you pick the REPL component, you'll do exactly that in Sprint 1. But what's a "strategy" in this context? <details> <summary>Think, then click!</summary> A strategy for the REPL might be a method that takes in arguments and runs a specific command. E.g., something that implements an interface like this: interface REPLCommand { void execute(String[] args); } Implemented well, the extensible REPL would allow me to define my own `REPLCommand` strategy objects and _register_ them with your REPL. Then, I could use your code without ever having to edit or---or even look at it, beyond your Javadoc! What guarantees would you offer me, as the engineer trying to use your REPL class as a library? What requirements of "good behavior" might you ask of me? </details>