# KD-Tree Discovery
## Introduction
This document is mainly about solving a problem from the Algo2 class on Feb 14th, namely the following question.
*Question: What happens to KD-tree building and querying when all the points have the same $x$-coordinate, given that we do not know in advance whether the points share a coordinate?*
This problem is actually not hard. I will first explain it using the theory from our classes. Then, to make my conclusion more convincing, let's see what the industry does about such problems. The industry has been using the KD-tree technique for over 10 years, and one of its most famous applications is nearest-neighbor searching.
## Theoretical Analysis
### A Notice
In the lecture notes, we can see a point "in" each rectangle. But that point actually lies on the cutting line, so it sits on the edge of its rectangle, and it represents the cutting line. Consider the following example:

And we are searching on the tree shown on the right, not the left one. Thinking about it this way, we can explain the "all points have the same $x$-coordinate" problem.
### Solve Duplicate Problem in BST
Since the KD tree and the BST are both trees, we can first try to get some intuition from the duplicate-key problem in the BST.
There are two classical ways to handle duplicates in a BST (a combined sketch of both follows this list):
- The idea from CLRS (which does not care much about duplicates): let $k_2$ be the key of the current node and $k_1$ the key to insert. If $k_1 < k_2$, put $k_1$ in the left subtree; otherwise ($k_1 \ge k_2$), put it in the right subtree. But this rule introduces an imbalance problem, shown in the following figure:

- Idea from the "Goodrich and Tamassia" (similar to AVL): adding another weighted factor to those values. This tree will still be balanced:

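Here is a minimal BST sketch contrasting the two strategies. The function names and the secondary `tag` key are my own illustration, not notation from either textbook, and the balancing rotations are omitted:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert_clrs(root, key):
    """CLRS-style insert: duplicates (key >= node.key) always go right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert_clrs(root.left, key)
    else:
        root.right = insert_clrs(root.right, key)
    return root

def insert_tiebreak(root, key, tag):
    """Tie-breaking insert: the effective key (key, tag) is unique, so the
    usual balancing machinery (AVL rotations, omitted here) keeps working."""
    if root is None:
        return Node((key, tag))
    if (key, tag) < root.key:
        root.left = insert_tiebreak(root.left, key, tag)
    else:
        root.right = insert_tiebreak(root.right, key, tag)
    return root

root = None
for _ in range(5):
    root = insert_clrs(root, 42)   # five equal keys -> a right-leaning chain
```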
### Transfer those ideas from BST to KD Tree
#### The "CLRS" Idea (Not recommended)
Since all the points have the same $x$-coordinate, every node on an $x$ layer has only one child, so the tree becomes unbalanced. The small experiment below confirms this.
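Below is a minimal sketch of a point-by-point 2-d tree insert that sends ties to the right (my own toy implementation, not the lecture's build procedure). With all $x$-coordinates equal, every comparison at an $x$ level yields "go right":

```python
import random

class Node:
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(root, point, depth=0):
    """Insert into a 2-d tree; ties go right, as in the "CLRS" rule."""
    if root is None:
        return Node(point)
    axis = depth % 2                        # 0: compare x, 1: compare y
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:                                   # equal keys go right
        root.right = insert(root.right, point, depth + 1)
    return root

def x_nodes_with_left_child(node, depth=0):
    """Count x-level nodes that actually use their left child."""
    if node is None:
        return 0
    hit = 1 if depth % 2 == 0 and node.left is not None else 0
    return (hit + x_nodes_with_left_child(node.left, depth + 1)
                + x_nodes_with_left_child(node.right, depth + 1))

root = None
for y in random.sample(range(100), 20):     # twenty points, all with x = 0
    root = insert(root, (0, y))
print(x_nodes_with_left_child(root))        # always 0: every x-level node
                                            # has only a right child
```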

#### The "Goodrich and Tamassia" Idea(Recommended)
If the points' $x$ coordinates are duplicated, we distinguish those points by the $y$ coordinates. Well, if the $y$ coordinates are the same, they are the same points.
Thus, we have the new "order" of those points for $x$ coordinates:
$$
A < B < C < D < E < F < G < H < I.
$$

This tree is still balanced, and it is no different from a $1$-dimensional KD tree.
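A minimal sketch of this composite comparison, assuming points are $(x, y)$ tuples (the code and names here are my own illustration):

```python
def less_at_x_level(p, q):
    """Order points at an x-splitting level: compare x first, break ties
    with y. Python compares tuples lexicographically, which is exactly
    this rule. With all x equal, it reduces to the 1-d order on y."""
    return (p[0], p[1]) < (q[0], q[1])

points = [(0, y) for y in range(9)]          # A < B < ... < I, all with x = 0
assert all(less_at_x_level(points[i], points[i + 1]) for i in range(8))
```

Building on the median of this total order splits the points in half at every level, so the tree stays balanced.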
## Some Views from the Industry
KD trees are widely used in nearest-neighbor searching (including KD-tree insertion, deletion, and querying), and millions of people on Earth are using this algorithm :)
Here I refer to the two most widely used Python packages in data science (a quick sanity check with both follows this list):
- Sklearn. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.
- Scipy. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html.
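As a quick sanity check (my own experiment, not an example from either package's documentation), both libraries happily build and query a tree in which every point shares the same $x$-coordinate:

```python
import numpy as np
from scipy.spatial import KDTree as ScipyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree

pts = np.array([[0.0, y] for y in range(9)])     # all points have x = 0

d, i = ScipyKDTree(pts).query([0.0, 4.4], k=1)
print(i)                                         # 4, the index of (0, 4)

dist, ind = SklearnKDTree(pts).query([[0.0, 4.4]], k=1)
print(ind)                                       # same nearest neighbor
```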
Paper references:
- Sklearn: https://dl.acm.org/doi/10.1145/361002.361007
- Scipy: https://arxiv.org/abs/cs/9901013
The paper referenced by Sklearn proves that a random KD tree performs similarly to a random BST. Its way of handling equal $x$-coordinates is closer to the "CLRS" idea, and the package does not enforce the balanced-tree condition.
The paper referenced by Scipy is based on a balanced KD tree, which is closer to the version in our lecture notes. Still, it is not exactly the same: instead of splitting at the median, it uses something called the "sliding midpoint" rule to improve performance.
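Here is a minimal sketch of the sliding-midpoint split, written from the paper's description (this is not Scipy's actual source code). Note that with all $x$-coordinates equal, the cell has zero width in $x$, so the longest-side rule never even tries to cut on $x$:

```python
import numpy as np

def sliding_midpoint_split(pts, lo, hi):
    """Choose a split axis and value for the cell [lo, hi] containing pts:
    cut the longest side of the cell at its midpoint; if all points land
    on one side, slide the cut to the nearest point so the split is never
    trivial."""
    axis = int(np.argmax(hi - lo))           # longest side of the cell
    split = (lo[axis] + hi[axis]) / 2.0      # start at the geometric midpoint
    vals = pts[:, axis]
    if vals.min() > split:                   # everything right of the cut?
        split = vals.min()                   # slide right to the nearest point
    elif vals.max() < split:                 # everything left of the cut?
        split = vals.max()                   # slide left to the nearest point
    return axis, split

pts = np.array([[0.0, y] for y in range(9)])
axis, split = sliding_midpoint_split(pts, pts.min(axis=0), pts.max(axis=0))
print(axis, split)                           # axis 1 (y): x has zero extent
```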