# Bug in shuffle algorithm in DAPOL+ paper
## Original algorithm
The shuffle algorithm is used in the NDM-SMT. Excerpt from section 4.3.1:
> It turns out Durstenfeldβs shuffle algorithm optimized by a HashMap
can achieve the goal. The HashMap originally is empty. We start
from the first user and repeat the following steps for π times to
generate a random mapping. For the π-th user, randomly select a
number $π β [π, 2^π» ]$. If π exists as a key in the HashMap, the π-th user
is mapped to HashMap(π). Otherwise the user is mapped to π. At
the end of each iteration, update the HashMap by HashMap(π) = π.
In this algorithm, only π RNG operations are performed, and both
computation and memory complexities are π (π log π) in the worst
case if the HashMap is optimized by some balanced search tree.
The following pseudo-code formalizes the above algorithm.
Key:
- `n` is the number of users that need to be mapped to leaf nodes
- `x_coord` is the index of the leaf node (left-most x-coord is 0, right-most x-coord is `max_x_coord`)
- `user_mapping` is the result of the algorithm, where each user is given a leaf node index i.e.
![CodeCogsEqn](https://hackmd.io/_uploads/SyiyV9OV6.svg)
- `tracking_map` is used to determine which indices have been used
```
if n > max_x_coord throw error
user_mapping = new_map()
tracking_map = new_hash_map()
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k in tracking_map then set user_mapping[i] = tracking_map[k]
- else user_mapping[i] = k
- set tracking_map[k] = i
```
The assumptions here are that all users are mapped to a leaf node, and no 2 users share the same leaf node i.e. `user_mapping` is an injective function with domain equal to the entire user set.
However, there is a bug in the algorithm, which can be seen via the following example:
1. Suppose there are 3 users, `n=3`
2. **1st loop iteration (i=0)**
- Suppose `k=1`
- Then `user_mapping[i=0] = k = 1`
- And `tracking_map[k=1] = i = 0`
2. **2nd loop iteration (i=1)**
- Suppose `k=2`
- Then `user_mapping[i=1] = k = 2`
- And `tracking_map[k=2] = i = 1`
2. **3rd loop iteration (i=2)**
- Suppose `k=2`
- Then `user_mapping[i=2] = traking_map[k=2] = 1`
- And `tracking_map[k=2] = i = 2`
But now we have 2 users (`i=0` & `i=2`) mapped to index 1. The bug can be summarized as follows: `tracking_map` can form chains which cause multiple users to be mapped to the same index.
## Adjusted algorithm
One way to fix the bug is to traverse the chains to their end:
```
if n > max_x_coord throw error
user_mapping = new_map()
tracking_map = new_hash_map()
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k in tracking_map then set v = traking_map[k]
- while traking_map[v] exists: v = tracking_map[v]
- set user_mapping[i] = v
- else user_mapping[i] = k
- set tracking_map[k] = i
```
The following modification of the algorithm fixes the bug, and does not introduce worse runtime complexity since the inner while loop will run a max of `n` times because the longest chain that can be formed in the hashmap has length `n`.
The following code offers some proof (albeit not a mathematical one) that the inner loop does not execute more than `n` times. Basically, we create every combination of random numbers that would be used, and run the algorithm against each combination. We count the number of times the inner loop executes and print a statement if it exceeds `n` (which is 5 in this case). Since we looped over every possible combination of random numbers we can be sure that there is no chance the inner loop runs more than `n` times.
```rust
fn main() {
use std::collections::HashMap;
let max_x_coord = 10u64;
// let num_entities = 5;
// 5 loops so we can get all possible combinations of random number samples
for rand_num_1 in 1..=max_x_coord {
for rand_num_2 in 2..=max_x_coord {
for rand_num_3 in 3..=max_x_coord {
for rand_num_4 in 4..=max_x_coord {
for rand_num_5 in 5..=max_x_coord {
let mut loop_counter = 0u16;
let rng = [rand_num_1, rand_num_2, rand_num_3, rand_num_4, rand_num_5];
// execute the algorithm now
let mut hash_map = HashMap::<u64, u64>::new();
for i in 0..5 {
// pick random number in range [1, max_x_coord]
let k = rng[i];
let _user_mapping = match hash_map.get(&k) {
Some(mut existing_mapping) => {
// follow the full chain of linked numbers until we find the leaf
while hash_map.contains_key(existing_mapping) {
loop_counter += 1;
existing_mapping = hash_map.get(existing_mapping).unwrap();
}
*existing_mapping
}
None => k,
};
hash_map.insert(k, i as u64);
}
if loop_counter > 5 {
println!("O(n) exceeded; rng {:?}; loop_counter {}", rng, loop_counter);
}
}
}
}
}
}
}
```
## Another way to adjust the algorithm
```
user_mapping = new_hashmap()
for i in [0, n):
- user_mapping[i] = i
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k not in user_mapping then set user_mapping[k] = k
- swap(user_mapping[i], user_mapping[k])
```
## Another algorithm entirely
The above algorithms are biased (i.e. not uniformly random) since if there is a collision (random number is found in the hashmap) then entities are mapped to the first n spaces of the bottom of the tree, but the bottom of the tree has N total spaces which could be much larger than n (so the first n spaces are preferred by the algorithm).
A cleaner randomness can be gained by doing the following:
1. Use regular Durstenfeld's shuffle algorithm to shuffle entity list in place. This has O(n) time & space.
2. Use the algorithm outlined [here](https://stackoverflow.com/a/77309566) to add N spaces between the n entities (same problem as distributing N apples into n buckets). This also has O(n) time.
We now have a more efficient O(n) algorithm.