# Bug in shuffle algorithm in DAPOL+ paper

## Original algorithm

The shuffle algorithm is used in the NDM-SMT. Excerpt from section 4.3.1:

> It turns out Durstenfeld’s shuffle algorithm optimized by a HashMap can achieve the goal. The HashMap originally is empty. We start from the first user and repeat the following steps for 𝑛 times to generate a random mapping. For the 𝑖-th user, randomly select a number $π‘˜ ∈ [𝑖, 2^𝐻]$. If π‘˜ exists as a key in the HashMap, the 𝑖-th user is mapped to HashMap(π‘˜). Otherwise the user is mapped to π‘˜. At the end of each iteration, update the HashMap by HashMap(π‘˜) = 𝑖. In this algorithm, only 𝑛 RNG operations are performed, and both computation and memory complexities are 𝑂(𝑛 log 𝑛) in the worst case if the HashMap is optimized by some balanced search tree.

The following pseudo-code formalizes the above algorithm. Key:

- `n` is the number of users that need to be mapped to leaf nodes
- `x_coord` is the index of the leaf node (left-most x-coord is 0, right-most x-coord is `max_x_coord`)
- `user_mapping` is the result of the algorithm, where each user is given a leaf node index i.e. ![CodeCogsEqn](https://hackmd.io/_uploads/SyiyV9OV6.svg)
- `tracking_map` is used to determine which indices have been used

```
if n > max_x_coord throw error
user_mapping = new_map()
tracking_map = new_hash_map()
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k in tracking_map then set user_mapping[i] = tracking_map[k]
- else user_mapping[i] = k
- set tracking_map[k] = i
```

The assumptions here are that all users are mapped to a leaf node, and no 2 users share the same leaf node, i.e. `user_mapping` is an injective function with domain equal to the entire user set. However, there is a bug in the algorithm, which can be seen via the following example:

1. Suppose there are 3 users, `n=3`
2. **1st loop iteration (i=0)**
   - Suppose `k=1`
   - Then `user_mapping[i=0] = k = 1`
   - And `tracking_map[k=1] = i = 0`
3. **2nd loop iteration (i=1)**
   - Suppose `k=2`
   - Then `user_mapping[i=1] = k = 2`
   - And `tracking_map[k=2] = i = 1`
4. **3rd loop iteration (i=2)**
   - Suppose `k=2`
   - Then `user_mapping[i=2] = tracking_map[k=2] = 1`
   - And `tracking_map[k=2] = i = 2`

But now we have 2 users (`i=0` & `i=2`) mapped to index 1. The bug can be summarized as follows: `tracking_map` can form chains which cause multiple users to be mapped to the same index.

## Adjusted algorithm

One way to fix the bug is to traverse the chains to their end:

```
if n > max_x_coord throw error
user_mapping = new_map()
tracking_map = new_hash_map()
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k in tracking_map then set v = tracking_map[k]
  - while tracking_map[v] exists: v = tracking_map[v]
  - set user_mapping[i] = v
- else user_mapping[i] = k
- set tracking_map[k] = i
```

This modification fixes the bug without worsening the runtime complexity, since the inner while loop runs at most `n` times: the longest chain that can form in the hashmap has length `n`.

The exhaustive check below offers some evidence (albeit not a mathematical proof) that the inner loop does not execute more than `n` times. It enumerates every combination of random numbers that could be drawn, runs the algorithm against each combination, counts the number of times the inner loop executes, and prints a statement if the count exceeds `n` (which is 5 in this case). Since every possible combination of draws is covered, we can be sure that, for these parameters (`n = 5`, `max_x_coord = 10`), the inner loop never runs more than `n` times.
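Before that exhaustive check, here is a minimal runnable sketch of the adjusted algorithm, applied to the colliding draws `k = 1, 2, 2` from the example above. The names (`map_users`, `draws`) are illustrative, not from the paper, and the `draws` slice stands in for the RNG:

```rust
use std::collections::HashMap;

/// Sketch of the adjusted (chain-traversing) algorithm.
/// draws[i] plays the role of the random k drawn from [i, max_x_coord].
fn map_users(draws: &[u64]) -> HashMap<u64, u64> {
    let mut user_mapping: HashMap<u64, u64> = HashMap::new();
    let mut tracking_map: HashMap<u64, u64> = HashMap::new();
    for (i, &k) in draws.iter().enumerate() {
        let i = i as u64;
        assert!(k >= i, "each draw must come from [i, max_x_coord]");
        let leaf = match tracking_map.get(&k) {
            Some(&v0) => {
                // collision: follow the chain to its end
                let mut v = v0;
                while let Some(&next) = tracking_map.get(&v) {
                    v = next;
                }
                v
            }
            None => k,
        };
        user_mapping.insert(i, leaf);
        tracking_map.insert(k, i);
    }
    user_mapping
}

fn main() {
    // the colliding sequence from the example above: k = 1, 2, 2
    let mapping = map_users(&[1, 2, 2]);
    println!("{:?}", mapping); // distinct leaves: {0: 1, 1: 2, 2: 0} (in some order)
}
```

User 2's draw collides on `k=2`, the chain `2 → 1 → 0` is traversed to its end, and user 2 lands on the unused leaf 0, so the mapping stays injective.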
```rust
fn main() {
    use std::collections::HashMap;

    let max_x_coord = 10u64;
    // let num_entities = 5;

    // 5 nested loops so we can get all possible combinations of random number samples
    for rand_num_1 in 1..=max_x_coord {
        for rand_num_2 in 2..=max_x_coord {
            for rand_num_3 in 3..=max_x_coord {
                for rand_num_4 in 4..=max_x_coord {
                    for rand_num_5 in 5..=max_x_coord {
                        let mut loop_counter = 0u16;
                        let rng = [rand_num_1, rand_num_2, rand_num_3, rand_num_4, rand_num_5];

                        // execute the algorithm now
                        let mut hash_map = HashMap::<u64, u64>::new();
                        for i in 0..5 {
                            // the pre-enumerated "random" number for this
                            // iteration, in range [i+1, max_x_coord]
                            let k = rng[i];
                            let _user_mapping = match hash_map.get(&k) {
                                Some(mut existing_mapping) => {
                                    // follow the full chain of linked numbers
                                    // until we find the leaf
                                    while hash_map.contains_key(existing_mapping) {
                                        loop_counter += 1;
                                        existing_mapping = hash_map.get(existing_mapping).unwrap();
                                    }
                                    *existing_mapping
                                }
                                None => k,
                            };
                            hash_map.insert(k, i as u64);
                        }

                        if loop_counter > 5 {
                            println!("O(n) exceeded; rng {:?}; loop_counter {}", rng, loop_counter);
                        }
                    }
                }
            }
        }
    }
}
```

## Another way to adjust the algorithm

```
user_mapping = new_hashmap()
for i in [0, n):
- user_mapping[i] = i
for i in [0, n):
- pick random k in range [i, max_x_coord]
- if k not in user_mapping then set user_mapping[k] = k
- swap(user_mapping[i], user_mapping[k])
```

## Another algorithm entirely

The above algorithms are biased (i.e. not uniformly random): if there is a collision (the random number is found in the hashmap) then the entity is mapped to one of the first n spaces of the bottom layer of the tree, but the bottom layer has N total spaces, which could be much larger than n, so the first n spaces are preferred by the algorithm. Cleaner randomness can be gained by doing the following:

1. Use regular Durstenfeld's shuffle algorithm to shuffle the entity list in place. This has O(n) time & space complexity.
2. Use the algorithm outlined [here](https://stackoverflow.com/a/77309566) to add N spaces between the n entities (the same problem as distributing N apples into n buckets). This also has O(n) time complexity.

We now have a more efficient O(n) algorithm.
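For step 1, a minimal sketch of Durstenfeld's in-place shuffle. A toy xorshift PRNG is used as a stand-in so the snippet is self-contained; a real implementation would use a cryptographically secure RNG, and the gap-distribution step from the linked answer is not reproduced here:

```rust
// stand-in PRNG so the sketch needs no external crates
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

// Durstenfeld's in-place shuffle: walk from the last index down, swapping
// each element with a randomly chosen element at an index <= the current one.
fn durstenfeld_shuffle<T>(items: &mut [T], seed: u64) {
    let mut state = seed.max(1); // xorshift state must be non-zero
    for i in (1..items.len()).rev() {
        // note: the modulo introduces a slight bias; acceptable for a sketch
        let j = (xorshift(&mut state) % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

fn main() {
    let mut entities: Vec<u32> = (0..10).collect();
    durstenfeld_shuffle(&mut entities, 42);
    println!("{:?}", entities); // some permutation of 0..10
}
```

Each element is swapped at most once per iteration and one random draw is made per element, giving the O(n) time & space claimed above.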