--- tags: dev, r, statistics, distributed title: Time Collisions description: A demonstration that we cannot rely on uniqueness of time generated identifiers --- # Time Collisions In his awesome book "[Designing Data-Intensive Applications](https://dataintensive.net/)" :heart:, [Martin Kleppmann](https://martin.kleppmann.com/) explains and demonstrates the dangerousness of relying on timestamps to synchronize events (section "Relying on Synchronized Clocks", p.291). :clock10: Whoever uses timestamps to generate (expected) unique identifiers :id: will be exposed to the same issues. Even if it seems obvious, some people still refuse to admit it [^1]. :confused: In order to demonstrate that timestamp-based identifiers generation inevitably leads to collisions :collision:, I wrote a simple program in [R](https://cran.r-project.org/) beloved :heart_eyes: language. *The full code can be found at the [end of the post](#The-Code).* The exercise consisted in running **3 concurrent processes**, each generating **1 000 identifiers** using system time (with milliseconds precision). A random sleep time going from 1 to 2 seconds occured between each generation in order to simulate a **variable execution time**. ![](https://i.imgur.com/ofwqrbA.png) *Diagram [^3] made with [Excalidraw](https://excalidraw.com/). Use it, share it!* :smiley: *A similar real world application could be a serverless-one, with a message queue triggered function ([Amazon Lambda](https://aws.amazon.com/lambda/features/), [Azure Functions](https://azure.microsoft.com/en-us/services/functions/), or any of your choice).* *Multiple instances of this function will be silmutaneously executed on the numerous compute resources hosted by cloud provider infrastructure [^4].* And the verdict is... :tada: Collisions **do** happen, as shown on these graphics: ![](https://i.imgur.com/ZHlVcbI.png) ***3 times** in 20 minutes.* ![](https://i.imgur.com/8VN05bW.png) ***3 times** in 25 minutes.* I let you imagine the consequences depending on how such identifiers could be used... :flushed: As a conclusion, **never ever use timestamps** to generate identifiers. [^2] [^1]: Believe it or not, I witnessed this anti-pattern in some professional solutions. :tired_face: [^2]: Prefer solutions such as [**U**niversally **U**nique **ID**entifiers](https://www.uuidgenerator.net/). [^3]: <https://excalidraw.com/#json=5714824436645888,9uZWHFuCuM8B26PqtJPW9A> [^4]: On Azure Functions, the [consumption plan](https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#consumption-plan) has an [upper-limit](https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#service-limits) of 200 instances. > When you're using the Consumption plan, instances of the Azure Functions host are dynamically added and removed based on the number of incoming events. This serverless plan scales automatically. ## The Code ### job.R ```r library(tidyverse) library(parallel) # Number of random sleep values sleep_precision <- 1000 # Sleep duration in seconds sleep_min <- 1 sleep_max <- 2 # Number of values to generate values_count <- 1000 ## Identifiers generation function ---- get_values <- function(sleep_precision, values_count, sleep_min, sleep_max) { sleeps <- sample(sleep_precision, values_count, replace = T) / sleep_precision * sleep_max sleeps[sleeps < sleep_min] <- sleep_min sapply(sleeps, function(s) { Sys.sleep(s) # Identifier is generated here as.numeric(Sys.time()) * 1000 } ) } ## Run cluster ---- cores_count <- detectCores() - 1 cluster <- makeCluster(cores_count) process_start <- Sys.time() res <- clusterCall(cluster, get_values, sleep_precision, values_count, sleep_min, sleep_max) process_end <- Sys.time() stopCluster(cluster) ## Collect and save data ---- values <- tibble( value = c(res, recursive = T) ) attr(values, "sleep_min") <- sleep_min attr(values, "sleep_max") <- sleep_max attr(values, "values_count") <- values_count attr(values, "cores_count") <- cores_count attr(values, "duration") <- as.double(process_end) - as.double(process_start) save(values, file = paste0("data/TimeCollisions-", sleep_min, "-", sleep_max, "-", values_count, ".RData") ) ``` ### dataviz.R ```r library(tidyverse) ## Load data ---- load(file = "data/TimeCollisions-1-2-1000.RData") sleep_min <- attr(values, "sleep_min") sleep_max <- attr(values, "sleep_max") values_count <- attr(values, "values_count") cores_count <- attr(values, "cores_count") duration <- attr(values, "duration") ## Build metrics ---- values_agg <- values %>% group_by(value) %>% summarise(count = n()) %>% arrange(value) (collisions_max <- values_agg %>% summarise(max = max(count)) %>% pull()) # Global collision rate (collisions_rate <- (values_agg %>% filter(count > 1) %>% summarise(sum(count)) %>% pull()) / (values %>% nrow())) # Collisions by group collisions_rates <- values_agg %>% group_by(count) %>% summarise(c = n()) %>% transmute( group = count, count = c ) %>% ungroup() %>% mutate(rate = count / sum(count)) # Total number of collisions (collisions_count <- collisions_rates %>% filter(group > 1) %>% transmute(group_count = (group - 1) * count) %>% summarise(sum(group_count)) %>% pull()) ## Dataviz ---- # Base colour = "seagreen" values_colours <- c(rgb(46, 139, 87, alpha = 128, maxColorValue = 255), colorRampPalette(c("orange", "red"), alpha = T)(collisions_max - 1)) values_agg %>% ggplot() + geom_col(aes(1:(values_agg %>% nrow()), count, fill = factor(count)), show.legend = F, width = 1) + geom_point(aes(row, count, size = 2), shape = 8, color = "red", show.legend = F, data = values_agg %>% mutate(row = row_number()) %>% filter(count > 1)) + scale_fill_manual(values = values_colours) + scale_y_continuous( breaks = seq(1, collisions_max) ) + xlab("Identifiers") + ylab("Occurences") + ggtitle( "Time Collisions", paste0( "Collision rate of ", sprintf("%.02f%%", collisions_rate * 100), " (", collisions_count, " collisions) for ", prettyNum(as.integer(values_count), big.mark = " "), " identifiers (cores = ", cores_count, ", sleep time = ", sleep_min, "-", sleep_max, "s, total duration = ", sprintf("%.02f", duration), "s)." ) ) + theme( axis.ticks.x = element_blank(), axis.text.x = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank() ) ```