Time Collisions

In his awesome book "Designing Data-Intensive Applications"

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

, Martin Kleppmann explains and demonstrates the dangerousness of relying on timestamps to synchronize events (section "Relying on Synchronized Clocks", p.291).

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Whoever uses timestamps to generate (expected) unique identifiers

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

will be exposed to the same issues.

Even if it seems obvious, some people still refuse to admit it ^[1].

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

In order to demonstrate that timestamp-based identifiers generation inevitably leads to collisions

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

, I wrote a simple program in R beloved

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

language.

The full code can be found at the end of the post.

The exercise consisted in running 3 concurrent processes, each generating 1 000 identifiers using system time (with milliseconds precision).
A random sleep time going from 1 to 2 seconds occured between each generation in order to simulate a variable execution time.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Diagram ^[2] made with Excalidraw. Use it, share it!

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

A similar real world application could be a serverless-one, with a message queue triggered function (Amazon Lambda, Azure Functions, or any of your choice).
Multiple instances of this function will be silmutaneously executed on the numerous compute resources hosted by cloud provider infrastructure ^[3].

And the verdict is…

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Collisions do happen, as shown on these graphics:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

3 times in 20 minutes.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

3 times in 25 minutes.

I let you imagine the consequences depending on how such identifiers could be used…

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

As a conclusion, never ever use timestamps to generate identifiers. ^[4]

The Code

job.R

library(tidyverse)
library(parallel)

# Number of random sleep values
sleep_precision <- 1000
# Sleep duration in seconds
sleep_min <- 1
sleep_max <- 2
# Number of values to generate
values_count <- 1000

## Identifiers generation function ----
get_values <- function(sleep_precision, values_count, sleep_min, sleep_max) {
  sleeps <- sample(sleep_precision, values_count, replace = T) / sleep_precision * sleep_max
  sleeps[sleeps < sleep_min] <- sleep_min
  
  sapply(sleeps,
         function(s) {
           Sys.sleep(s)
           # Identifier is generated here
           as.numeric(Sys.time()) * 1000
        }
  )
}

## Run cluster ----
cores_count <- detectCores() - 1
cluster <- makeCluster(cores_count)

process_start <- Sys.time()
res <- clusterCall(cluster,
                   get_values,
                   sleep_precision,
                   values_count,
                   sleep_min,
                   sleep_max)
process_end <- Sys.time()

stopCluster(cluster)

## Collect and save data ----
values <- tibble(
  value = c(res, recursive = T)
)

attr(values, "sleep_min") <- sleep_min
attr(values, "sleep_max") <- sleep_max
attr(values, "values_count") <- values_count
attr(values, "cores_count") <- cores_count
attr(values, "duration") <-  as.double(process_end) - as.double(process_start)

save(values,
     file = paste0("data/TimeCollisions-", sleep_min, "-", sleep_max, "-", values_count, ".RData")
)

dataviz.R

library(tidyverse)

## Load data ----
load(file = "data/TimeCollisions-1-2-1000.RData")

sleep_min <- attr(values, "sleep_min")
sleep_max <- attr(values, "sleep_max")
values_count <- attr(values, "values_count")
cores_count <- attr(values, "cores_count")
duration <- attr(values, "duration")

## Build metrics ----
values_agg <- values %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  arrange(value)

(collisions_max <- values_agg %>% summarise(max = max(count)) %>% pull())
# Global collision rate
(collisions_rate <- (values_agg %>% filter(count > 1) %>% summarise(sum(count)) %>% pull()) / (values %>% nrow()))
# Collisions by group
collisions_rates <- values_agg %>%
  group_by(count) %>%
  summarise(c = n()) %>%
  transmute(
    group = count,
    count = c
  ) %>%
  ungroup() %>%
  mutate(rate = count / sum(count))
# Total number of collisions
(collisions_count <- collisions_rates %>%
  filter(group > 1) %>%
  transmute(group_count = (group - 1) * count) %>%
  summarise(sum(group_count)) %>%
  pull())

## Dataviz ----
# Base colour = "seagreen"
values_colours <- c(rgb(46, 139, 87, alpha = 128, maxColorValue = 255), colorRampPalette(c("orange", "red"), alpha = T)(collisions_max - 1))

values_agg %>%
  ggplot() +
  geom_col(aes(1:(values_agg %>% nrow()),
               count,
               fill = factor(count)),
           show.legend = F,
           width = 1) +
  geom_point(aes(row, count, size = 2),
             shape = 8,
             color = "red",
             show.legend = F,
             data = values_agg %>% mutate(row = row_number()) %>% filter(count > 1)) +
  scale_fill_manual(values = values_colours) +
  scale_y_continuous(
    breaks = seq(1, collisions_max)
  ) +
  xlab("Identifiers") +
  ylab("Occurences") +
  ggtitle(
    "Time Collisions",
    paste0(
      "Collision rate of ",
      sprintf("%.02f%%", collisions_rate * 100),
      " (",
      collisions_count,
      " collisions) for ",
      prettyNum(as.integer(values_count), big.mark = " "),
      " identifiers (cores = ",
      cores_count,
      ", sleep time = ",
      sleep_min,
      "-",
      sleep_max,
      "s, total duration = ",
      sprintf("%.02f", duration),
      "s)."
    )
  ) +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank()
  )

Believe it or not, I witnessed this anti-pattern in some professional solutions.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
↩︎
https://excalidraw.com/#json=5714824436645888,9uZWHFuCuM8B26PqtJPW9A ↩︎
On Azure Functions, the consumption plan has an upper-limit of 200 instances.

When you're using the Consumption plan, instances of the Azure Functions host are dynamically added and removed based on the number of incoming events. This serverless plan scales automatically.

↩︎
Prefer solutions such as Universally Unique IDentifiers. ↩︎

Time Collisions

The Code

job.R

dataviz.R

Read more

Hash Function Distribution

PHP array_map vs foreach

PHP, cURL and Memory Management