Try โ€‚โ€‰HackMD

Hash Function Distribution

After examining distributed systems sharding [1] strategies [2], I wanted to verify if hash computation follows a uniform distribution [3].

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

TL;DR: Yes it does.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

When it comes to solve such problems, R is my best friend.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

Pseudo code:

Generate a set of M random values
For each value
  Apply a hash function
  Apply a modulo of 10

N = 1
Repeat
  For each first N values
    Plot a histogram

  Increase N
Until N = M

The full code can be found at the end of the post.

An animated GIF is generated for each hash function, to make it more visual.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

Hash Functions

MD5 SHA1 SHA256
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

MD5

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

SHA1

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

SHA256

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

The Code

library(tidyverse)
library(digest)
library(animation)

item_max <- 500000
shard_count <- 10

lseq <- function(from, to, length.out) {
  exp(seq(log(from), log(to), length.out = length.out))
}

get.shard <- Vectorize(function(value, hash_algo) {
  digest2int(digest(value, hash_algo)) %% shard_count + 1
})

plot.hash <- function(shards, hash_algo, pass_count=100) {
  for (item_count in as.integer(lseq(1, shards %>% nrow(), pass_count))) {
    graph <- shards %>%
      head(item_count) %>%
      ggplot() +
      geom_bar(aes(shard), na.rm = T) +
      scale_x_continuous(
        n.breaks = shard_count,
        limits = c(0, shard_count + 1),
        labels = c("", 1:shard_count, "")
      ) +
      xlab("Shard") +
      ylab("Count") +
      ggtitle(
        paste0("Hash Distribution (", hash_algo, ")"),
        paste0(
          hash_algo,
          " hash function distribution for ",
          prettyNum(as.integer(item_count), big.mark = " "),
          " values."
        )
      ) +
      theme(
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()
      )
    
    print(graph)
  }
}

shards <- tibble(
  value = runif(item_max)
) %>%
  transmute(
    md5 = get.shard(value, "md5"),
    sha1 = get.shard(value, "sha1"),
    sha256 = get.shard(value, "sha256")
  )

for (hash_algo in names(shards)) {
  filename <- paste0("HashDistribution-", hash_algo, ".gif")
  unlink(filename)
  saveGIF(
    plot.hash(shards %>% select(shard = all_of(hash_algo)), hash_algo),
    movie.name = filename,
    interval = .2,
    ani.width = 600,
    ani.height = 600,
    autobrowse = F
  )
}

  1. Sharding pattern โ†ฉ๏ธŽ

  2. 4 Data Sharding Strategies for Distributed SQL Analyzed. โ†ฉ๏ธŽ

  3. Hash function Wikipedia article states that "every hash value in the output range should be generated with roughly the same probability". โ†ฉ๏ธŽ