--- tags: dev, r, statistics title: Hash Function Distribution description: Monte-Carlo simulation to verify uniform distribution of the modulo of a hash function --- # Hash Function Distribution After examining distributed systems [sharding](https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding) [^1] [strategies](https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql) [^2], I wanted to verify if hash computation follows a [uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)) [^3]. :nerd_face: TL;DR: Yes it does. :sweat_smile: When it comes to solve such problems, [R](https://cran.r-project.org/) is my best friend. :heart_eyes: Pseudo code: ``` Generate a set of M random values For each value Apply a hash function Apply a modulo of 10 N = 1 Repeat For each first N values Plot a histogram Increase N Until N = M ``` *The full code can be found at the [end of the post](#The-Code).* An animated GIF is generated for each [hash function](#Hash-Functions), to make it more visual. :movie_camera: [^1]: [Sharding pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding) [^2]: [4 Data Sharding Strategies for Distributed SQL Analyzed](https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql). [^3]: Hash function [Wikipedia article](https://en.wikipedia.org/wiki/Hash_function#Uniformity) states that "every hash value in the output range should be generated with roughly the same probability". ## Hash Functions | [MD5](#MD5) | [SHA1](#SHA1) | [SHA256](#SHA256) | |:-:|:-:|:-:| | ![md5](https://i.imgur.com/HEVY2wT.gif) | ![sha1](https://i.imgur.com/1fcxqJr.gif) | ![sha256](https://i.imgur.com/zAvqcHw.gif) | ### MD5 ![md5](https://i.imgur.com/HEVY2wT.gif) ### SHA1 ![sha1](https://i.imgur.com/1fcxqJr.gif) ### SHA256 ![sha256](https://i.imgur.com/zAvqcHw.gif) ## The Code ```r library(tidyverse) library(digest) library(animation) item_max <- 500000 shard_count <- 10 lseq <- function(from, to, length.out) { exp(seq(log(from), log(to), length.out = length.out)) } get.shard <- Vectorize(function(value, hash_algo) { digest2int(digest(value, hash_algo)) %% shard_count + 1 }) plot.hash <- function(shards, hash_algo, pass_count=100) { for (item_count in as.integer(lseq(1, shards %>% nrow(), pass_count))) { graph <- shards %>% head(item_count) %>% ggplot() + geom_bar(aes(shard), na.rm = T) + scale_x_continuous( n.breaks = shard_count, limits = c(0, shard_count + 1), labels = c("", 1:shard_count, "") ) + xlab("Shard") + ylab("Count") + ggtitle( paste0("Hash Distribution (", hash_algo, ")"), paste0( hash_algo, " hash function distribution for ", prettyNum(as.integer(item_count), big.mark = " "), " values." ) ) + theme( axis.ticks.y = element_blank(), axis.text.y = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank() ) print(graph) } } shards <- tibble( value = runif(item_max) ) %>% transmute( md5 = get.shard(value, "md5"), sha1 = get.shard(value, "sha1"), sha256 = get.shard(value, "sha256") ) for (hash_algo in names(shards)) { filename <- paste0("HashDistribution-", hash_algo, ".gif") unlink(filename) saveGIF( plot.hash(shards %>% select(shard = all_of(hash_algo)), hash_algo), movie.name = filename, interval = .2, ani.width = 600, ani.height = 600, autobrowse = F ) } ```