---
tags: dev, r, statistics
title: Hash Function Distribution
description: Monte-Carlo simulation to verify uniform distribution of the modulo of a hash function
---
# Hash Function Distribution
After examining distributed systems [sharding](https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding) [^1] [strategies](https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql) [^2], I wanted to verify if hash computation follows a [uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)) [^3]. :nerd_face:
TL;DR: Yes it does. :sweat_smile:
When it comes to solve such problems, [R](https://cran.r-project.org/) is my best friend. :heart_eyes:
Pseudo code:
```
Generate a set of M random values
For each value
Apply a hash function
Apply a modulo of 10
N = 1
Repeat
For each first N values
Plot a histogram
Increase N
Until N = M
```
*The full code can be found at the [end of the post](#The-Code).*
An animated GIF is generated for each [hash function](#Hash-Functions), to make it more visual. :movie_camera:
[^1]: [Sharding pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding)
[^2]: [4 Data Sharding Strategies for Distributed SQL Analyzed](https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql).
[^3]: Hash function [Wikipedia article](https://en.wikipedia.org/wiki/Hash_function#Uniformity) states that "every hash value in the output range should be generated with roughly the same probability".
## Hash Functions
| [MD5](#MD5) | [SHA1](#SHA1) | [SHA256](#SHA256) |
|:-:|:-:|:-:|
| ![md5](https://i.imgur.com/HEVY2wT.gif) | ![sha1](https://i.imgur.com/1fcxqJr.gif) | ![sha256](https://i.imgur.com/zAvqcHw.gif) |
### MD5
![md5](https://i.imgur.com/HEVY2wT.gif)
### SHA1
![sha1](https://i.imgur.com/1fcxqJr.gif)
### SHA256
![sha256](https://i.imgur.com/zAvqcHw.gif)
## The Code
```r
library(tidyverse)
library(digest)
library(animation)
item_max <- 500000
shard_count <- 10
lseq <- function(from, to, length.out) {
exp(seq(log(from), log(to), length.out = length.out))
}
get.shard <- Vectorize(function(value, hash_algo) {
digest2int(digest(value, hash_algo)) %% shard_count + 1
})
plot.hash <- function(shards, hash_algo, pass_count=100) {
for (item_count in as.integer(lseq(1, shards %>% nrow(), pass_count))) {
graph <- shards %>%
head(item_count) %>%
ggplot() +
geom_bar(aes(shard), na.rm = T) +
scale_x_continuous(
n.breaks = shard_count,
limits = c(0, shard_count + 1),
labels = c("", 1:shard_count, "")
) +
xlab("Shard") +
ylab("Count") +
ggtitle(
paste0("Hash Distribution (", hash_algo, ")"),
paste0(
hash_algo,
" hash function distribution for ",
prettyNum(as.integer(item_count), big.mark = " "),
" values."
)
) +
theme(
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
print(graph)
}
}
shards <- tibble(
value = runif(item_max)
) %>%
transmute(
md5 = get.shard(value, "md5"),
sha1 = get.shard(value, "sha1"),
sha256 = get.shard(value, "sha256")
)
for (hash_algo in names(shards)) {
filename <- paste0("HashDistribution-", hash_algo, ".gif")
unlink(filename)
saveGIF(
plot.hash(shards %>% select(shard = all_of(hash_algo)), hash_algo),
movie.name = filename,
interval = .2,
ani.width = 600,
ani.height = 600,
autobrowse = F
)
}
```