contributed by < `MicahDoo` >
First, we zoom in on the function where we need to fill in the blanks: `hash()`, which inserts a `hash_key` node into the table.
- `n` points to the list node contained in the `hash_key` node (taking its address with `&` lets us use the `->` operator on it).
- `first` is the first node in the "bucket list" (no pun intended), `NULL` if the bucket is empty.
- `first->pprev` must point to the pointer to the new node `n`'s `next` variable, i.e. `&n->next`.
- `h->first` must point to the new first node.

From the graphs below we can easily see the missing links (in turquoise):
AAA: `n->next = first`
BBB: `n->pprev = &h->first`
Fixed Code:
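Filled in, the function looks roughly like this; a sketch modeled on the kernel's `hlist_add_head()`, with minimal struct definitions added for context:

```c
struct hlist_node {
    struct hlist_node *next, **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

static inline void hash_add(struct hlist_head *h, struct hlist_node *n)
{
    struct hlist_node *first = h->first;
    n->next = first;              /* AAA: n goes in front of the old first node */
    if (first)
        first->pprev = &n->next;  /* old first now hangs off n's next field */
    h->first = n;                 /* the bucket starts at n */
    n->pprev = &h->first;         /* BBB: n's pprev points back into the bucket head */
}
```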
Note that `n->next` is accessed in between, but only its "address" (`&n->next`) is taken, not its value.
TODO
TODO
What is `GOLDEN_RATIO_PRIME` used for in tools/include/linux/hash.h?
TODO: Write a Chinese version of this section.
In short, the golden ratio ensures an optimally efficient distribution of the hash values: it makes efficient use of the hash space and spreads the outputs evenly.
To illustrate that, I will stray off a bit and draw from real life examples.
Let's start with a simple Seed Distribution problem:
Suppose I have a fairly big number of seeds I want to arrange. I start from the center and put the next seed to the right side of it:
Let's say I am a stupid robot. I can only remember one rule for deciding where to place the next seed:
Make $n$ of a turn around the center from the current location. If a seed already occupies that location, push it out radially (along the radius).
Note: $n$ should be constant throughout the process, or it wouldn't be just one rule.
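A minimal sketch of such a robot in C; this is my own toy model, in which seed $k$ sits at $k \times n$ turns and radius $\sqrt{k}$, so earlier seeds are pushed outwards rather than overlapped (plotting the printed coordinates is left to gnuplot or similar):

```c
#include <math.h>
#include <stdio.h>

static void place_seeds(double n, int count)
{
    const double tau = 8 * atan(1); /* one full turn, 2 * pi */
    for (int k = 0; k < count; k++) {
        double angle = tau * n * k; /* k turns of n each */
        double radius = sqrt(k);    /* pushed out radially */
        printf("%f %f\n", radius * cos(angle), radius * sin(angle));
    }
}

int main(void)
{
    place_seeds(1.0 / 3, 500); /* n = 1/3: three spokes appear */
    return 0;
}
```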
Let's plug a few values of $n$ into this program and see what happens.
$n = 1$: I make one full turn before I put down the next seed. That is, I turn 360° before placing the next seed.
(Trivially, $n = 0$, i.e. making no turn, will result in the same configuration.)
We can see that the seeds all line up in the same direction, along a single line.
That is not an efficient use of space.
With $n = \frac{1}{3}$, the resulting arrangement has 3 spokes.
$n = \frac{1}{10}$ results in 10 spokes.
We can see that it looks similar to the $\frac{1}{10}$ arrangement but with the spokes curved a little. That's because every turn nudges a bit further than the $\frac{1}{10}$ turn does.
The spokes curve even more.
From the above experiments we can come up with a few observations:
When $n$ equals or is approximately a rational number $\frac{p}{q}$, the spoke pattern is clear (with $q$ spokes), and a fair amount of the space between the spokes is wasted.
Let's turn our attention now to irrational numbers.
The reason the number of spokes changes has to do with the size of the seeds: towards the center, there isn't room to fit 106 seeds around a circle without pushing some out, so only the 7-spoke pattern shows there.
In fact, if we continue putting more seeds and let the 7 spokes keep growing, the spokes' curvature increases and another spoke pattern occurs (this time, the number of spokes is 106).
The spoke counts, outwards from the center:
1 ($\pi \approx \frac{3}{1}$)
7 ($\pi \approx \frac{22}{7}$)
106 ($\pi \approx \frac{333}{106}$)
113 ($\pi \approx \frac{355}{113}$)
The reason it takes so long to go from 7 spokes to 106 but so little time to go from 106 to 113:
starting from 7, the pattern has to travel far enough from the center to reach a circumference big enough to fit all 106 spokes, but going from 106 to 113 doesn't take much because the two numbers are really close. That's why the 106-spoke pattern meshes and merges into the 113-spoke pattern quite nicely, and there seems to be less space wasted around that area.
Even with numbers that look pretty irrational, there can still be quite some loss of space, because many of them are easily approximated by a few rational numbers. That results in spoke patterns that grow outwards, which wastes space, especially if a pattern persists for a while.
To know how easily an irrational number can be approximated by rational numbers, we first have to know how to approximate an irrational number by rational numbers.
To approximate an irrational number by rational numbers, we call on the method of Best Approximations by Continued Fractions:
$3 + \cfrac{1}{7} = \frac{22}{7}$ → 7 spokes
$3 + \cfrac{1}{7 + \cfrac{1}{15}} = \frac{333}{106}$ → 106 spokes
$3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1}}} = \frac{355}{113}$ → 113 spokes
$3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292}}}} = \frac{103993}{33102}$ → 33102 spokes
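These truncations are exactly the continued-fraction convergents of $\pi$, and they can be generated mechanically from the partial quotients $[3; 7, 15, 1, 292]$ with the standard recurrence $p_k = a_k p_{k-1} + p_{k-2}$, $q_k = a_k q_{k-1} + q_{k-2}$; a small sketch:

```c
#include <stdio.h>

int main(void)
{
    const long a[] = {3, 7, 15, 1, 292}; /* partial quotients of pi */
    long p_prev = 1, p = a[0];           /* p(-1) = 1, p(0) = a0 */
    long q_prev = 0, q = 1;              /* q(-1) = 0, q(0) = 1  */

    printf("%ld/%ld\n", p, q);
    for (int k = 1; k < 5; k++) {
        long p_next = a[k] * p + p_prev;
        long q_next = a[k] * q + q_prev;
        p_prev = p, p = p_next;
        q_prev = q, q = q_next;
        /* prints 22/7, 333/106, 355/113, 103993/33102 */
        printf("%ld/%ld\n", p, q);
    }
    return 0;
}
```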
Recall previous observations:
Going along the observations, we can predict a colossal waste of space as the pattern with $\pi$ grows from 113 spokes to 33102 spokes: the difference between the two numbers is simply too big!
Also, if we inspect their corresponding continued fractions closely, especially the red parts:
$\cfrac{1}{15}$ → Big jump from 7 to 106.
$\cfrac{1}{1}$ → Small jump from 106 to 113.
$\cfrac{1}{292}$ → HUGE jump from 113 to 33102.
We can observe that the closer the proper fraction we add to the denominator (the red part) is to $1$, the smaller the increase from the previous spoke count to the next.
Such an observation tells us that a number that looks like this might be ideal:
$$1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cdots}}}$$
We can see that all the new fractions added to the denominator in each iteration are $\cfrac{1}{1 + \cdots}$, each staying as close to $1$ as a continued fraction allows.
Call this number $x$. The denominator under the leading $1$ is again $x$ itself, so we solve the equation $x = 1 + \frac{1}{x}$ for $x$: it rearranges to $x^2 = x + 1$, and the positive root
$$x = \frac{1 + \sqrt{5}}{2} \approx 1.618\ldots$$
is the golden ratio $\phi$.
Now, let's plug the given golden ratio into the spiral generator:
The golden ratio results in the perfect arrangement: efficient use of space and an even distribution.
It's even hard to tell how many spokes there are at a given distance from the center, because they mesh quite tightly and the transitions are so smooth.
Let's examine how the number of spokes changes arithmetically:
$1 + \cfrac{1}{1} = \frac{2}{1}$ → 1 spoke
$1 + \cfrac{1}{1 + \cfrac{1}{1}} = \frac{3}{2}$ → 2 spokes
$1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1}}} = \frac{5}{3}$ → 3 spokes
$1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1}}}} = \frac{8}{5}$ → 5 spokes
$1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1}}}}} = \frac{13}{8}$ → 8 spokes
The number of spokes grows slowly.
In fact, it matches the Fibonacci Sequence.
That is why hashing using the golden ratio is called Fibonacci Hashing.
Seeds in the sunflower are arranged so as to take maximum advantage of the limited space.
The pattern looks exactly the same as the pattern we derived with the golden ratio above.
A plant often grows in a way that spreads out its leaves to catch the most sunlight.
The golden ratio helps achieve that.
Plants tend to be fairly consistent when it comes to the angle between one leaf and the next. That angle, not surprisingly, is the same angle with which we arranged the seeds in the previous section.
In fact, the golden ratio is often not only useful but also inevitable in general. Watch the following video series by Vihart to find out:
Doodling in Math: Spirals, Fibonacci, and Being a Plant [1 of 3]
Doodling in Math: Spirals, Fibonacci, and Being a Plant [2 of 3]
Doodling in Math: Spirals, Fibonacci, and Being a Plant [3 of 3]
With that, we come back to why and how `GOLDEN_RATIO_PRIME` is used for hashing.
As shown in the above exploration, the golden ratio is perfect at taking input data and scattering them into efficiently distributed numbers:
In computers, we draw an analogy between rotation and modular arithmetic: the key plays the number of turns, and the angle plays the hash value. This would roughly be:
$$hash(key) = \left\lfloor m \times \left( key \times \frac{1}{\phi} \bmod 1 \right) \right\rfloor$$
where $m$ is the number of hash buckets.
In hashing, though, we often deal with discrete integer values based on the size of the hash table, so the hash value must repeat itself at some point (the pigeonhole principle).
So here we make use of another property of the output:
The higher (more significant) digits/bits are subject to carries from the lower digits/bits. In other words, a change in even the lowest bit can propagate all the way up to the highest bit. This sort of chain reaction can travel arbitrarily far in some cases.
Imagine a series of multiplications:
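A toy illustration of my own, using 158 (an 8-bit stand-in for the golden-ratio constants discussed below) as the multiplier; note how a mere +1 in the key rewrites the high bits:

```c
#include <stdio.h>

int main(void)
{
    for (unsigned key = 1; key <= 4; key++) {
        unsigned char h = key * 158; /* wraps modulo 256 */
        for (int b = 7; b >= 0; b--)
            putchar('0' + ((h >> b) & 1));
        printf("  (key = %u)\n", key); /* 10011110, 00111100, 11011010, ... */
    }
    return 0;
}
```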
With that in mind we change our function to something like this:
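A sketch of the idea (not the kernel's exact code): multiply by $\lfloor 2^{32}/\phi \rfloor$, let the product wrap around the word size (the wrap plays the role of "mod 1" in the rotation analogy), and keep the top `bits` bits:

```c
#include <stdint.h>

/* valid for 1 <= bits <= 32 */
static inline uint32_t fib_hash32(uint32_t key, unsigned bits)
{
    /* 2654435769 = floor(2^32 / phi) */
    return (uint32_t)(key * 2654435769u) >> (32 - bits);
}
```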
That is what we call Fibonacci Hashing.
In fact, the Linux kernel implements a version of Fibonacci Hashing in include/linux/hash.h:
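The relevant lines, lightly trimmed from include/linux/hash.h in recent kernels:

```c
#define GOLDEN_RATIO_32 0x61C88647
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

static inline u32 __hash_32(u32 val)
{
	return val * GOLDEN_RATIO_32;
}

static inline u32 hash_32(u32 val, unsigned int bits)
{
	/* High bits are more random, so use them. */
	return __hash_32(val) >> (32 - bits);
}
```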
- `GOLDEN_RATIO_32`: 0x61C88647 = 1640531527 (unsigned)
- `GOLDEN_RATIO_64`: 0x61C8864680B583EBull ≈ 7.0460293 × 10^18 (unsigned)

`GOLDEN_RATIO_PRIME` can be either of them, depending on the word length. Both come from dividing the full range of their word size by the golden ratio: each constant is the two's-complement negation of $\lfloor 2^N / \phi \rfloor$ (for example, $2^{32} - \mathtt{0x9E3779B9} = \mathtt{0x61C88647}$), and multiplying by the negation scatters just as well.
Since $\mathtt{0x61C88647} \approx \frac{2^{32}}{\phi^2}$ and $\mathtt{0x61C8864680B583EB} \approx \frac{2^{64}}{\phi^2}$, we can see multiplying by `GOLDEN_RATIO_xx` as multiplying by $\frac{1}{\phi^2}$ of the full word range; and $\frac{1}{\phi^2} = 1 - \frac{1}{\phi}$ is just as hard to approximate by rationals as $\frac{1}{\phi}$ itself.
Notice the comment `/* High bits are more random, so use them. */`, which is in line with our observation above.
Following the code above, inlining functions where possible, a 32-bit hash in Linux is basically `(key * GOLDEN_RATIO_32) >> (32 - bits)`.
Transformed into MathJax, letting $g = \mathtt{GOLDEN\_RATIO\_32}$:
$$hash(key, bits) = \frac{key \times g}{2^{32 - bits}}$$
…which is not entirely accurate, because with integer-typed values there are overflow and truncation:
$$hash(key, bits) = \left\lfloor \frac{(key \times g) \bmod 2^{32}}{2^{32 - bits}} \right\rfloor$$
Compare that to the hash formula we had above (with the bucket count $m = 2^{bits}$):
$$hash(key) = \left\lfloor 2^{bits} \times \left( key \times \frac{1}{\phi} \bmod 1 \right) \right\rfloor$$
It might not be immediately clear that they are doing the same thing, so here is a short walkthrough:
Let's assume an integer word length of 8 bits (instead of 32 or 64) for simplicity.
$\frac{1}{\phi} = 0.6180\ldots \approx 0.10011110_2$ in binary, to the 8th bit after the binary point; as an 8-bit integer constant, that is $10011110_2 = 158$.
Let's apply the following formula with a key $key$ and a hash size of $2^{bits}$:
$$hash(key) = \left\lfloor 2^{bits} \times \left( key \times \tfrac{158}{256} \bmod 1 \right) \right\rfloor$$
Now we plug the same values into the Linux version of the formula:
$$hash(key) = \left( (key \times 158) \bmod 2^8 \right) \gg (8 - bits)$$
…which gives the same result! Multiplying by the 8-bit constant and wrapping modulo $2^8$ keeps exactly the first 8 fractional bits of $key \times \frac{1}{\phi}$, and shifting right by $8 - bits$ is the floor of scaling by $2^{bits}$.
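A quick sanity check in C; the key 100 and `bits = 4` are arbitrary choices of mine, not values from the original walkthrough:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned key = 100, bits = 4;

    /* textbook version: fractional part of key * (1/phi), with 1/phi
     * truncated to 8 binary digits (158/256), scaled up to 2^bits */
    double frac = (key * 158 % 256) / 256.0;
    unsigned textbook = (unsigned)(frac * (1u << bits));

    /* Linux-style version: 8-bit multiply that wraps around,
     * then keep the high `bits` bits */
    unsigned linux_style = (uint8_t)(key * 158) >> (8 - bits);

    printf("%u %u\n", textbook, linux_style); /* both print 11 */
    return 0;
}
```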
The major difference between the two: the textbook version goes through real-number multiplication, while the Linux version does everything in integers and lets the product wrap around the word size.
The Linux version has a few advantages:
- It multiplies `int`-sized values only, which is decidedly faster than `float` or `double` multiplications.
- Since very rarely does anyone need a hash table of more than $2^{32}$ buckets, Linux does not have a hash function that maps a 64-bit hash key to a 64-bit hash value. (Source)
Nevertheless, Linux does have a mechanism that maps a 64-bit hash key to a 32-bit hash value, which is the function below:
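As of recent kernels, the generic version reads:

```c
static __always_inline u32 hash_64_generic(u64 val, unsigned int bits)
{
#if BITS_PER_LONG == 64
	/* 64x64-bit multiply is efficient on all 64-bit processors */
	return val * GOLDEN_RATIO_64 >> (64 - bits);
#else
	/* Hash 64 bits using only 32x32-bit multiply. */
	return hash_32((u32)val ^ __hash_32(val >> 32), bits);
#endif
}
```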
When `BITS_PER_LONG == 64`, the function does the hashing normally, in 64-bit operations; the shift by `64 - bits` keeps only the top `bits` bits of the product (the low bits are discarded), so the result fits in the returned 32-bit integer.
When `BITS_PER_LONG == 32`, the function takes advantage of the fact that `XOR` causes no information loss: it multiplies the upper 32 bits of `val` by `GOLDEN_RATIO_32` (via `__hash_32()`), performs `XOR` on the result and the lower half of `val`, and feeds that into the 32-bit hash function.

QUESTIONS TO RESOLVE:
- Does returning a `u64` value when the return type is `u32` raise an error, or does the compiler do the conversion automatically?
- How is a `u64` value handled when `BITS_PER_LONG == 32`? What exactly does `BITS_PER_LONG == 32` mean?

The golden ratio is a nice way to randomly spread out, or efficiently distribute, something with one simple rule (e.g. an angle, a multiplier).
It allows us to go from simple to complex and random with only one carefully picked rule.
First we define a run in a sequence of values to be a maximal subsequence consisting entirely of the same value. (Sometimes a run is defined to have length at least two, but for the purpose of this explanation, we drop that requirement.)
Consider the following sequence:
111116777344
The runs in this sequence would be 11111, 6, 777, 3 and 44.
Run as a simple word in English has a myriad of meanings. In fact, according to NPR, run has in total 645 meanings in the Oxford English Dictionary. (See here and here.)
The field of Computer Science is no exception.
In the context of sequences, for example, run can also mean a maximal non-decreasing subsequence of values. When we consider the sequence 123234563345678234, the runs by this definition would be 123, 23456, 3345678 and 234.
A run can also mean an instance of the execution of a program or other task by a computer.
Now let's examine the function:
The code here does one simple thing: take the head of a run and inspect it:
- If the run has length two or more, return the first non-duplicate element after it.
- If the run has length one (meaning the element is not duplicated), make it point to the next non-duplicate element, and return the element itself.
Consider the following initial configuration:
The program finds that 4 isn't a duplicate, so the following code is executed:
This will make 4 point to the next non-duplicate.
But before that, the function is called on `head->next` to find the next non-duplicate.
In the following graph, there is a second `head`, which is the argument supplied to the second function call, `deleteDuplicates(head->next)`:
The program then finds that 6 is a duplicate, so `head` points to the next element.
In the while loop, the program finds that there are more 6 duplicates:
Yet one more:
The program finds that the next element isn't 6 anymore, so it calls the function on the next element, `head->next`, to find the next non-duplicate.
9 has a duplicate, head points to the next element:
9 has no more duplicates. The function is called again:
13 has no duplicates. Another function call:
Since `head` now points to `NULL`, the current function returns `NULL`.
13 now points to the return value `NULL`, which happens to be the same as before:
On the next line, the `head` is returned.
The function that handles 9 returns the same element.
The function that handles 6 also returns the element 13:
The function for 4 takes the return value and points to it:
This results in the removal of all nodes between `head` (4) and the returned node (13):
From the above demonstration, we know that to decide whether a head holds a duplicated value, we have to check:
- that `head->next` is not `NULL` (e.g. the node with value 13 won't pass this), and
- that `head->next->value` equals `head->value`.
So `COND1` should be: `head->next && head->next->value == head->value`
Note that `head->next` should be the first condition to check: if `head->next == NULL`, the first operand of `&&` is false and the second is never evaluated, which avoids a segmentation fault from accessing `head->next->value`.
Checking whether there are more duplicates in a run requires basically the same conditions:
- that `head->next` is not `NULL`, and
- that the next value still matches the current one.
So `COND2`, as with `COND1`, is also: `head->next && head->next->value == head->value`
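Putting the pieces together, the whole function looks roughly like this (the field name `value` follows this write-up; LeetCode's own definition calls it `val`):

```c
struct ListNode {
    int value;
    struct ListNode *next;
};

struct ListNode *deleteDuplicates(struct ListNode *head)
{
    if (!head)
        return NULL;
    if (head->next && head->next->value == head->value) {      /* COND1 */
        /* walk to the last node of the run of equal values */
        while (head->next && head->next->value == head->value) /* COND2 */
            head = head->next;
        /* drop the last duplicate too; recurse on what follows */
        return deleteDuplicates(head->next);
    }
    head->next = deleteDuplicates(head->next);
    return head;
}
```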
Following the program flow for the above code, we can see that the recursive functions are called from left to right and they return from right to left.
To make an iterative version, the flow has to go from left to right and end there.
My mind first went to how many pointers to `struct ListNode` I would have to maintain to achieve that. This was my conclusion:
- a pointer `head` that keeps track of the new head;
- a pointer `tmp` that traverses the list from where `head` is and removes nodes according to the requirements, following `tmp->next` until the next run is reached.
In the first try, `tmp` works by pointing to the last non-duplicate element and repeatedly removing the "next duplicate element". Since the head has no element before it, there is no way to delete the head the same way `tmp` deletes other nodes, which is why the two cases are handled in different loops.
However, if I add a sentinel node `dummy` in front of the head and use a pointer `pred` (predecessor) the same way we used `tmp` in the first try, we can do it in one pass. Note that the return value would be `dummy->next`:
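A sketch of that one-pass version, with the same `struct ListNode` as above:

```c
struct ListNode *removeDuplicatesOnePass(struct ListNode *head)
{
    struct ListNode dummy = {.value = 0, .next = head};
    struct ListNode *pred = &dummy;

    while (head) {
        if (head->next && head->next->value == head->value) {
            /* head starts a run: walk to its last node */
            while (head->next && head->next->value == head->value)
                head = head->next;
            pred->next = head->next; /* unlink the whole run; pred stays */
        } else {
            pred = pred->next;       /* head is unique; pred may advance */
        }
        head = head->next;
    }
    return dummy.next;
}
```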
Keeping in mind what I learned from "linked list 和非連續記憶體", I thought about using a pointer to a pointer to allow deleting elements without accessing the previous element, which obviates the need to create a sentinel/dummy node.
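In its simplest, singly linked form, the pointer-to-pointer idea would look like this sketch: `indirect` always holds the address of the pointer that leads to the current node, so rewriting `*indirect` deletes without ever touching the previous node.

```c
struct ListNode *removeDuplicatesIndirect(struct ListNode *head)
{
    struct ListNode **indirect = &head;

    while (*indirect) {
        struct ListNode *node = *indirect;
        if (node->next && node->next->value == node->value) {
            while (node->next && node->next->value == node->value)
                node = node->next;
            /* whoever pointed at the run's first node now points past it */
            *indirect = node->next;
        } else {
            indirect = &node->next;
        }
    }
    return head;
}
```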
First assume the following container structure:
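Presumably something of this shape (the member names are my guess from the description):

```c
struct ListNode {
    int value;
    struct list_head list; /* embedded node from list.h */
};
```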
… where `list_head` comes from the `list.h` library provided in the course.
Most of the code is the same, except:
- The list is now doubly linked, so each deletion must update both `->next` and `->next->prev` (the `prev` links).
- The original LeetCode problem seems to have confused remove with delete. Since this implementation doesn't deal with freeing memory, I changed the function name to `removeDuplicates`.
LRU stores the head of every list used by the cache:
- `capacity`: the size of this cache.
- `count`: the current number of `LRUNode`s in the cache.
- `dhead`: an embedded list node that lets you traverse all the nodes in the cache. The order also reflects how long it has been since a node was accessed: the further to the right, the longer it has been.
- `hheads[]`: a hash table that allows for quick access.

Create a cache of size `capacity`.
Free the whole cache including all its nodes.
Since it needs to traverse `&obj->dhead`, MMM1 = `list_for_each_entry_safe`.
Why "safe"?
Find the cached node in the hash table. Since we need to traverse the hash bucket, MMM2 = `list_for_each_entry`.
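A sketch of the lookup under the structure described above; the typedefs and the member names `hlink`/`dlink` are my assumptions about the quiz code:

```c
typedef struct {
    int capacity, count;
    struct list_head dhead, hheads[];
} LRUCache;

typedef struct {
    int key, value;
    struct list_head hlink, dlink;
} LRUNode;

int lRUCacheGet(LRUCache *obj, int key)
{
    LRUNode *lru;
    int hash = key % obj->capacity;
    /* MMM2: walk only the bucket this key hashes to */
    list_for_each_entry (lru, &obj->hheads[hash], hlink) {
        if (lru->key == key) {
            /* re-mark as most recently used */
            list_move(&lru->dlink, &obj->dhead);
            return lru->value;
        }
    }
    return -1;
}
```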
Add a new cache node.
Since `MMM3` is where we search for the key in the bucket, MMM3 = `list_for_each_entry`.
`MMM4` is where we evict the last node in `dhead`, so MMM4 = `list_last_entry`.
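And a sketch of the insertion with MMM3 and MMM4 filled in, under the same assumed types:

```c
void lRUCachePut(LRUCache *obj, int key, int value)
{
    LRUNode *lru;
    int hash = key % obj->capacity;
    /* MMM3: if the key already exists, update it in place */
    list_for_each_entry (lru, &obj->hheads[hash], hlink) {
        if (lru->key == key) {
            list_move(&lru->dlink, &obj->dhead);
            lru->value = value;
            return;
        }
    }
    if (obj->count == obj->capacity) {
        /* MMM4: cache full; evict the least recently used node,
         * i.e. the last entry of dhead */
        lru = list_last_entry(&obj->dhead, LRUNode, dlink);
        list_del(&lru->dlink);
        list_del(&lru->hlink);
    } else {
        obj->count++;
        lru = malloc(sizeof(LRUNode));
    }
    lru->key = key;
    lru->value = value;
    list_add(&lru->hlink, &obj->hheads[hash]);
    list_add(&lru->dlink, &obj->dhead);
}
```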
TODO:
I haven't figured out a way to write a test file to generate test data for this problem.
I did find this paper that can potentially point me in the right direction.
A node in the list, containing a number and a `list_head` node. This implements an intrusive linked list.
int longestConsecutive(int *nums, int n_size)
Traverse the input and put all the numbers into the hash table. If a number appears more than once, only one node is added for it.
Since any number appears only once in this hash table, we could refer to it as a hash set.
Traverse the list and go from each new, unvisited number upwards and downwards to get the count of consecutive numbers.
From the logic delineated in the comments above:
- going downwards extends the sequence with `--left`;
- going upwards extends it with `++right`.
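A sketch of that counting step, using `find()` and `list_delete()` as they are named in this write-up (their exact signatures are my assumptions):

```c
int len = 1;               /* count num itself */
int left = num, right = num;
struct seq_node *node;

while ((node = find(--left, n_size, heads))) {  /* extend downwards */
    len++;
    list_delete(&node->link);
}
while ((node = find(++right, n_size, heads))) { /* extend upwards */
    len++;
    list_delete(&node->link);
}
if (len > length)
    length = len; /* keep the best length so far */
```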
I wrote a Python program to test the code.
Test code here on Google Colab (comments are welcome!)
Here's a snippet of the code where I generate test data:
- `answer` is a random number that will be the size of the longest consecutive sequence.
- `start_num` is the starting number of that sequence. e.g. with `answer = 5` and `start_num = 3`: 3, 4, 5, 6, and 7 will be in the array, while 2 and 8 can't be in it.
- `array_size` is the size of the array. It should be greater than the answer.
- Besides the sequence of size `answer` starting at `start_num`, I also need to make sure no other sequence ends up longer than `answer` (which would change the answer).
- I keep a `list_of_sequences` to keep track of all the consecutive sequences.
- Rather than verifying that every other sequence stays shorter than `answer`, or that the gaps are all of size 1 so that filling any of them would result in a combined size bigger than `answer`, I just make sure the rest of the numbers fall between `start_num` and `start_num + answer`, so that the answer doesn't change.

In line 11, after calling `find()` on `num`, there is a while loop that executes if the return value `node` isn't `NULL` (i.e. `num` is found). However, I didn't think the loop would be executed more than once:
- `node` is last modified in the second inner while loop, which exits only when `node == NULL`; since the condition of the outer while loop is `node`, the next iteration of the outer loop can never happen. That outer while loop is therefore effectively the same as a simple `if` statement.
- After changing the `while` to an `if`, the tests ran fine; no errors occurred.

The complexity of the original code is $O(n)$:
- `list_delete(&node->link)` can only be called on each element once.
Since there is no escaping having to look at every node at least once, $O(n)$ is the best time complexity we can get, but we can still do some fine-tuning to make the execution slightly faster:
- The third loop runs up to `n_size` iterations. When the number of nodes not yet visited in the third loop is smaller than `length` (the greatest length so far), it's impossible to do better with the few nodes left. At that point, we can break away from the loop early:
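A sketch of that early exit (the loop index name is mine):

```c
/* at the top of the third loop: at most n_size - i candidates
 * remain, so once that can no longer beat `length`, stop */
if (n_size - i < length)
    break;
```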
- `struct list_head *heads = malloc(n_size * sizeof(*heads))` allocates memory; that memory needs to be freed too.

Here I recycle some of the code from problem one and try applying what I learned from that problem.
Most of the functions I need can be found in the first problem, except a node deletion function.
So I first look at how Linux implements hash node deletion:
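The kernel's helper, from include/linux/list.h, reads roughly:

```c
static inline void __hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	WRITE_ONCE(*pprev, next);
	if (next)
		WRITE_ONCE(next->pprev, pprev);
}
```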
`WRITE_ONCE` is defined in `linux/compiler.h`:
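In recent kernels its core boils down to this (simplified; the full macro adds compile-time type checks):

```c
#define WRITE_ONCE(x, val)                   \
do {                                         \
	*(volatile typeof(x) *)&(x) = (val);     \
} while (0)
```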
So `WRITE_ONCE(*pprev, next)` is effectively `*pprev = next`.
See this post to see why it is used.
This is my version of it:
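Such a version might look like the sketch below; the function name is hypothetical, and dropping `WRITE_ONCE` is my assumption, since this userspace exercise has no concurrent readers:

```c
static inline void hash_key_delete(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;        /* whoever pointed at n now points past it */
    if (next)
        next->pprev = pprev;
}
```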
I also modified the add function:
`longestConsecutiveSequence()`
The above code is also put in this Python notebook for testing.
Edotor: Online Graphviz Editor
Graphviz Brief Intro
Hashing - Washington University in St. Louis
Hashing - Cornell
B.R. Preiss
Exploring Knuth's Multiplicative Hash
Render math formula & graphs
Spirals
Fibonacci Numbers and Nature
The Golden Ratio (why it is so irrational) - Numberphile
Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo)
Syntax - CodiMD
Generic Hashing Functions and The Platform Problem