contributed by < `kdnvt` >
This problem introduces a list to record the hazard pointers and a list to record the retired pointers (the retired list).
Notice that every time a thread accesses the shared object (in this case, the object `shared_config` points to), it inserts the address of the object into the hazard pointer list.
Thus, the same address may appear many times in the hazard pointer list. When a reader finishes its operation, it removes one occurrence of the address; the object is safe to free only when no entry in the hazard pointer list holds its address.
The data structure is as shown below:
When the last reader finishes reading A, it removes A from the hazard pointer list. The implementation doesn't unlink the node from the list; it just uses an atomic CAS to reset the entry to NULL. This saves the overhead of allocating and releasing a node every time (otherwise we might need another garbage collection mechanism for the nodes themselves).
Moreover, it avoids the complications of different threads corrupting the linked list by inserting and removing nodes simultaneously.
The diagram below shows the data structure after A is removed from the hazard pointer list:
ref: graphviz
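A minimal sketch of this idea (the slot array and function names below are mine, not identifiers from the quiz code):

```c
#include <stdatomic.h>
#include <stddef.h>

#define HP_SLOTS 16

/* Fixed pool of hazard-pointer entries: each entry either holds the address
 * a reader is protecting, or NULL if it is free. */
static _Atomic(void *) hp_slots[HP_SLOTS];

/* A reader "removes" its entry by CAS-ing one matching slot back to NULL,
 * so no list node is ever unlinked, allocated, or freed. */
static void hp_clear(void *protected_ptr)
{
    for (size_t i = 0; i < HP_SLOTS; i++) {
        void *expected = protected_ptr;
        if (atomic_compare_exchange_strong(&hp_slots[i], &expected, NULL))
            return; /* cleared exactly one occurrence of the address */
    }
}
```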
Using AddressSanitizer, I found a memory leak in `list_free`: it only frees the node, but it should also free `node->ptr`.
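A sketch of the fix; the node layout below is an assumption based on the description above, not necessarily the exact structure in the quiz code:

```c
#include <stdlib.h>

typedef struct node {
    void *ptr;          /* payload recorded by the list */
    struct node *next;
} node_t;

/* Free both the payload and the node itself; previously only the node was
 * freed, which is the leak AddressSanitizer reported. */
static void list_free(node_t *head)
{
    for (node_t *n = head; n;) {
        node_t *next = n->next;
        free(n->ptr);
        free(n);
        n = next;
    }
}
```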
It's worth mentioning that `atomic_exchange`, which is used in `swap`, uses `__ATOMIC_SEQ_CST` as its memory order.
In the function `writer_thread`, we expect the random value to be written into `new_config` and `shared_config` to then point to it, so that readers read the new shared object. However, on machines with a weak memory model such as DEC Alpha, the two stores may be reordered (Store/Store) without a memory barrier; that is, `shared_config` may receive the new object's address before the random value has been written into the object. If a reader reads the value inside this window, it may get the original contents of `new_config`, which depend on the implementation of `create_config()` (in this case, zero).
Therefore, it is more portable to use a memory barrier, `__ATOMIC_SEQ_CST` in this case.
A similar memory barrier is also used in `load`.
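As a hedged sketch of the publication step (the type, field, and function names here are placeholders, not the quiz code), the point is that the stores into the new object must become visible before the pointer swap:

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    int v1, v2;
} config_t;

static _Atomic(config_t *) shared_config;

/* Returns the old pointer so the caller can retire it. */
static config_t *publish_config(int value)
{
    config_t *new_config = calloc(1, sizeof(*new_config));
    new_config->v1 = value;      /* plain stores into the new object ...   */
    new_config->v2 = value * 2;  /* ... must not drift past the swap below */

    /* memory_order_seq_cst (release would already order this Store/Store
     * pair) makes the initialized object visible before readers can see
     * its address through shared_config. */
    return atomic_exchange_explicit(&shared_config, new_config,
                                    memory_order_seq_cst);
}
```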
`atomic_load` and `atomic_cas` also use `__ATOMIC_SEQ_CST`. If the second `atomic_load` were reordered before the `atomic_cas`, a bug could occur when the writer changes the shared pointer inside that window.
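A simplified sketch of the load / protect / re-load pattern being discussed; here the hazard slot is written with a plain atomic store rather than the CAS-based list insertion the quiz code uses, but the ordering argument is the same:

```c
#include <stdatomic.h>

typedef struct config config_t;   /* opaque payload type for the sketch */

config_t *safe_read(_Atomic(config_t *) *shared, _Atomic(config_t *) *hazard)
{
    config_t *ptr;
    do {
        ptr = atomic_load_explicit(shared, memory_order_seq_cst);
        /* Publish the hazard pointer ... */
        atomic_store_explicit(hazard, ptr, memory_order_seq_cst);
        /* ... then load the shared pointer again: if this second load were
         * allowed to move above the protection, a writer could swap and
         * free the object inside that window without us noticing. */
    } while (ptr != atomic_load_explicit(shared, memory_order_seq_cst));
    return ptr;
}
```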
I considered replacing `__ATOMIC_SEQ_CST` with `__ATOMIC_RELEASE` in the `atomic_exchange` case, because so far I had only seen a possible error from Store/Store reordering, and `__ATOMIC_RELEASE` is a less expensive barrier than `__ATOMIC_SEQ_CST`.
After some analysis, however, I think `__ATOMIC_SEQ_CST` is necessary.
Consider the code below: `return val` and `free(val)` shouldn't happen simultaneously.
When the program works correctly, the `ptr == val` comparison in the reader fails, and the reader reads the new value from `ptr` in the next loop iteration.
However, if the Store/Load pair can be reordered in the writer thread, the result may be as below:
This time `ptr == val` is true, so both `return val` and `free(val)` are executed, which results in an error.
Now consider Store/Load reordering in the reader thread:
Store/Load reordering in the reader also results in an error.
In conclusion, both need to be `__ATOMIC_SEQ_CST` to prevent Store/Load reordering.
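To make the symmetry explicit, here is the writer-side counterpart to the reader loop sketched earlier (again with placeholder names); each side performs a store followed by a load, and it is exactly that Store/Load pair which must not be reordered:

```c
#include <stdatomic.h>
#include <stdlib.h>

/*   reader: store hazard slot, then load shared again (re-check)
 *   writer: store shared (exchange), then load hazard slot (scan)
 * If either load is hoisted above its store, the writer can miss the
 * reader's protection and free an object the reader is about to use,
 * which is the failing interleaving described above.  __ATOMIC_RELEASE
 * only constrains earlier stores, so __ATOMIC_SEQ_CST is needed here. */

static _Atomic(int *) shared;
static _Atomic(int *) hazard;   /* single reader slot, for the sketch */

void writer_replace(int *new_obj)
{
    /* store: publish the new object and take ownership of the old one */
    int *old = atomic_exchange_explicit(&shared, new_obj,
                                        memory_order_seq_cst);
    /* load: scan the hazard slot; this must stay after the exchange */
    if (old && old != atomic_load_explicit(&hazard, memory_order_seq_cst))
        free(old);   /* no reader protects it, safe to reclaim */
    /* otherwise it would go onto a retired list and be freed later */
}
```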
Notice that a release operation and a release fence are two different things. I was confused for a while until I read the article Acquire and Release Fences Don't Work the Way You'd Expect.
I did an experiment to simulate a similar situation: experiment
When using multiple readers and one writer, the program works well (although the problem description says ThreadSanitizer will report errors, I don't get any when executing the program).
However, when using multiple readers and multiple writers, I get error messages about a conflict between `free` and `calloc`. In addition, I think errors may also occur if multiple writers try to insert and remove nodes in the retired list simultaneously.
The paper describes the implementation of hazard pointers. I compared the paper with the corresponding implementation `hp_list`; one important difference is the threshold at which the retire function (`list_hp_retire` in `hp_list`) scans and releases the retired pointers. In `hp_list`, the following code sets the threshold, but `HP_THRESHOLD_R` is 0, which means `list_hp_retire` scans and releases on every call.
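A sketch of the kind of check meant here (field and constant names are assumptions, not a quote of `hp_list`); with `HP_THRESHOLD_R` equal to 0 the early return never fires:

```c
#include <stdint.h>

#define HP_THRESHOLD_R 0

typedef struct {
    int size;
    uintptr_t *list;
} retirelist_t;   /* simplified stand-in for the retired list */

void retire_sketch(retirelist_t *rl, uintptr_t ptr)
{
    rl->list[rl->size++] = ptr;
    if (rl->size < HP_THRESHOLD_R)
        return;    /* would defer the scan, but 0 means "never defer" */
    /* ... scan all hazard pointers and free the unprotected entries ... */
}
```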
Since `hp_list` traverses every reader thread's hazard pointers for the check, instead of using a sorted dictionary structure or a hash table, checking the retired pointers costs O(MR), where M is the number of readers, N the number of writers, and R the length of the retired list. To improve performance, I tried adding a variable `hp_count` that counts the total number of hazard pointers, and deferring the release until `hp_count` is less than 4/5 of the length of the retired list, so that the amortized time complexity reaches O(M).
I had a misunderstanding about amortized cost: I thought it simply meant distributing the total cost over all operations, so the amortized cost of `list_hp_retire` would be O(MR/R). Nevertheless, after thinking the problem through with the potential method, I found that this reasoning is flawed.
Let the potential function be $\Phi = cR$, where $R$ is the length of the retired list and $c$ is a constant.
When using the hash table mentioned in the paper, the actual cost of a scan is O(R), because checking one retired pointer against the hazard pointers only needs O(1).
Since the number of hazard pointers is less than $\frac{4R}{5}$, the scan is guaranteed to release at least $\frac{R}{5}$ pointers, so the potential decreases by at least $\frac{cR}{5}$.
However, `hp_list` doesn't use a hash table, so the actual cost of a scan is O(MR).
To look at it another way: in the hash-table case, the O(R) scan cost is distributed over the R insertions, so each operation is O(1). In this problem, distributing the O(MR) scan cost over the insertions raises the cost of each insertion from O(1) to O(M), which is not what we want.
In conclusion, to take advantage of the amortized analysis, we need a hash table to achieve O(1) time complexity when checking hazard pointers.
reference: potential method
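The argument can be summarized with the potential method; writing $a$ for the constant cost of one hazard-pointer check (my notation), and assuming as above that a scan only runs once the hazard-pointer count is below $\frac{4R}{5}$, so at least $\frac{R}{5}$ entries are freed:

$$
\begin{aligned}
\Phi &= cR, \qquad R = \text{length of the retired list},\\
\text{with a hash table: } \hat{c}_{\text{scan}} &= aR + \Delta\Phi \le aR - \tfrac{cR}{5} \le 0 \quad \text{for } c \ge 5a,\\
\text{in hp\_list: } \hat{c}_{\text{scan}} &= aMR + \Delta\Phi \ge aMR - cR = \Theta(MR) \ \text{for any fixed } c.
\end{aligned}
$$

Each retire call also pays its own O(1) insertion cost plus the potential increase $c$, so only the hash-table variant reaches O(1) amortized cost per operation.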
This implementation was done before I found that my amortized analysis was wrong. Even though the complexity can't reach constant amortized cost, I still think deferring the work to reduce the number of scans is good practice.
First, I introduced the global variable `hp_count`.
`hp_count` is updated with CAS whenever a reader starts protecting a new hazard pointer or finishes reading. Notice that I added an `if (!ptr)` check in `list_hp_protect_ptr` and `list_hp_protect_release`, so they don't protect the NULL pointer.
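A sketch of this bookkeeping (placeholder names; the modification described above updates the counter with CAS, while a fetch-add is shown here since it is equivalent for a plain counter):

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_int hp_count;   /* total number of live hazard pointers */

void protect_ptr_sketch(_Atomic(uintptr_t) *slot, uintptr_t ptr)
{
    if (!ptr)
        return;   /* added check: never count or protect a NULL pointer */
    atomic_store_explicit(slot, ptr, memory_order_seq_cst);
    atomic_fetch_add(&hp_count, 1);
}

void clear_ptr_sketch(_Atomic(uintptr_t) *slot)
{
    if (atomic_exchange_explicit(slot, 0, memory_order_seq_cst))
        atomic_fetch_sub(&hp_count, 1);   /* only if a pointer was held */
}
```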
Then I added the new dynamic threshold to `list_hp_retire`, using a shift operation on `hp_count` to compute the threshold.
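One way such a check might look (an assumption about the intent, not the actual patch): scan only when the retired list is at least 5/4 of `hp_count`, i.e. when `hp_count` has fallen below 4/5 of the list length, with the extra quarter computed by a shift:

```c
#include <stdatomic.h>
#include <stdbool.h>

extern atomic_int hp_count;   /* maintained as in the previous sketch */

/* Defer the scan in list_hp_retire unless at least ~1/5 of the retired
 * list is guaranteed to be unprotected. */
static bool should_scan(int retired_len)
{
    int hp = atomic_load(&hp_count);
    return retired_len >= hp + (hp >> 2);   /* retired_len >= 5/4 * hp_count */
}
```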
I also used the macro `TRACE` to add runtime instrumentation of the traversals of the retired lists and hazard pointers.
I used `clock_gettime` in `main` to roughly measure the execution time, but there is still too much interference, so the numbers are not a reliable reference.
Without threshold:
With threshold:
To my surprise, the retired_traversal count is the same in both cases.
Printing `hp_count` on each call of `list_hp_retire` shows that it is zero the whole time. It seems the hazard pointers are consumed by the readers very quickly, so whenever `list_hp_retire` is called it is already safe to release the pointers.
This phenomenon suggests another improvement: instead of traversing the whole list, we can first check `hp_count` and release the pointers directly if `hp_count` is zero.
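A sketch of that fast path (same placeholder declarations as the earlier sketches); it relies on the reader's re-check handshake, so a reader that grabs a pointer after it has been retired will notice the swap and retry rather than use a freed object:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

extern atomic_int hp_count;

typedef struct {
    int size;
    uintptr_t *list;
} retirelist_t;

void retire_scan_sketch(retirelist_t *rl)
{
    if (atomic_load(&hp_count) == 0) {
        /* Nothing is protected right now: free everything without
         * traversing the per-thread hazard-pointer arrays at all. */
        for (int i = 0; i < rl->size; i++)
            free((void *) rl->list[i]);
        rl->size = 0;
        return;
    }
    /* ... otherwise fall back to the full traversal ... */
}
```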
`hp_list` also optimizes for the cache. The following code places each `rl` in a different cache line. Although x86 uses 64-byte cache lines, I guess the 128 in this program is for portability, in case some machines use 128-byte cache lines. However, I am confused by the `alignas(128)` in the structure, since I think `alignas(16)` (the smallest power of two not less than `sizeof(retirelist_t)`) would be sufficient to meet the requirement. I think the `alignas(128)` is meant to spread `hp` and `rl` across different cache lines via alignment, instead of inserting the padding variables I am more familiar with.
After reexamining the code, I think the `CLPAD` in this program is superfluous. Since the value of `rl[i]` is just the address of a `retirelist_t`, `rl[i]` itself is never modified after initialization. Therefore, the scenario in which writer threads frequently invalidate each other's cache lines doesn't happen.
If we want to prevent the allocator from placing multiple `retirelist_t` objects at adjacent addresses, which would cause false sharing of a cache line, it is better to make `retirelist_t` itself as big as a cache line and cache-line aligned, rather than just `sizeof(size)` + `sizeof(list)` bytes. The alignment isn't related to false sharing, but it keeps one writer from having to cache two cache lines for a single `retirelist_t`.
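A sketch of what that could look like, keeping 128 for the portability reason above (the constructor name is mine); aligning a member to the cache-line size both aligns the whole struct and pads `sizeof` up to one full line, and heap allocations then need `aligned_alloc` rather than plain `malloc` to honor it:

```c
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 128

typedef struct {
    alignas(CACHELINE) int size;   /* aligning a member aligns the struct */
    uintptr_t *list;
    /* implicit tail padding brings sizeof up to CACHELINE */
} retirelist_t;

_Static_assert(sizeof(retirelist_t) == CACHELINE,
               "one retirelist_t occupies exactly one cache line");

static retirelist_t *retirelist_new(void)
{
    /* malloc only guarantees max_align_t alignment, so use aligned_alloc */
    return aligned_alloc(CACHELINE, sizeof(retirelist_t));
}
```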