# 2022q1 Homework5 (quiz5)

contributed by < `kdnvt` >

## Problem `2`

This problem introduces a list to record the hazard pointers and a list to record the retired pointers. Notice that every time a thread accesses the shared object (in this case, the object `shared_config` points to), it inserts the address of the object into the hazard pointer list. Thus, the same address may appear several times in the hazard pointer list. When a reader finishes its operation, it removes one occurrence of the address; the object is safe to free once no occurrence of its address remains in the hazard pointer list. The data structure is shown below:

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> hazard ptr | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    c [label="{ <data> B | <ref> }"];
    d [label="{ <data> B | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
    b:ref:c -> c:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
    c:ref:c -> d [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
}
```

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> retired | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
}
```

When the last reader finishes reading A, it removes A from the hazard pointer list. The implementation doesn't actually "remove the node" from the list; it just uses an atomic CAS to reset the recorded address to NULL (a small sketch of this slot-clearing removal follows the diagrams below). This saves the overhead of allocating and releasing a node on every operation (otherwise we might need another garbage-collection mechanism for the nodes themselves). Moreover, it avoids the trouble of the linked list being corrupted when different threads insert and remove nodes simultaneously. The diagrams below show the data structure after A is removed from the hazard pointer list:

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> hazard ptr | <ref> }", width=1.2]
    b [label="{ <data> NULL | <ref> }"];
    c [label="{ <data> B | <ref> }"];
    d [label="{ <data> B | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
    b:ref:c -> c:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
    c:ref:c -> d [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
}
```

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> retired | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
}
```

ref: [graphviz](https://gist.github.com/adamatan/5195647)
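For illustration, here is a minimal sketch of the "clear the slot with CAS" removal described above. The node layout and function name are hypothetical (the quiz's actual `list_remove` may differ in its details); the point is that the node itself is kept for reuse and only its recorded address is reset:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node layout matching the description above: each slot holds
 * one protected address, and empty slots are reused instead of freed. */
typedef struct hp {
    uintptr_t ptr; /* 0 means "empty slot" */
    struct hp *next;
} hp_t;

/* "Remove" a protected address: find a slot holding `ptr` and CAS it back
 * to 0.  The node stays in the list so a later reader can reuse the slot. */
static bool hp_release(hp_t *head, uintptr_t ptr)
{
    for (hp_t *node = head; node; node = node->next) {
        uintptr_t expected = ptr;
        if (__atomic_compare_exchange_n(&node->ptr, &expected, 0, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return true; /* slot cleared, node kept for reuse */
    }
    return false; /* this thread did not hold `ptr` */
}
```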
### Memory leak

I used AddressSanitizer and found a memory leak in `list_free`: it only frees the node itself, but it should also free the object that `node->ptr` points to.

```diff
void list_free(hp_t **head)
{
    hp_t *cur = *head;
    while (cur) {
        hp_t *old = cur;
        cur = cur->next;
+       free((void *)(old->ptr));
        free(old);
    }
}
```

### Memory barrier

It's worth mentioning that `atomic_exchange`, which is used in `swap`, uses `__ATOMIC_SEQ_CST` as its memory order.

```c
#define atomic_exchange(ptr, val) \
    __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST)

void swap(domain_t *dom, uintptr_t *prot_ptr, uintptr_t new_val, int flags)
{
    const uintptr_t old_obj = atomic_exchange(prot_ptr, new_val);
    cleanup_ptr(dom, old_obj, flags);
}
```

In `writer_thread`, we expect the random values to be written into `new_config` and `shared_config` to then point to it, so that a reader always reads the new, fully initialized object. However, on machines with a weak memory model such as DEC Alpha, Store/Store reordering can happen without a memory barrier; that is, `shared_config` may receive the new object's address before the random values are written into the object. If a reader reads in that window, it may observe the initial contents of `new_config`, which depend on the implementation of `create_config()` (in this case, zero).

```c
static void *writer_thread(void *arg)
{
    (void) arg;

    for (int i = 0; i < N_ITERS / 2; ++i) {
        config_t *new_config = create_config();
        new_config->v1 = rand();
        new_config->v2 = rand();
        new_config->v3 = rand();
        print_config("updating config", new_config);

        swap(config_dom, (uintptr_t *) &shared_config, (uintptr_t) new_config,
             1);
        print_config("updated config ", new_config);
    }

    return NULL;
}
```

Therefore, it is more portable to include a memory barrier; `__ATOMIC_SEQ_CST` is used here. A similar barrier is also used in `load`.

```c
uintptr_t load(domain_t *dom, const uintptr_t *prot_ptr)
{
    const uintptr_t nullptr = 0;

    while (1) {
        uintptr_t val = atomic_load(prot_ptr);
        hp_t *node = list_insert_or_append(&dom->pointers, val);
        if (!node)
            return 0;

        /* Hazard pointer inserted successfully */
        if (atomic_load(prot_ptr) == val)
            return val;

        /*
         * This pointer is being retired by another thread - remove this hazard
         * pointer and try again. We first try to remove the hazard pointer we
         * just used. If someone else used it to drop the same pointer, we walk
         * the list.
         */
        uintptr_t tmp = val;
        if (!atomic_cas(&node->ptr, &tmp, &nullptr))
            list_remove(&dom->pointers, val);
    }
}
```

```c
#define atomic_load(src) __atomic_load_n(src, __ATOMIC_SEQ_CST)
#define atomic_cas(dst, expected, desired)                                   \
    __atomic_compare_exchange(dst, expected, desired, 0, __ATOMIC_SEQ_CST,   \
                              __ATOMIC_SEQ_CST)

hp_t *list_insert_or_append(hp_t **head, uintptr_t ptr)
{
    hp_t *node;
    bool need_alloc = true;

    LIST_ITER (head, node) {
        uintptr_t expected = atomic_load(&node->ptr);
        if (expected == 0 && atomic_cas(&node->ptr, &expected, &ptr)) {
            need_alloc = false;
            break;
        }
    }

    if (need_alloc)
        node = list_append(head, ptr);

    return node;
}
```

The `atomic_load` and `atomic_cas` also use `__ATOMIC_SEQ_CST`. If the second `atomic_load` were reordered before the `atomic_cas`, a bug could occur when the writer changes the shared pointer within that window.

~~I considered replacing `__ATOMIC_SEQ_CST` with `__ATOMIC_RELEASE` in the `atomic_exchange` case, because so far I had only seen a possible error from Store/Store reordering.~~ `__ATOMIC_RELEASE` is a less expensive barrier than `__ATOMIC_SEQ_CST`. After some analysis, however, I think `__ATOMIC_SEQ_CST` is necessary.
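The interleavings analyzed below are instances of the classic store-buffering (Store/Load) litmus test. As a self-contained illustration (hypothetical code, not part of the quiz program), the sketch below mirrors that pattern: each thread stores to one variable and then loads the other. With `seq_cst` ordering at least one thread must observe the other's store, whereas weaker orderings allow both loads to return 0, which is exactly the Store/Load reordering at issue.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y; /* stand-ins for "ptr" and "the list" */
static int r0, r1;

static void *t0(void *arg)
{
    (void) arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  /* "store on ptr"  */
    r0 = atomic_load_explicit(&y, memory_order_seq_cst); /* "load on list"  */
    return NULL;
}

static void *t1(void *arg)
{
    (void) arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);  /* "store on list" */
    r1 = atomic_load_explicit(&x, memory_order_seq_cst); /* "load on ptr"   */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With seq_cst, r0 == 0 && r1 == 0 is forbidden.  A single run only
     * samples one interleaving, so a real test must loop many times. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}
```

If the two stores are weakened to `memory_order_release` (or everything to `memory_order_relaxed`), the outcome `r0 == 0 && r1 == 0` becomes possible on weakly ordered hardware, because nothing then prevents each load from being satisfied before the other thread's store leaves its write buffer.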
Consider the interleavings below: `return val` in the reader and `free(val)` in the writer should never both execute for the same object.

```
        reader             |            writer
---------------------------------------------------
                           | /* store on ptr */
                           | val = atomic_exchange(ptr);
                           |
                           | /* load old */
                           | if(!list_contain(val)){
 /* store on list */       |
 list_insert(val);         |
                           |
 /* load on ptr */         |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
                           |     free(val);
                           | }
```

Here the program works fine: `ptr == val` in the reader fails, and the reader picks up the new value of `ptr` on the next loop iteration. However, if the Store/Load pair in the writer thread can be reordered, the result may be as follows:

```
        reader             |            writer
---------------------------------------------------
                           | /* store on ptr */
 /* invisible to           | val = atomic_exchange(ptr,new);
    reader */              | /* still in write buffer */
                           |
                           | /* load old */
                           | if(!list_contain(val)){
 /* store on list */       |
 list_insert(val);         |
                           |
 /* load on ptr */         |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
 /* visible to             |     // ptr = new;
    reader now */          |     free(val);
                           | }
```

This time `ptr == val` in the reader is true, so `return val` and `free(val)` are both executed, which results in an error. Now consider Store/Load reordering in the reader thread:

```
        reader             |            writer
---------------------------------------------------
 /* store in list */       |
 list_insert(val);         |
                           |
 /* load ptr */            |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
                           | /* store in ptr */
                           | val = atomic_exchange(ptr,new);
                           |
                           | /* load list */
                           | if(!list_contain(val)){
                           |     free(val);
                           | }
```

In program order (above), the writer's `list_contain(val)` sees the reader's insertion, so `free(val)` is not executed. But if the reader's store to the list is still sitting in its write buffer when the writer scans the list:

```
        reader               |            writer
---------------------------------------------------
 /* store in list */         |
 list_insert(val);           | /* invisible to writer */
 /* still in buffer */       |
                             |
 /* load ptr */              |
 if(ptr == val){             |
     return val;             |
 }                           |
                             |
                             | /* store in ptr */
                             | val = atomic_exchange(ptr);
                             |
                             | /* load list */
                             | if(!list_contain(val)){
                             |     free(val);
                             | }
 // list_insert(val);        |
 /* visible to writer now */ |
                             |
```

Store/Load reordering in the reader therefore also results in an error. In conclusion, both sides need `__ATOMIC_SEQ_CST` to prevent Store/Load reordering.

Notice that a release operation and a release fence are two different things. I was confused for a while until reading [Acquire and Release Fences Don't Work the Way You'd Expect](https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/).

I did an experiment to simulate a similar situation: [experiment](https://hackmd.io/@kdnvt/StoreLoad-experiment)

### Multiple readers/writers with Hazard Pointer

With multiple readers and one writer, the program works well (although the problem description says the thread sanitizer will report errors, I don't get any when I run the program). However, with multiple readers and multiple writers, I get error messages caused by conflicts between `free` and `calloc`. Beyond these, I think errors may also occur if multiple writers insert and remove nodes in the retired list simultaneously.

### Study [Lock-Free Data Structures with Hazard Pointers](https://erdani.org/publications/cuj-2004-12.pdf)

The paper describes the implementation of hazard pointers; roughly, its retire/scan logic works as in the sketch below.
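The helper names and fixed-size arrays in this sketch are hypothetical (not the paper's exact code); it only illustrates the structure of the algorithm: each thread keeps a private retired list, and once it reaches a threshold, the thread snapshots every thread's hazard pointers, sorts the snapshot, and frees exactly the retired pointers that do not appear in it.

```c
#include <stdint.h>
#include <stdlib.h>

#define K 2                       /* hazard pointers per thread (assumed) */
#define P 8                       /* number of threads (assumed)          */
#define R_THRESHOLD (2 * K * P)   /* scan once this many pointers retire  */

/* Stand-in for the shared per-thread hazard pointer records; a real
 * implementation reads these slots with atomic loads. */
static uintptr_t hazard[P][K];

static uintptr_t rlist[R_THRESHOLD]; /* this thread's retired pointers */
static size_t rcount;

static int cmp_uintptr(const void *a, const void *b)
{
    uintptr_t x = *(const uintptr_t *) a, y = *(const uintptr_t *) b;
    return (x > y) - (x < y);
}

/* Stage 1: snapshot all hazard pointers; stage 2: free unprotected ones. */
static void scan(void)
{
    uintptr_t plist[P * K];
    size_t pcount = 0;
    for (int t = 0; t < P; t++)
        for (int i = 0; i < K; i++)
            if (hazard[t][i])
                plist[pcount++] = hazard[t][i];
    qsort(plist, pcount, sizeof(plist[0]), cmp_uintptr);

    size_t kept = 0;
    for (size_t i = 0; i < rcount; i++) {
        if (bsearch(&rlist[i], plist, pcount, sizeof(plist[0]), cmp_uintptr))
            rlist[kept++] = rlist[i]; /* still protected: keep for later */
        else
            free((void *) rlist[i]);  /* unprotected: reclaim now        */
    }
    rcount = kept;
}

/* Retiring defers the actual free until enough pointers have accumulated. */
static void retire(uintptr_t ptr)
{
    rlist[rcount++] = ptr;
    if (rcount >= R_THRESHOLD)
        scan();
}
```

With a sorted snapshot, checking one retired pointer costs O(log M); the paper also mentions using a hash table for the snapshot, which makes the check O(1) expected.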
I compared the paper with the corresponding implementation [`hp_list`](https://github.com/sysprog21/concurrent-programs/tree/master/hp_list). One important difference is the threshold at which the retire function (`list_hp_retire` in `hp_list`) scans and releases the retired pointers. In `hp_list`, the following code sets the threshold, and `HP_THRESHOLD_R` is 0, which means `list_hp_retire` scans and releases on every call.

```
    if (rl->size < HP_THRESHOLD_R)
        return;
```

Since `hp_list` traverses every reader thread's hazard pointers instead of using a sorted dictionary structure or a hash table, checking the retired pointers costs O(MR), where M stands for the number of readers (hazard pointer slots) and R stands for the length of the retired list.

To improve the performance, I tried adding a variable `hp_count` that counts the total number of hazard pointers, and deferring the scan until `hp_count` is at most 4/5 of the length of the retired list, ~~so that the amortized time complexity reaches O(M)~~. I had a misunderstanding of amortized cost: I thought it simply meant distributing the cost over all operations, so that the cost of `list_hp_retire` would originally be O(MR/R) per operation. However, when I analyzed the problem carefully with the potential method, I found that this is wrong.

Let $\Phi = 5 \times R$, where $R$ is the length of the retired list, and let the constant $C = 1$.

$$T_{amortized}(insert) = T_{actual}(insert) + C(\Phi(S_{after}) - \Phi(S_{before}))$$

$$T_{amortized}(insert) = 1 + 5C = O(1)$$

$$T_{amortized}(retire) = T_{actual}(retire) + C(\Phi(S_{after}) - \Phi(S_{before}))$$

When using the hash table mentioned in the paper, $T_{actual}(retire)$ is $O(R)$, because checking the hazard pointers for one retired pointer only takes O(1).

$$T_{amortized}(retire) = R + C\left(-5 \times \frac{R}{5}\right) = O(1)$$

Since the number of hazard pointers is less than $\dfrac{4}{5}R$, the scan is guaranteed to release at least $\dfrac{1}{5}R$ pointers, so the potential term $C(\Phi(S_{after}) - \Phi(S_{before}))$ is at most $-CR$ (a decrease of at least $CR$).

However, `hp_list` doesn't use a hash table, so the actual cost of a retire scan is O(MR):

$$T_{amortized}(retire) = MR + C\left(-5 \times \frac{R}{5}\right) = O(MR)$$

To look at it another way: in the hash-table case, the O(R) scan cost is distributed over the R insertions that preceded it, so each operation stays O(1). In this problem, however, distributing MR over those R insertions raises the cost of each insertion from O(1) to O(M), which is not what we want.

In conclusion, to take advantage of the amortized analysis, we need a hash table so that checking the hazard pointers for one retired pointer costs O(1).

reference: [potential method](https://en.wikipedia.org/wiki/Potential_method)
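As a concrete illustration of that conclusion, here is a minimal sketch (hypothetical code, not part of `hp_list`) of an open-addressing hash set that could be rebuilt from the hazard pointer array at the start of each scan, so that each retired pointer is checked in O(1) expected time instead of O(M):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Must be a power of two and comfortably larger than the maximum number of
 * simultaneously held hazard pointers, so that probing always terminates. */
#define HSET_SIZE 256

static uintptr_t hset[HSET_SIZE];

static size_t hset_slot(uintptr_t p)
{
    return (p >> 4) & (HSET_SIZE - 1); /* drop low bits due to alignment */
}

/* Rebuild the set from scratch at the beginning of every scan. */
static void hset_reset(void)
{
    memset(hset, 0, sizeof(hset));
}

static void hset_add(uintptr_t p)
{
    size_t i = hset_slot(p);
    while (hset[i] && hset[i] != p) /* linear probing */
        i = (i + 1) & (HSET_SIZE - 1);
    hset[i] = p;
}

static bool hset_contains(uintptr_t p)
{
    for (size_t i = hset_slot(p); hset[i]; i = (i + 1) & (HSET_SIZE - 1)) {
        if (hset[i] == p)
            return true;
    }
    return false;
}
```

Building the set costs O(M) once per scan, and each of the $R$ retired pointers is then checked in O(1) expected time, matching the $T_{actual}(retire) = O(R)$ assumed in the analysis above.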
#### Try to improve hp_list

This implementation was done before I found that my amortized analysis was wrong. Even though the complexity cannot reach a constant amortized cost, I think it is still good practice to reduce the number of operations by deferring the scan.

First, I introduced a global variable `hp_count`.

```c
int hp_count = 0;
```

`hp_count` is updated with CAS whenever a reader starts protecting a new hazard pointer or finishes reading. Notice that I added an `if (!ptr)` check to `list_hp_protect_ptr` and `list_hp_protect_release`, so they do not protect the NULL pointer.

```c
void list_hp_clear(list_hp_t *hp)
{
    for (int i = 0; i < hp->max_hps; i++) {
        if (atomic_exchange_explicit(&hp->hp[tid()][i], 0,
                                     memory_order_release)) {
            int old;
            do {
                old = atomic_load(&hp_count);
            } while (!atomic_compare_exchange_weak(&hp_count, &old, old - 1));
        }
    }
}
```

```c
uintptr_t list_hp_protect_ptr(list_hp_t *hp, int ihp, uintptr_t ptr)
{
    if (!ptr)
        return ptr;
    if (!atomic_exchange_explicit(&hp->hp[tid()][ihp], ptr,
                                  memory_order_seq_cst)) {
        int old;
        do {
            old = atomic_load(&hp_count);
        } while (!atomic_compare_exchange_weak(&hp_count, &old, old + 1));
    }
    return ptr;
}
```

```c
uintptr_t list_hp_protect_release(list_hp_t *hp, int ihp, uintptr_t ptr)
{
    if (!ptr)
        return ptr;
    if (!atomic_exchange_explicit(&hp->hp[tid()][ihp], ptr,
                                  memory_order_release)) {
        int old;
        do {
            old = atomic_load(&hp_count);
        } while (!atomic_compare_exchange_weak(&hp_count, &old, old + 1));
    }
    return ptr;
}
```

Then I added the new dynamic threshold to `list_hp_retire`, using a shift operation to compute $\dfrac{5}{4}$ `hp_count`. I also used the `TRACE` macro to add runtime statistics for the traversal of retired lists and hazard pointers.

```diff
enum {
    TRACE_nop = 0,
    TRACE_retry,     /* the number of retries in the __list_find function. */
    TRACE_contains,  /* the number of wait-free contains in the __list_find
                        function that curr pointer pointed. */
    TRACE_traversal, /* the number of list element traversal in the
                        __list_find function. */
    TRACE_fail,      /* the number of CAS() failures. */
    TRACE_del,       /* the number of list_delete operation failed and
                        restart again. */
    TRACE_ins,       /* the number of list_insert operation failed and
                        restart again. */
    TRACE_inserts,   /* the number of atomic_load operation in list_delete,
                        list_insert and __list_find. */
    TRACE_deletes,   /* the number of atomic_store operation in list_delete,
                        list_insert and __list_find. */
+   TRACE_retired_traversal
};

struct runtime_statistics {
    atomic_uint_fast64_t retry, contains, traversal, fail;
    atomic_uint_fast64_t del, ins;
    atomic_uint_fast64_t load, store;
+   atomic_uint_fast64_t retired_traversal;
};
```

```diff
void list_hp_retire(list_hp_t *hp, uintptr_t ptr)
{
    retirelist_t *rl = hp->rl[tid() * CLPAD];
    rl->list[rl->size++] = ptr;
    assert(rl->size < HP_MAX_RETIRED);

    if (rl->size < HP_THRESHOLD_R)
        return;
+   if (((unsigned) atomic_load(&hp_count) >> 2) + atomic_load(&hp_count) >
+       rl->size)
+       return;

    for (size_t iret = 0; iret < rl->size; iret++) {
        uintptr_t obj = rl->list[iret];
        bool can_delete = true;
        for (int itid = 0; itid < HP_MAX_THREADS && can_delete; itid++) {
            for (int ihp = hp->max_hps - 1; ihp >= 0; ihp--) {
+               TRACE(retired_traversal);
                if (atomic_load(&hp->hp[itid][ihp]) == obj) {
                    can_delete = false;
                    break;
                }
            }
        }

        if (can_delete) {
            size_t bytes = (rl->size - iret) * sizeof(rl->list[0]);
            memmove(&rl->list[iret], &rl->list[iret + 1], bytes);
            rl->size--;
            hp->deletefunc((void *) obj);
        }
    }
}
```

I used `clock_gettime` in `main` to roughly measure the execution time, but there is still too much interference for the numbers to be a reliable reference.

Without the threshold:

```bash
$ ./list
time=73
retry             : 185
contains          : 0
traversal         : 21064122
fail              : 0
del               : 0
ins               : 0
load              : 63275803
store             : 4096
retired_traversal : 1572864
deletes           : 4098
inserts           : 4098
```

With the threshold:

```bash
$ ./list
time=78
retry             : 157
contains          : 0
traversal         : 21006957
fail              : 0
del               : 0
ins               : 0
load              : 63104185
store             : 4096
retired_traversal : 1572864
deletes           : 4098
inserts           : 4098
```

To my surprise, the number of `retired_traversal` is exactly the same. By printing `hp_count` on each call of `list_hp_retire`, I found that `hp_count` is zero the whole time.
It seems that the readers release their hazard pointers very quickly, so by the time `list_hp_retire` is called it is already safe to release the retired pointers. This phenomenon made me think of another improvement: instead of always traversing the whole hazard pointer array, we can first check `hp_count` and release the pointers directly when it is zero.

```diff
void list_hp_retire(list_hp_t *hp, uintptr_t ptr)
{
    retirelist_t *rl = hp->rl[tid() * CLPAD];
    rl->list[rl->size++] = ptr;
    assert(rl->size < HP_MAX_RETIRED);

    if (rl->size < HP_THRESHOLD_R)
        return;
    if (((unsigned) atomic_load(&hp_count) >> 2) + atomic_load(&hp_count) >
        rl->size)
        return;

    for (size_t iret = 0; iret < rl->size; iret++) {
        uintptr_t obj = rl->list[iret];
        bool can_delete = true;
+       if (atomic_load(&hp_count))
        for (int itid = 0; itid < HP_MAX_THREADS && can_delete; itid++) {
            for (int ihp = hp->max_hps - 1; ihp >= 0; ihp--) {
                TRACE(retired_traversal);
                if (atomic_load(&hp->hp[itid][ihp]) == obj) {
                    can_delete = false;
                    break;
                }
            }
        }

        if (can_delete) {
            size_t bytes = (rl->size - iret) * sizeof(rl->list[0]);
            memmove(&rl->list[iret], &rl->list[iret + 1], bytes);
            rl->size--;
            hp->deletefunc((void *) obj);
        }
    }
}
```

```bash
$ ./list
time=86
retry             : 170
contains          : 0
traversal         : 20976870
fail              : 0
del               : 0
ins               : 0
load              : 63013973
store             : 4096
retired_traversal : 0
deletes           : 4098
inserts           : 4098
```

### Avoid false sharing

`hp_list` also optimizes for the cache. The following code places each `rl` in a different cacheline. Although x86 uses 64-byte cachelines, I guess the 128 in this program is for portability, in case some machines use 128-byte cachelines. ~~However, I am confused about the `alignas(128)` in the structure, since I think `alignas(16)` (the smallest power of two not less than `sizeof(retirelist_t)`) would be sufficient for our requirement.~~ I think the `alignas(128)` is there to place `hp` and `rl` on different cachelines with `alignas`, instead of inserting explicit padding members, which is the approach I am more familiar with.

```c
#define CLPAD (128 / sizeof(uintptr_t))

struct list_hp {
    int max_hps;
    alignas(128) atomic_uintptr_t *hp[HP_MAX_THREADS];
    alignas(128) retirelist_t *rl[HP_MAX_THREADS * CLPAD];
    list_hp_deletefunc_t *deletefunc;
};
```

After reexamining the code, I am of the opinion that the `CLPAD` in this program is superfluous. Since `rl[i]` holds the address of a `retirelist_t`, `rl[i]` itself is never modified after initialization, so the scenario in which writer threads frequently invalidate each other's cachelines through this array does not arise. If the goal is to keep the allocator from placing several `retirelist_t` objects at adjacent addresses, which would cause false sharing because they share a cacheline, it is better to make `retirelist_t` itself as large as a cacheline and align it to `sizeof(size)` + `sizeof(list)` (16 bytes). The latter is not about false sharing; it keeps one writer from having to cache two cachelines for a single `retirelist_t`.

```c
typedef struct {
    alignas(16) int size;
    uintptr_t *list;
    uint8_t padding[128 - sizeof(int) - sizeof(uintptr_t *)];
} retirelist_t;
```
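One caveat about the structure above: with `alignas(16)` on the first member plus the compiler's padding between `size` and `list`, `sizeof(retirelist_t)` can exceed 128 bytes on a typical LP64 target (4 + 4 + 8 + 116 = 132, rounded up to 144), so the object may still straddle two cachelines. A hypothetical alternative (not the `hp_list` code) is to put the alignment on the whole structure, let the compiler insert the trailing padding, and verify the intent at compile time:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Aligning the whole object to 128 bytes forces sizeof() up to a multiple
 * of 128, so one retirelist_t occupies exactly one assumed 128-byte
 * cacheline and never shares it with a neighboring object. */
typedef struct {
    alignas(128) int size;
    uintptr_t *list;
} cacheline_retirelist_t;

static_assert(sizeof(cacheline_retirelist_t) == 128,
              "should fill exactly one assumed 128-byte cacheline");
static_assert(alignof(cacheline_retirelist_t) == 128,
              "should start on a cacheline boundary");
```

Note that heap allocations would additionally need `aligned_alloc` (or a similar aligned allocator) to actually honor the 128-byte alignment at run time.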