##### Gaspard Peduzzi, Ulysse Ramage
# Multiprocessor Architecture
## Report - Assignment 3
<style>
#doc {
font-family: 'Times New Roman';
}
</style>
We first start by finding data races that could occur in the initial implementation we were given.
There is a data race between `insert` and `delete`.
One example starts from the list `[1, 3]`. Suppose one thread inserts `2` while another thread deletes `3`: if `previous->next = current->next;` in `delete` happens-before `previous->next = new_node;` in `insert`, the insertion re-links the deleted node and the list ends up as `[1, 2, 3]` instead of `[1, 2]`.
With the reverse ordering, `delete` can instead unlink the freshly inserted node along with the deleted one: on longer lists, whole chains of nodes can become unreachable without ever being marked `to_remove`, which causes serious memory leaks.
There is another data race between `delete` and `search`.
A node can be marked `to_remove = 1` while `search` finds and returns it, yielding an invalid result. This happens when `if (current->val == val) {` in `search` happens-before `previous->next = current->next;` in `delete`.
A third data race is between `insert` and `search`.
If the two run in parallel, a search for a value that is concurrently being inserted may either find it or not, depending on the order of execution.
We present a first improvement on the original implementation: a single global lock that serializes all accesses to the linked list.
```c
#include <stdlib.h>
#include <omp.h>

typedef struct node {
    int val;
    struct node *next;
    int to_remove;
} node_t;

omp_lock_t lock; // Initialize once with omp_init_lock(&lock) before any list operation.

int insert(node_t *head, int val) {
    omp_set_lock(&lock);
    node_t *previous, *current;
    current = head; // head is a sentinel node, so the loop advances at least once.
    while (current && current->val < val) {
        previous = current;
        current = current->next;
    }
    if (current && current->val == val) { // This value already exists!
        omp_unset_lock(&lock);
        return -1;
    }
    // Here is the right position to insert the new node.
    node_t *new_node = malloc(sizeof(node_t));
    new_node->val = val;
    new_node->next = current;
    new_node->to_remove = 0;
    previous->next = new_node;
    omp_unset_lock(&lock);
    return 0;
}

int delete(node_t *head, int val) {
    omp_set_lock(&lock);
    node_t *previous, *current;
    if (head->next == NULL) { // The list is empty.
        omp_unset_lock(&lock);
        return -1;
    }
    previous = head;
    current = head->next;
    while (current) {
        if (current->val == val) {
            previous->next = current->next;
            current->to_remove = 1; // Another system component will free this node later.
            omp_unset_lock(&lock);
            return val;
        }
        previous = current;
        current = current->next;
    }
    omp_unset_lock(&lock);
    return -1;
}

int search(node_t *head, int val) {
    omp_set_lock(&lock);
    node_t *current = head->next;
    while (current) {
        if (current->val == val) {
            omp_unset_lock(&lock);
            return 0;
        }
        current = current->next;
    }
    omp_unset_lock(&lock);
    return -1;
}
```
The main issue with a single global lock is that, while it removes the data races, it also makes every operation sequential: nothing runs in parallel anymore, and the result is equivalent to calling the operations one after the other.
We thus present the next implementation using fine-grained locking, in which every node has its own lock. This means more code will be able to run in parallel while making sure each node is accessed by one thread at a time. We will add a new `lock` field to the `node_t` struct.
```c
typedef struct node {
    int val;
    struct node *next;
    int to_remove;
    omp_lock_t lock;
} node_t;

int insert(node_t *head, int val) {
    node_t *previous = NULL, *current = head;
    omp_set_lock(&current->lock);
    // head is a sentinel node, so the loop advances at least once.
    while (current && current->val < val) {
        if (previous != NULL)
            omp_unset_lock(&previous->lock);
        previous = current;
        current = current->next;
        if (current != NULL)
            omp_set_lock(&current->lock);
    }
    if (current && current->val == val) { // This value already exists!
        omp_unset_lock(&current->lock);
        if (previous != NULL)
            omp_unset_lock(&previous->lock);
        return -1;
    }
    if (current != NULL)
        omp_unset_lock(&current->lock);
    // Here is the right position to insert the new node.
    node_t *new_node = malloc(sizeof(node_t));
    new_node->val = val;
    new_node->next = current;
    new_node->to_remove = 0;
    omp_init_lock(&new_node->lock);
    previous->next = new_node;
    omp_unset_lock(&previous->lock);
    return 0;
}

int delete(node_t *head, int val) {
    node_t *previous = head, *current;
    omp_set_lock(&previous->lock);
    current = head->next;
    if (current == NULL) { // The list is empty.
        omp_unset_lock(&previous->lock);
        return -1;
    }
    omp_set_lock(&current->lock);
    while (current) {
        if (current->val == val) {
            previous->next = current->next;
            current->to_remove = 1; // Another system component will free this node later.
            omp_unset_lock(&current->lock);
            omp_unset_lock(&previous->lock);
            return val;
        }
        omp_unset_lock(&previous->lock);
        previous = current;
        current = current->next;
        if (current != NULL)
            omp_set_lock(&current->lock);
    }
    omp_unset_lock(&previous->lock);
    return -1;
}

int search(node_t *head, int val) {
    node_t *previous = head, *current;
    omp_set_lock(&previous->lock);
    current = head->next;
    if (current == NULL) { // The list is empty.
        omp_unset_lock(&previous->lock);
        return -1;
    }
    omp_set_lock(&current->lock);
    while (current) {
        if (current->val == val) { // Found it.
            omp_unset_lock(&current->lock);
            omp_unset_lock(&previous->lock);
            return 0;
        }
        omp_unset_lock(&previous->lock);
        previous = current;
        current = current->next;
        if (current != NULL)
            omp_set_lock(&current->lock);
    }
    omp_unset_lock(&previous->lock);
    return -1;
}
```
The list traversal in all three functions now uses hand-over-hand locking, holding two locks at a time.
Without the lock on the previous node, another thread could concurrently delete the current node by redirecting the pointer from the previous node to the next one.
Locking the previous node as well guarantees that the pair of nodes we are iterating over cannot change under us.
The data races are handled because each node has its own lock, so two threads can never access or modify the same node at once, combined with the hand-over-hand traversal.
This implementation is also deadlock-free: every thread acquires locks in ascending list order, from head towards tail, and never in the reverse order, so no cyclic wait can occur.
The performance of this implementation is better than with the global lock, since a larger fraction of the code runs in parallel.
Removing the global lock lets different operations proceed concurrently instead of strictly one after the other.
The main remaining bottleneck is lock contention, as described in https://preshing.com/20111118/locks-arent-slow-lock-contention-is/ - each node's lock is held for a very short time, but locks are acquired at a very high rate, since every operation traverses the list.
The next improvement over the initial implementation would be one without locks.
For this implementation of a lock-free linked list, we are going to use the atomic compare-and-swap mechanism, available in GCC as the `__sync_bool_compare_and_swap(ptr, oldval, newval)` intrinsic: it atomically replaces `*ptr` with `newval` if and only if `*ptr` equals `oldval`, and returns whether the swap happened.
Note that since we are no longer using locks, we declare the `next` pointer and the `to_remove` flag `volatile` (`struct node *volatile next;`), to tell the compiler that they may be modified at any time by other threads and to prevent unwanted caching optimizations; `volatile` by itself provides no atomicity, which is the job of the compare-and-swap.
```c
typedef struct node {
    int val;
    struct node *volatile next;
    volatile int to_remove;
} node_t;

int insert(node_t *head, int val) {
    node_t *previous, *current;
    current = head; // head is a sentinel node, so the loop advances at least once.
    while (current && current->val < val) {
        previous = current;
        current = current->next;
    }
    // Residual node: logically deleted, but still referenced.
    if (current && current->to_remove) {
        // Unreference it and retry.
        __sync_bool_compare_and_swap(&previous->next, current, current->next);
        return insert(head, val);
    }
    if (current && current->val == val) { // This value already exists!
        return -1;
    }
    // Here is the right position to insert the new node.
    node_t *new_node = malloc(sizeof(node_t));
    new_node->val = val;
    new_node->to_remove = 0;
    new_node->next = current;
    if (!__sync_bool_compare_and_swap(&previous->next, current, new_node)) {
        // Another thread changed previous->next: delete the new node and try again.
        free(new_node);
        return insert(head, val);
    }
    return 0;
}

int delete(node_t *head, int val) {
    node_t *previous, *current;
    if (head->next == NULL) { // The list is empty.
        return -1;
    }
    previous = head;
    current = head->next;
    while (current) {
        // Residual node: logically deleted, but still referenced.
        if (current->to_remove) {
            // Unreference it and retry.
            __sync_bool_compare_and_swap(&previous->next, current, current->next);
            return delete(head, val);
        }
        if (current->val == val) {
            if (__sync_bool_compare_and_swap(&previous->next, current, current->next)) {
                current->to_remove = 1; // Another system component will free this node later.
                return val;
            } else {
                // Another thread changed previous->next: try again.
                return delete(head, val);
            }
        }
        previous = current;
        current = current->next;
    }
    return -1;
}

int search(node_t *head, int val) {
    node_t *current = head->next;
    while (current) {
        // Ignore residual nodes: they are logically deleted.
        if (current->val == val && !current->to_remove) {
            return 0;
        }
        current = current->next;
    }
    return -1;
}
```
Since `search` is a read-only operation, it does not pose any concurrency issue and does not need any compare-and-swap: it never modifies the list, and only has to ignore nodes marked `to_remove` so that logically deleted values are not reported as present.
Note that in some cases there can be residual nodes: nodes marked `to_remove = 1` that are still referenced in the list, for instance when two overlapping deletes unlink adjacent nodes. When `insert` or `delete` finds one while traversing, it simply unreferences it with a compare-and-swap and retries the operation.
Data races can still occur with this implementation, but they are handled by the compare-and-swap mechanism. Before publishing any write to a node we previously selected, we verify that it is still the node we expect, i.e. that it has not been changed by another thread, using the `__sync_bool_compare_and_swap` intrinsic. If the node was touched, we restart the whole operation, so we never corrupt the list or work on stale data.
This implementation is more efficient in the common case: no thread ever blocks on a lock, so all threads can make progress in parallel.
It is not perfect, though, and can be less efficient than the fine-grained locking implementation in some cases, because the compare-and-swap mechanism forces a thread to restart its operation every time it loses a race.
That is to say, the implementation with locks will perform better than the lock-free implementation when there are many conflicts between operations (for example, many `insert` and `delete` operations with the same value) - in this situation, operations might be restarted many times before they succeed, and the overhead of repeatedly trying will be greater than the lock contention.