# Assignment 2
| Name | Student code |
| -------- | -------- |
| Simon Damberg | sida0650 |
| Theo Meier Ström | thme1370 |
## Numerical Integration
### Segmented

| Threads / Trapezoids | 1 | 10000 | 1000000 | 10000000 |
| ---------------- | --- | ----- | ------- | -------- |
| 1 | 82 | 124 | 14312 | 83617 |
| 2 | 74 | 119 | 9932 | 63857 |
| 4 | 75 | 116 | 8432 | 54102 |
| 8 | 72 | 102 | 6743 | 48216 |
| 16 | 76 | 98 | 4921 | 45871 |

### Lock
| Threads / Trapezoids | 1 | 10000 | 1000000 | 10000000 |
| -------- | -------- | --- | --- | -------- |
| 1 | 45 | 165 | 23368 | 140112 |
| 2 | 43 | 552 | 32616 | 210480 |
| 4 | 53 | 849 | 35357 | 364550 |
| 8 | 27 | 839 | 36618 | 297601 |
| 16 | 69 | 724 | 34919 | 262184 |

The result of the integration is, as expected, an approximation of pi. Our first implementation (segmented) divides the trapezoids into equally large, contiguous segments, one per thread. This requires no synchronization, since there are no overlapping reads or writes between the threads.

However, this implementation may not be the fastest. For example, trapezoids in the upper range of the interval might take longer to calculate than those in the earlier segments, which leaves some cores idle, waiting for the slower threads to finish.
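
For reference, a minimal sketch of the segmented approach, assuming the usual integrand 4/(1+x^2) on [0,1] (our assumption here; the actual function, thread count and trapezoid count come from the assignment):

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_TRAPEZOIDS 1000000

static double partial[N_THREADS];   /* one slot per thread: no shared writes */

static double f(double x) { return 4.0 / (1.0 + x * x); }

static void *worker(void *arg) {
    long id = (long)arg;
    double h = 1.0 / N_TRAPEZOIDS;                  /* trapezoid width */
    long per = N_TRAPEZOIDS / N_THREADS;
    long lo = id * per;
    long hi = (id == N_THREADS - 1) ? N_TRAPEZOIDS : lo + per;
    double sum = 0.0;
    for (long i = lo; i < hi; i++)                  /* own contiguous segment */
        sum += (f(i * h) + f((i + 1) * h)) * h / 2.0;
    partial[id] = sum;                              /* private slot, no lock needed */
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double total = 0.0;
    for (long i = 0; i < N_THREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("pi ~ %.10f\n", total);
    return 0;
}
```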
The second approach (lock) has each thread take one trapezoid at a time, compute its value and save it. This, however, requires synchronization around the reads and writes to the shared array, which slows down the computation.

As seen in the results, there is no speedup from 1 thread to 16 threads. The reason is the overhead of locking and unlocking the mutex an enormous number of times.

A better approach for the lock version would be to let each thread claim a chunk of trapezoids instead of a single one at a time, which reduces this overhead: with chunks of 8, the lock would be taken an eighth as often as it is now. A sketch of that idea follows below.
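
A minimal sketch of that chunked variant, again assuming the 4/(1+x^2) integrand (names and chunk size are illustrative, not taken from our handed-in code):

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N 1000000
#define CHUNK 8

static long next_index = 0;                 /* next unclaimed trapezoid */
static double total = 0.0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return 4.0 / (1.0 + x * x); }

static void *worker(void *arg) {
    (void)arg;
    double h = 1.0 / N;
    for (;;) {
        pthread_mutex_lock(&m);             /* claim a whole chunk under the lock */
        long lo = next_index;
        next_index += CHUNK;
        pthread_mutex_unlock(&m);
        if (lo >= N)
            break;
        long hi = (lo + CHUNK > N) ? N : lo + CHUNK;

        double sum = 0.0;
        for (long i = lo; i < hi; i++)      /* compute the chunk without the lock */
            sum += (f(i * h) + f((i + 1) * h)) * h / 2.0;

        pthread_mutex_lock(&m);             /* one accumulation per chunk, not per trapezoid */
        total += sum;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (long i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("pi ~ %.10f\n", total);
    return 0;
}
```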
## Sieve of Eratosthenes


In our implementation, each thread iterates over every k from 2 up to sqrt(max) and marks the multiples of each prime k inside its own chunk of numbers. Since every thread only reads from the sequentially computed list of primes up to sqrt(max) and only writes to its own chunk of numbers, no synchronization was needed.
The work was distributed to the threads in chunks of equal size between sqrt(max)+1 and max. This may have led to some threads taking longer than others, since additions and equality checks on larger values take longer to compute. We did not use any load balancing, so implementing it could possibly give a larger speedup. However, the algorithm supplied in the assignment states that chunks of equal size should be divided among the threads, so we did not try to change this. A sketch of the scheme follows below.
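
A minimal sketch of this scheme (array name, max value and thread count are illustrative, not our exact submission):

```c
#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define MAX 10000000L
#define N_THREADS 4

static char composite[MAX + 1];      /* 1 = crossed off */
static long sqrt_max;

static void *worker(void *arg) {
    long id = (long)arg;
    long per = (MAX - sqrt_max) / N_THREADS;
    long lo = sqrt_max + 1 + id * per;                     /* own chunk ... */
    long hi = (id == N_THREADS - 1) ? MAX : lo + per - 1;  /* ... of equal size */
    for (long k = 2; k <= sqrt_max; k++) {
        if (composite[k])
            continue;                                      /* k is not prime */
        long start = ((lo + k - 1) / k) * k;               /* first multiple of k >= lo */
        for (long m = start; m <= hi; m += k)
            composite[m] = 1;                              /* writes stay inside own chunk */
    }
    return NULL;
}

int main(void) {
    sqrt_max = (long)sqrt((double)MAX);

    /* sequential seeding: sieve 2..sqrt(max) */
    for (long k = 2; k <= sqrt_max; k++)
        if (!composite[k])
            for (long m = k * k; m <= sqrt_max; m += k)
                composite[m] = 1;

    pthread_t t[N_THREADS];
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);

    long count = 0;
    for (long n = 2; n <= MAX; n++)
        if (!composite[n])
            count++;
    printf("primes up to %ld: %ld\n", (long)MAX, count);
    return 0;
}
```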
As seen in the table and graph above, the program finishes faster as the number of threads increases. The speedup is largest in the beginning, going from 1 to 2 cores. The rate of speedup then levels off, because the overhead grows and because parts of the program still run sequentially (the initial seeding).
## Exercise 3
### Mutual Exclusion
Yes, this protocol always satisfies mutual exclusion. To break out of the outermost loop, it has to be your turn, and since the turn variable can only hold one thread's ID at a time, at most one thread can proceed into the critical section at a time.
### Starvation
No, this protocol is not starvation-free. Since the loop condition depends on a shared variable set by the threads, one thread can always be scheduled last and thus never see turn equal to its own thread ID. To prevent this, some form of fairness guarantee (bounded waiting) is needed.
### Deadlock
No, this protocol is not deadlock-free. For example:

1. Thread 1 sets turn = thread 1.
2. Thread 1 breaks out of the inner loop, since busy = false.
3. Thread 1 sets busy = true.
4. Thread 2 sets turn = thread 2.
5. Thread 2 is stuck in the inner loop, since busy = true.
6. Thread 1 cannot break out of the outer loop, since turn = thread 2.
7. Thread 1 repeats the outer loop and sets turn = thread 1.
8. Thread 1 is now also stuck in the inner loop, since busy = true.

Both threads now spin forever, because busy is never reset to false, so the protocol deadlocks.
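
For reference, this is the structure we assume the protocol has, reconstructed from the trace above (illustrative C, not code supplied with the assignment; turn and busy are the exercise's shared variables):

```c
/* Our reading of the exercise's protocol, reconstructed from the trace above. */
volatile int turn;          /* ID of the thread whose turn it is        */
volatile int busy = 0;      /* nonzero while some thread claims the CS  */

void enter(int me) {
    do {
        turn = me;          /* announce interest                        */
        while (busy)        /* inner loop: wait while someone is busy   */
            ;               /* spin */
        busy = 1;
    } while (turn != me);   /* outer loop: retry if turn was overwritten */
    /* critical section */
}

void leave(void) {
    busy = 0;
}
```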
## Exercise 4

With coarse-grained locking, you lock around the whole operation to ensure that every read, removal and insertion on the list is performed atomically. This is slow because, in theory, an insertion at the end, a deletion in the middle and a read at the front of the list could all happen simultaneously, yet the single lock forces them to run one at a time. This is where fine-grained locking comes in: each node in the list gets its own lock, and an operation only locks the nodes it actually affects. Unfortunately, we were unable to get fine-grained locking to work correctly in all cases, so we have excluded that code from our hand-in.
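
A minimal sketch of the coarse-grained idea on a sorted singly linked list (node layout and names are ours, not the assignment's interface), with the fine-grained contrast noted in a comment:

```c
#include <pthread.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
    /* a fine-grained version would add a per-node lock here instead */
};

static struct node *head = NULL;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;  /* one lock for the whole list */

/* Coarse-grained insert into a sorted list: the single lock is held across
 * the whole traversal, so every concurrent operation is serialized. */
void list_insert(int value) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return;
    n->value = value;

    pthread_mutex_lock(&list_lock);
    struct node **pp = &head;
    while (*pp != NULL && (*pp)->value < value)
        pp = &(*pp)->next;
    n->next = *pp;
    *pp = n;
    pthread_mutex_unlock(&list_lock);
}
```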
Fine-grained locking is a lot more scalable, both in thread count and in list size, since the chance of lock contention is smaller when the locks are spread out across the list. With coarse-grained locking, the size of the list and the thread count barely matter for performance, since only one operation can happen at a time. This is clear in the Mixed and Update results, where an increased thread count leads to fewer operations per second because of the overhead introduced by the lock.
TATAS, meaning test-and-test-and-set, is an algorithm that combines the atomic test-and-set instruction with a normal test (a plain read). You first check whether the lock is held using a normal read; once the lock appears to be free, you try to take it with the atomic test-and-set operation, which ensures the lock is actually acquired atomically. The reason for not just using test-and-set all the time is that atomic memory operations are far more expensive than normal reads and would slow the program down. A mutex, by contrast, always performs atomic operations, which makes it sometimes slower than TATAS. For inspiration on how to implement TATAS, we looked at the following code: https://en.wikipedia.org/wiki/Test_and_test-and-set.
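
A sketch of a TATAS spinlock along the lines of that pseudocode, written with C11 atomics (the type and function names here are our own):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;                     /* false = free, true = held      */
} tatas_lock;                               /* init with: tatas_lock l = { false }; */

void tatas_acquire(tatas_lock *l) {
    for (;;) {
        /* "test": spin on a cheap plain load while the lock is held */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* "test-and-set": one expensive atomic exchange once it looks free */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;                         /* exchange saw false: lock is ours */
        /* lost the race to another thread: go back to spinning on reads */
    }
}

void tatas_release(tatas_lock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```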
This can be seen in the Update and Mixed graphs: TATAS generally outperforms the mutex, since it uses fewer expensive atomic operations and more cheap reads. In the Read graph there is no difference, because the lock is never taken when only reading, as the list never changes.