## Relevant concepts
- **Stack allocation**: We learned in Assignment 0 that there are two
types of allocated memory: stack- and heap-allocated memory. One acquires
stack-allocated memory by declaring arrays inside functions, as
opposed to declaring arrays globally.
Memory allocated on the stack is limited in size, but tends to be faster.
Moreover, stack-allocated memory is thread-local and therefore provides an
opportunity to untangle the mutual impact of parallel threads on one another.
Consequently, it is an important consideration to employ stack-allocated
memory in the innermost iteration steps, i.e., the Newton iteration for an
individual point in the complex plane.
We plan to test this concept by comparing runtimes of variants of our program
using stack- and heap-allocated memory in the innermost iteration steps; a
minimal sketch of the two variants follows below.
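The following sketch contrasts the two variants we plan to compare; the function names and the scratch-buffer idea are illustrative only, not the final program:

```c
#include <stdlib.h>

#define SCRATCH_LEN 16

/* Variant A: scratch buffer on the stack -- thread local, and no
   malloc/free inside the hot loop. */
void newton_point_stack(double re, double im)
{
  double scratch[SCRATCH_LEN];
  scratch[0] = re;
  scratch[1] = im;
  /* ... Newton iteration for the point (re, im) using scratch ... */
}

/* Variant B: the same buffer on the heap -- allocated and freed per call. */
void newton_point_heap(double re, double im)
{
  double *scratch = malloc(SCRATCH_LEN * sizeof *scratch);
  scratch[0] = re;
  scratch[1] = im;
  /* ... Newton iteration for the point (re, im) using scratch ... */
  free(scratch);
}
```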
- **Memory fragmentation**: Memory fragmentation, meaning that allocated memory is scattered in main memory, can have a detrimental effect on the performance of a program. The archetypal example is a matrix, where allocating it as a single contiguous memory block can yield a significant performance improvement compared to the naive implementation of allocating, e.g., each row independently. This improvement is typically caused by two effects: first, allocations themselves are expensive, so minimizing their number is preferable; second, having the memory allocated contiguously typically improves cache performance. Note that the last effect does not necessarily always hold; see for example the section on false sharing below.
In C code this manifests itself as performing a single allocation (e.g., a single call to the malloc function) as opposed to several independent allocations, as sketched below.
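A minimal sketch of the two allocation strategies for an NxN matrix (the function names are ours for illustration):

```c
#include <stdlib.h>

/* Naive: one allocation per row -> n + 1 calls to malloc, and the rows
   may end up scattered across main memory. */
double **alloc_fragmented(size_t n)
{
  double **m = malloc(n * sizeof *m);
  for (size_t i = 0; i < n; ++i)
    m[i] = malloc(n * sizeof *m[i]);
  return m;
}

/* Contiguous: one block for all entries plus row pointers into it ->
   only two calls to malloc, and all entries adjacent in memory. */
double **alloc_contiguous(size_t n)
{
  double *block = malloc(n * n * sizeof *block);
  double **m = malloc(n * sizeof *m);
  for (size_t i = 0; i < n; ++i)
    m[i] = block + i * n;
  return m;
}
```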
- **Locality**: A topic associated with memory fragmentation is locality. Whereas memory fragmentation is concerned with the allocation of memory, locality considers the access patterns of memory in conjunction with the allocation strategy. Again a matrix serves as an excellent example: a matrix allocated in row-major order is considerably slower to access in column-major order, since this leads to significantly worse cache performance. This performance loss is entirely due to poor spatial locality of the accessed memory: when traversing the matrix in column-major order, each memory access will typically lead to a cache miss, since for an NxN matrix consecutive column elements are separated by N-1 other elements, and hence if N is large enough a new cache line has to be loaded for each element access.
This problem can be solved either by changing the allocation strategy, which in the previous example would mean allocating the matrix in column-major order, or by modifying the algorithm such that the access patterns align with the allocation strategy. The former approach is typically not preferable, as it quickly becomes unmaintainable for larger programs: the changed allocation strategy can affect other parts of the program that access the same memory. The preferred approach is thus to modify the access patterns of the algorithm, since such a change typically only affects a local part of the program and is therefore more maintainable; see the sketch after this item.
While memory fragmentation and locality certainly have to be considered and should be exploited for this exercise, there is another aspect of the memory layout that needs to be taken into account when writing parallel programs, which we consider next.
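A sketch of the access-pattern fix mentioned above: instead of traversing a row-major matrix column by column, we traverse it row by row so that consecutive accesses hit the same cache line.

```c
#include <stddef.h>

/* The matrix m of size n*n is stored row-major as one block. */

/* Poor spatial locality: consecutive accesses are n elements apart. */
double sum_column_major(const double *m, size_t n)
{
  double s = 0.;
  for (size_t j = 0; j < n; ++j)
    for (size_t i = 0; i < n; ++i)
      s += m[i * n + j];
  return s;
}

/* Good spatial locality: consecutive accesses are adjacent in memory. */
double sum_row_major(const double *m, size_t n)
{
  double s = 0.;
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      s += m[i * n + j];
  return s;
}
```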
- **False sharing**: False sharing occurs when, e.g., two threads read and at least one writes to memory locations that reside in the same cache line; this taints the cache line, which in turn causes the CPU to perform an automatic synchronization step. Note that, as opposed to a data race, the reads and writes of the two threads can access two (from the programmer's perspective) completely independent variables, which explains why this synchronization is performed automatically by the CPU. Hence, when planning the layout of a parallel program, one should not blindly optimize for locality; one also has to take false sharing into account and find a good compromise between the two.
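A minimal sketch of how false sharing can arise between two independent variables and how padding avoids it; the 64-byte cache line size is an assumption about the target hardware:

```c
/* Two per-thread counters adjacent in memory: they share a cache line,
   so a write by one thread invalidates the line in the other thread's
   cache even though the variables are logically independent. */
struct counters_shared {
  long a;   /* written by thread 0 */
  long b;   /* written by thread 1 */
};

/* Padding the first counter out to a full 64-byte cache line puts the
   counters on different lines, trading memory and locality for the
   absence of false sharing. */
struct counters_padded {
  long a;
  char pad[64 - sizeof(long)];
  long b;
};
```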
With the last four concepts explained we can now address how they will influence the memory layout of the program. Below is a list of the four data structures of the program where the memory layout needs to be carefully considered:
1. Result matrices: two NxN matrices will be needed for saving the
results of the Newton iterations.
2. Synchronization array: an array for communicating between the computing
and writing threads.
3. Arrays of colour strings: to optimize the writing part, two arrays of
precomputed colour strings will be used.
4. Array of roots: to check convergence of the Newton iterations,
an array containing the roots will be needed.
Below, a list with the corresponding memory layouts for the above data structures is shown, with a brief explanation for each item.
1. Items in rows allocated contiguously, but the rows themselves
fragmented in memory.
This allocation strategy was chosen since each row will be accessed by
one thread at a time, and hence having the elements in a row contiguously
allocated should improve cache performance without risk of false sharing.
On the other hand, several rows will be accessed by multiple threads at the
same time, and thus fragmenting the rows in memory should minimize the
risk of false sharing.
2. Array allocated contiguously in memory.
This may seem suboptimal considering the effect of false sharing, since
clearly a synchronization array will be accessed by several threads.
However, with the access pattern of our program this array needs explicit
synchronization anyway, and thus the effect of false sharing disappears.
3. A single contiguously allocated block of memory on the stack.
Since only one thread is used for writing to file, and no writing to the
array will be needed after its creation, false sharing will not
be present. Thus it makes sense to minimize the number of allocations and
to optimize for locality, which is done by allocating the strings as a
single block of memory. Further, since these arrays will be of limited
size, they can also be placed on the stack.
4. Array contiguously allocated on the stack.
These arrays will be of limited size and frequently accessed in the
Newton iteration. Thus it makes sense to give each thread a local,
stack-allocated copy of the array. Note that this has to be done carefully
such that the array is only created once per thread, as sketched below.
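As a sketch of layout 4 (names and sizes are hypothetical): each compute thread copies the roots into a stack array once, at thread start, and the Newton iteration then reads only the private copy.

```c
#include <complex.h>
#include <string.h>

#define MAX_DEGREE 10

static double complex global_roots[MAX_DEGREE]; /* filled by an init function */
static int degree = 5;

void *compute_thread_main(void *arg)
{
  /* Stack-allocated, thread-local copy, created once per thread rather
     than once per iteration or per point. */
  double complex roots[MAX_DEGREE];
  memcpy(roots, global_roots, (size_t)degree * sizeof *roots);

  /* ... loop over the assigned rows; the Newton iteration reads only the
     local copy, so no sharing with other threads occurs ... */
  (void)arg;
  return NULL;
}
```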
- **Inlining**:
An inlined function is a function whose body the compiler has substituted into the calling code. This increases performance for frequently called functions, since the function call overhead is eliminated. In Assignment 1 we saw how this significantly improved the runtime of a function whose body was fast to execute, so that the function call itself took a large part of the execution time. In this assignment, function inlining is mainly worth taking into account for the computation function, since it is called many times.
To generate inlined functions in C code one can add the static inline keywords to the function header. Note that this requires the definition of the function to be visible in the translation unit where it is called, meaning that one either has to place the function in the same C file where it is used or place the implementation of the function in a header file, as sketched below.
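A minimal sketch of the header-file approach (the file name and the degree-2 step are illustrative):

```c
/* newton_step.h -- the implementation lives in the header so that every
   C file including it gets an inlinable definition. */
#ifndef NEWTON_STEP_H
#define NEWTON_STEP_H

#include <complex.h>

/* Newton step for f(z) = z^2 - 1; static inline lets the compiler
   substitute the body at each call site. */
static inline double complex newton_step_d2(double complex z)
{
  return z - (z * z - 1.) / (2. * z);
}

#endif
```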
- **Parsing from command line**:
In Assignment 0 we learned a simple method of parsing command line arguments by passing them as arguments to main and using the function strtol. The same code could be reused in this assignment with only minor tweaks; a sketch follows below.
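A sketch of the kind of strtol-based parsing we reused; the flag names -tNUM and -lNUM and the bare degree argument are assumptions of this sketch, not necessarily the assignment's exact interface:

```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  long nthreads = 1, nlines = 0, degree = 0;

  for (int i = 1; i < argc; ++i) {
    if (argv[i][0] == '-' && argv[i][1] == 't')
      nthreads = strtol(argv[i] + 2, NULL, 10);  /* -tNUM */
    else if (argv[i][0] == '-' && argv[i][1] == 'l')
      nlines = strtol(argv[i] + 2, NULL, 10);    /* -lNUM */
    else
      degree = strtol(argv[i], NULL, 10);        /* bare degree argument */
  }

  printf("threads=%ld lines=%ld degree=%ld\n", nthreads, nlines, degree);
  return 0;
}
```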
- **Writing to file**
There are different functions that can be used in C for writing to a file: fwrite() writes binary data, while fprintf() writes formatted text. Because it is possible to write a number of precomputed strings, instead of regenerating them every time, fwrite() can be used. Performance considerations therefore favour fwrite() over fprintf(), as sketched below.
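A sketch of writing one precomputed colour string per pixel with fwrite(); the string table, its fixed entry length, and the function names are assumptions of this sketch:

```c
#include <stdio.h>

/* Precomputed colour strings, one per attractor index; the fixed length
   of 12 characters makes the fwrite() call trivial. */
static const char colors[3][13] = {
  "255 000 000 ",
  "000 255 000 ",
  "000 000 255 ",
};

static void write_row(FILE *f, const int *attractor, size_t n)
{
  for (size_t j = 0; j < n; ++j)
    fwrite(colors[attractor[j]], 1, 12, f);  /* no formatting at write time */
  fputc('\n', f);
}
```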
## Intended program layout
Per instructions, the program naturally splits into two subtasks: the
computation of the Newton iteration and the writing of results to the two
output files. Each of them will be implemented in a separate function,
intended to be run as the main function of the corresponding POSIX threads.
The computation of the Newton iteration can be further split into one part that depends on the degree $d$ and another part that does not. The $d$-independent task is mainly checking whether the algorithm has diverged, which is simply done in each iteration. For the $d$-dependent tasks, checking whether the algorithm has converged and performing the actual calculation, an initialization function is used to minimize the number of times the switch statement that dispatches on $d$ has to be executed. The initialization function sets a function pointer to a hardcoded, optimized expression for calculating the next step of the algorithm, and a pointer to an array of hardcoded roots of the polynomial. These pointers are then used in the Newton iteration without the need to check the degree repeatedly. The number of if statements that have to be executed is further reduced by comparing the current value $z$ with the roots only when the absolute value of $z$ is close to 1. This is especially helpful for higher-degree polynomials. A sketch of this initialization is given below.
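A sketch of the initialization idea; the names, the low-degree cases, and the roots shown are illustrative:

```c
#include <complex.h>
#include <stddef.h>

typedef double complex (*step_fn)(double complex);

/* Hardcoded, per-degree Newton steps for f(z) = z^d - 1. */
static double complex step_d1(double complex z) { (void)z; return 1.; }
static double complex step_d2(double complex z)
{
  return z - (z * z - 1.) / (2. * z);
}

/* Hardcoded roots of z^d - 1 for each degree. */
static const double complex roots_d1[] = { 1. };
static const double complex roots_d2[] = { 1., -1. };

static step_fn step;
static const double complex *roots;
static size_t nroots;

/* Run once, so the switch on d is executed here instead of inside the
   Newton iteration itself. */
static void init_degree(int d)
{
  switch (d) {
  case 1: step = step_d1; roots = roots_d1; nroots = 1; break;
  case 2: step = step_d2; roots = roots_d2; nroots = 2; break;
  /* ... one case per supported degree ... */
  }
}
```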
The writing thread first needs to compute the file headers and the colours that will be written to the files, depending on the results of the Newton iteration. The actual writing of rows waits until the Newton iterations for the row of interest have been computed. Once the row is computed, the thread writes the corresponding colours for attractors and convergence. If the next row has also been finished by the compute threads, it is written to file immediately after the previous one; otherwise the thread again waits until the next row is done. A sketch of this wait loop follows below.
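A sketch of the writer's row loop, assuming a mutex-protected flag array that the compute threads update as rows finish; the names, the polling-with-sleep approach, and the sleep interval are assumptions of this sketch:

```c
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

#define NROWS 1000

static pthread_mutex_t rows_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool row_done[NROWS];  /* set to true by the compute threads */

static void write_row(int row) { (void)row; /* ... fwrite the row ... */ }

static void *writer_main(void *arg)
{
  struct timespec nap = { .tv_sec = 0, .tv_nsec = 100000 };
  for (int row = 0; row < NROWS; ) {
    bool done;
    pthread_mutex_lock(&rows_mutex);
    done = row_done[row];
    pthread_mutex_unlock(&rows_mutex);
    if (done)
      write_row(row++);       /* next row may already be done: no wait */
    else
      nanosleep(&nap, NULL);  /* wait a moment and check again */
  }
  (void)arg;
  return NULL;
}
```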