# Assignment 2: OpenMP

# TODOs (:heavy_check_mark:)

## Implementation

* :heavy_check_mark: Block sizes fixed.
* :heavy_check_mark: After fixing the block size, implement ``dist_between_batches(*x0,...,x1*,...)``.
* :heavy_check_mark: Add large-file support in the ``else {...}`` block.

## Performance

* `SIMD` instructions.
* Different kinds of loop unrolling may yield something.
* There is also something called "memory alignment": making sure that large arrays fit on "pages". Not really sure what to do with that information, but we could ask someone whether it is something we could implement.
* Fit block sizes to cache lines: a cache line is 64 bytes long and a `short int` is 2 bytes, so the inner and outer block sizes should probably be 32. Separate inner and outer block sizes are then no longer needed, so they could be refactored into a single variable (`static const`/`#define`) `block_size`.
* :heavy_check_mark: Check in `perf` how much time is spent in different functions. All time is spent in the dist functions.

## Fixes and cleanup

* :heavy_check_mark: Add a `clean` target to the `makefile`, removing all files generated by `make all`. (Probably just the executable.)
* :heavy_check_mark: Remove non-specified ``printf()`` statements, i.e. the debugging info currently being printed.
  > Possibility: add a debug flag and print this only if it is set.
* :heavy_check_mark: Modify ``print_freqs`` to the specification: leading zero, space separator, no header.
* Clean up comments.

# Design

Rough overview of the implementation.

### Assumptions

The coordinates are assumed to lie between $-10$ and $10$. We assume that the data is stored as described in the task description.

## Tasks

## Type choices

To keep memory consumption down, we use small data types (where it makes a difference).

- Some kind of list of the type choices here?

## Memory consumption

Write (sketch) the calculations here.

### Handling large sets of data

# File parsing

If all data fits in memory, we perform only one file read and store everything in a buffer.
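The single-read strategy above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name is hypothetical, and error handling is kept to the bare minimum.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a heap-allocated buffer with one fread().
 * On success, *out_size holds the file size and the returned pointer
 * must be free()d by the caller; returns NULL on error. */
static char *read_whole_file(const char *path, long *out_size)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    fseek(f, 0, SEEK_END);     /* find the file size ... */
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);     /* ... and rewind to the start */

    char *buf = malloc(size);
    if (buf && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf)
        *out_size = size;
    return buf;
}
```

With the whole file in one buffer, all subsequent parsing is pure in-memory work and the cost of slow file reads is paid exactly once.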
This makes the parsing much more efficient, since file reads are intrinsically slow. If the file is too large, parsing is carried out in pieces and has to be done repeatedly.

## Parse function

We take advantage of the fixed input format to parse efficiently. As mentioned before, coordinates are stored in arrays of ``short int``, representing thousandths. In more detail, the parsing is carried out as follows: ``"+01.234" -> 0*10000 + 1*1000 + 2*100 + 3*10 + 4*1``.

## Parallelization

We distribute the parsing of the buffer evenly among the threads. All threads write to separate parts of the coordinate arrays; this, together with their start indices being far apart, should cause no (or negligible) cache invalidation between threads. A possible improvement would be to make each thread start its work aligned in memory, but implementing this seems unnecessary given the ratio between threads and rows in this application.

## Large data sets

To let the program handle large files, the parsing function takes a start index and the number of rows to read as arguments.

# Measuring distances

This task is by far the most computationally heavy: for `n` rows, the number of distance computations needed is of order `n^2`.

## Distance between points

The distance is computed with `sqrtf`, after converting to `float`.

- Perhaps something about a comparison here?

## Saving results

The results are saved in a `short int` array of length $3466$ ($\approx 2000\sqrt{3}$, the maximal possible distance in hundredths, see Assumptions). The *position* in the array indicates the distance, and the value at that position is the number of occurrences of the corresponding distance.

## Parallelization

The current batch of rows is split evenly between the threads: the $n$th thread is assigned the blocks whose block numbers are $\equiv n \pmod{T}$, where $T$ is the total number of threads. Each thread has its own "private" array of results.
When all computations are finished, these arrays are combined into one. This is fast compared to all threads working on a single shared result array, and it also makes sure there are no synchronization issues.

## Large data sets

# IO

The program takes an argument `-t D` that sets the maximum number of OpenMP threads to `D`. The structure of the results makes printing easy. We found that it was faster (and sidesteps any questions about floating-point behaviour) to print with `printf("%02d.%02d ...", d/100, d%100, ...)` rather than `printf("%05.2f ...", (float)d/100, ...)`. Since this statement is only executed approximately 3500 times, it probably does not really impact the runtime.

# Compiler flags

We compile with ``gcc`` using the native architecture and optimization level ``O3``.

# Performance

# Program flow

> Write pseudo-code for the algorithm

> Tomas Forssmark, Douglas Molin, Vincent Molin
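The integer-only printing discussed in the IO section can be sketched as a formatting helper. This is an illustration only; the helper name is hypothetical, and the actual program prints directly with `printf`.

```c
#include <stdio.h>

/* Format a distance stored in hundredths (e.g. 123 -> "01.23")
 * using integer arithmetic only, so no float conversion or
 * rounding is involved. */
static void format_dist(char *out, size_t cap, int d)
{
    snprintf(out, cap, "%02d.%02d", d / 100, d % 100);
}
```

Splitting `d` into `d/100` and `d%100` reproduces the required leading-zero, two-decimal format exactly, without ever leaving integer arithmetic.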