# Week 7 Exercise Solutions

## Before You Begin:

### Clone the Repository

https://github.com/UCL-PHAS0100-22-23/week_7_cpp_exercises

### Adding Optimisation Flags

In these exercises we will explore the use of compiler optimisation. These need to be turned on! You can do this by making changes to the top level `CMakeLists.txt` in each exercise folder. For example, to use the `-O1` flag for `MyExecutable` we use the line:
```
target_compile_options(MyExecutable PRIVATE -O1)
```
You can list multiple flags to turn on more than one. If you change the flags or add/remove this line in the `CMakeLists.txt` you can just use `cmake --build build` to rebuild your executable and see what effect it has had.

### Timing Functions Using `std::chrono`

You'll need to make use of [`<chrono>`](https://cplusplus.com/reference/chrono/) from the standard library in order to time your function calls. We've provided a bit of basic code to declare a clock and determine the time difference in seconds between two points in time as a `double`:
```cpp=
#include <chrono>

typedef std::chrono::high_resolution_clock myClock;

double t_seconds(std::chrono::time_point<myClock> t1, std::chrono::time_point<myClock> t2)
{
    return (t2 - t1).count() / 1e9;
}
```
The `typedef` isn't necessary but it makes code a lot less cluttered given all the namespaces and the long type names!

You can get a point in time in your code using:
```cpp=
auto time = myClock::now();
```
To time the execution of a piece of code you should get the time point before and after the code that you want to time, and then you can calculate the difference using `t_seconds`.

## Section 1: Complexity

### Exercise 1: Algorithm Analysis

In this exercise we'll look at some methods of storing and looking up data in key-value pairs (such as in a `map` or `dictionary` type in C++ or Python).

1. Go into the `map_examples` folder, and read the code for the methods:
    - `SortedList.h` describes a simple list data-structure where the elements are sorted by key.
    - `Table.h` describes a table data-structure with a fixed number of rows. Key-value pairs are placed into a row based on a "hash-function" called `hash`, which you may assume is $O(1)$ time. An ideal hash function converts data into an integer whilst minimising "collisions" i.e. different data should rarely map to the same number.
    - `Tree.h` describes a tree data-structure i.e. a graph where each node has (up to) two child nodes.
2. For each storage solution, determine based on the code provided:
    - What is the average case complexity of `insert` / `lookup`?
    - What is the worst case complexity of `insert` / `lookup`?
    - What conditions produce the worst case performance?
    - For the `Tree` consider how the ordering of keys added affects the tree structure.
    - For the `Table` consider how keys are mapped into rows, and consider what happens in the case where there are many more rows than key-value pairs (take this as the average case), and the case where there are more key-value pairs than rows.
    - **You may assume that `push_back` is $O(1)$, and can ignore any vector re-sizing. You should also assume `hash` is $O(1)$.**
    - **The complexity should be understood as scaling with the number of elements in the data-structure e.g. the complexity of inserting an element into a sorted list which contains $n$ elements already.**

### Solution:

- Table: This is a basic hash-table.
    - If $n$ is small compared to the number of rows, then the chance of multiple occupancy in any given row is small, so in the average case you have $O(1)$ insert and look-up time as you just calculate the hash ($O(1)$ time) and then either `push_back` or read and return (also $O(1)$ time).
    - If $n$ is large so that multiple occupancy becomes a factor, or if you are unlucky and lots of inputs map to the same row, then we get $O(n)$ time for inserts and look-ups, since in both cases we need to search through the list of keys in a given row, and the number of keys in a row will be $\propto n$.
- Tree: This is a binary search tree with no self-balancing behaviour.
    - In the average case you expect roughly as many elements on the left branch of a given node as on the right branch. In this case, each time we descend a layer in our tree looking for an element, we reduce the number of elements to search by half. Insertion and look-up are then $O(\log(n))$.
    - In the worst case, we insert a sorted list. Let's say it's sorted from lowest to highest: then every element we insert goes to the right at every node, and our entire tree graph is just a straight line! This means insertions and look-ups become $O(n)$. (This behaviour is corrected for by self-balancing binary search trees such as [red-black trees](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree) to give guaranteed $O(\log(n))$ time.)
- Sorted List: Just a sorted list and a binary search!
    - Insertion is exactly like the insertion step in insertion sort, so $O(n)$. In the best case, when you insert every element already sorted, insertion would be $O(1)$.
    - Look-up is a binary search, which halves the search space every step, so gives us $O(\log(n))$ time.

### Exercise 2: Sorting

In this exercise we'll explore the merge sort and insertion sort algorithms described in the notes, and create a hybrid algorithm which can switch strategies.

1. In the `sorting_examples` folder, you will find a header which defines function templates for `merge_sort` and `insertion_sort`, and a source file with `main`. These sorting algorithms do _out of place_ sorting i.e. they return a new sorted vector, leaving the original intact. This means that you can use the same input vector on the different algorithms.
2. Write some code which generates a vector of random integers of a given length.
3. Time how long it takes to sort random lists of different lengths with `merge_sort` and `insertion_sort`.
    - Which algorithm works best for the smaller lists? Which algorithm works best for the largest lists? Why?
    - Looking at your timings, do the algorithms scale in the way that you expect them to?
    - Find the length of list at which the efficiency of these algorithms crosses over i.e. the time taken for insertion sort is approximately equal to the time taken for merge sort.
    - You may want to find this programmatically using your timing code!
4. Compile your code with optimisations turned on. Does it affect which one is faster, or the cross-over point? Does it affect how they scale?
5. Write a new sorting method which appropriately switches between these methods so that lists below a given size are always sorted by the more efficient method.
### Solution:

Timing code should look something like this:
```cpp=
#include <chrono>
#include "sorting.h"
#include <iostream>
#include <random>
#include <vector>
#include <cstdio>

typedef std::chrono::high_resolution_clock myClock;

double t_seconds(std::chrono::time_point<myClock> t1, std::chrono::time_point<myClock> t2)
{
    return (t2 - t1).count() / 1e9;
}

int main()
{
    std::mt19937_64 rng;
    std::uniform_real_distribution<double> dist(0, 1);

    int N = 500;

    // generate random list
    std::vector<double> unsorted;
    for(int i = 0; i < N; i++)
    {
        unsorted.push_back(dist(rng));
    }

    auto insert_start = myClock::now();
    std::vector<double> insert_sorted = insertion_sort(unsorted);
    auto insert_end = myClock::now();

    auto merge_start = myClock::now();
    std::vector<double> merge_sorted = merge_sort(unsorted);
    auto merge_end = myClock::now();

    printf("N = %d, Insertion sort time = %fs, Merge sort time = %fs\n", N,
           t_seconds(insert_start, insert_end), t_seconds(merge_start, merge_end));

    return 0;
}
```
If you want to find the crossing over point programmatically you can write something like a binary search, but bear in mind that the cross-over will be a little different every time because there is variation in the time it takes to sort any random list!
```cpp=
#include <chrono>
#include "sorting.h"
#include <iostream>
#include <random>
#include <vector>
#include <cstdio>

typedef std::chrono::high_resolution_clock myClock;

double t_seconds(std::chrono::time_point<myClock> t1, std::chrono::time_point<myClock> t2)
{
    return (t2 - t1).count() / 1e9;
}

// Returns true if merge sort is faster than insertion sort for a random list of length N
bool MSortFaster(int N)
{
    static std::mt19937_64 rng;
    static std::uniform_real_distribution<double> dist(0, 1);

    // generate random list
    std::vector<double> unsorted;
    for(int i = 0; i < N; i++)
    {
        unsorted.push_back(dist(rng));
    }

    auto insert_start = myClock::now();
    std::vector<double> insert_sorted = insertion_sort(unsorted);
    auto insert_end = myClock::now();

    auto merge_start = myClock::now();
    std::vector<double> merge_sorted = merge_sort(unsorted);
    auto merge_end = myClock::now();

    double t_insertion = t_seconds(insert_start, insert_end);
    double t_merge = t_seconds(merge_start, merge_end);

    printf("N = %d, Insertion sort time = %fs, Merge sort time = %fs\n", N, t_insertion, t_merge);

    return t_merge < t_insertion;
}

int main()
{
    int Nmin = 1;
    int Nmax = 10000;
    int N = (Nmax - Nmin) / 2;
    while(Nmax != Nmin)
    {
        if(MSortFaster(N))
        {
            Nmax = N;
        }
        else
        {
            Nmin = N + 1;   // exclude N itself so that the search interval always shrinks
        }
        N = Nmin + (Nmax - Nmin) / 2;
        std::cout << N << " " << Nmin << " " << Nmax << std::endl;
    }
    std::cout << "Crossover approx = " << N << std::endl;
    return 0;
}
```
Results can vary, but on my machine the merge sort becomes more efficient at $N \sim 500$. Below that threshold the overheads of merge sort (lots of function calls, for example) cause it to be slower than insertion sort; above this threshold the insertion sort is held back by its poor asymptotic complexity.

Insertion sort is $O(n^2)$, so as lists become large you should see the time approximately quadruple as the list size doubles. Merge sort is $O(n\log(n))$, so as list sizes double you should see the time approximately double (quasi-linear). (Since $2n\log_2(2n) = 2n(\log_2(n) + 1)$, which is approximately double $n\log_2(n)$ when $\log_2(n) \gg 1$.)

Compiling with optimisation does not change the complexity, so merge sort is still faster for large lists.
For example output compiled with `-O3`:
```
N = 4999, Insertion sort time = 0.002823s, Merge sort time = 0.001027s
N = 2500, Insertion sort time = 0.000736s, Merge sort time = 0.000455s
```
We can see that the merge sort time approximately halves as the list size halves, as expected for quasi-linear time, and the insertion sort time goes down by a factor of four, as expected for quadratic time. **Optimisations don't change our overall complexity so it's still important to write methods that scale well!** In my case it _does_ change the crossover point: they now become equal at $N \sim 1400$, so the compiler is managing to optimise the insertion sort more effectively.

We can write a hybrid algorithm which gets the best of both worlds like this:
```cpp=
template<typename T>
std::vector<T> hybrid_sort(const std::vector<T> &v)
{
    size_t threshold = 150;
    if(v.size() < threshold)
    {
        return insertion_sort(v);
    }
    else
    {
        std::vector<T> v1, v2;
        v1.insert(v1.begin(), v.begin(), v.begin() + v.size()/2);
        v2.insert(v2.begin(), v.begin() + v.size()/2, v.end());
        // merge the two recursively sorted halves
        return merge(hybrid_sort(v1), hybrid_sort(v2));
    }
}
```
- This is a modified merge sort algorithm which switches to insertion sort for small lists.
- The threshold at which we transition can have an impact on the performance, and on my machine (compiled with `-O3`) the best value seems to be somewhere around 150, although I get a very good improvement even for values several times smaller or larger. We don't want to change over to insertion sort the moment it becomes of equal efficiency to merge sort, because then we would just end up with the same time as doing a normal merge sort! Instead, we want to cut off the merge sort before it gets all the way to the bottom and has to do a huge number of merge steps recursively.
- If you explore values for the threshold you'll see that you get a substantial improvement in performance for a range of values, so although this is machine dependent it's still robust enough to be worth doing.

The output below shows an example using a threshold of `150`, comparing insertion sort, merge sort, and the hybrid sort:
```
N = 4999, Insertion sort time = 0.002948s, Merge sort time = 0.000967s, Hybrid sort time = 0.000336s
N = 2500, Insertion sort time = 0.000715s, Merge sort time = 0.000418s, Hybrid sort time = 0.000139s
```
Note that:
- Hybrid sort is about three times as fast as merge sort for long lists!
- Hybrid sort is still quasi-linear - we haven't changed complexity because the insertion sort only applies up to a _fixed_ value, and therefore can't contribute _asymptotically_.

## Section 2: Memory Access Patterns

### Exercise 1: Matrices

In this exercise we'll look at how data layout and access patterns can affect the performance of matrix multiplication and transposition.

In this exercise matrix algorithms have been provided in two forms: one in which the indexing is two dimensional (e.g. `M[i][j]` for a `double**` or `vector<vector<double>>`) and one in which it is one-dimensional (e.g. `M[i*N + j]` for a `double*` or `vector<double>` of size `N*N`). The methods are designed to be usable with both vectors and C-style arrays, so they take the size of the matrices as a parameter `N`. You may want to experiment with the efficiency of these different data structure options, with optimisations on and off. Just make sure that you don't start leaking memory! For simplicity, all of these methods are just designed for square matrices so that both matrices have identical dimensions.
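To make the flat (one-dimensional) layouts concrete, here is a minimal sketch of how an element $(i, j)$ is addressed in each storage order (the function names here are just illustrative and are not part of `matrix_maths.h`):
```cpp
#include <vector>

// Row-major, flat storage: row i occupies the contiguous range [i*N, i*N + N),
// so element (i, j) of an N x N matrix lives at index i*N + j.
double get_rowMajor(const std::vector<double> &M, int N, int i, int j)
{
    return M[i*N + j];
}

// Column-major, flat storage: column j is contiguous instead,
// so element (i, j) lives at index j*N + i.
double get_colMajor(const std::vector<double> &M, int N, int i, int j)
{
    return M[j*N + i];
}
```
Which of these layouts the inner loop of a multiplication walks through contiguously is what determines its cache behaviour in the questions below.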
**You may find it necessary to turn optimisation on to get good results if you are using `vector<>` or `vector<vector<>>`.** If you use C-style arrays you should check your results with optimisations on and off.

1. Go into the `matrix_examples` folder and compare the two matrix multiplication algorithms in `matrix_maths.h`. `rowMajor_by_rowMajor` multiplies two row major order matrices (stored row by row), and `rowMajor_by_colMajor` multiplies a row major order matrix and a column major order matrix (stored column by column). Which do you expect to be more cache efficient? Time and compare them for different size matrices (up to N ~ 1000-2000 depending on the time it takes to execute).
2. A trivial matrix transpose function has been provided. Write a new matrix transpose that is more cache efficient by transposing your matrix in smaller sub-matrices instead.
3. Compare the performance with the trivial algorithm for matrices up to sizes of a few thousand. (You can also experiment with the effect that the size of the sub-matrices has!)
4. How does the multiplication time compare with the time to transpose the matrix? Given two row major matrices, is it more efficient to just multiply them together using the naive algorithm, or to transpose one into column major format first? Does it depend on the size of the matrix?
5. **Optional** Come back to this one if you finish with the rest with time to spare!
    - Use recursion to write a cache oblivious matrix multiplication algorithm. (Hint: first express the matrix product in terms of the products of sub-matrices.)

### Solution:

Timing code could look something like this (you can also use profiling if preferred):
```
#include "matrix_maths.h"
#include <random>
#include <vector>
#include <iostream>
#include <chrono>
#include <cstdio>

typedef std::chrono::high_resolution_clock myClock;

double t_seconds(std::chrono::time_point<myClock> t1, std::chrono::time_point<myClock> t2)
{
    return (t2 - t1).count() / 1e9;
}

int main()
{
    std::mt19937_64 rng;
    std::uniform_real_distribution<double> dist(0, 1);

    int N = 1000;

    // Flat N*N matrices in C-style arrays
    double *X = new double[N*N];
    double *Y = new double[N*N];
    double *Z = new double[N*N];
    for(int i = 0; i < N*N; i++)
    {
        X[i] = dist(rng);
        Y[i] = dist(rng);
        Z[i] = 0;
    }

    auto rowRow_start = myClock::now();
    rowMajor_by_rowMajor_flat(X, Y, Z, N);
    auto rowRow_end = myClock::now();

    auto rowCol_start = myClock::now();
    rowMajor_by_colMajor_flat(X, Y, Z, N);
    auto rowCol_end = myClock::now();

    printf("N = %d, Row major by Row major time = %fs, Row major by Column Major time = %fs\n", N,
           t_seconds(rowRow_start, rowRow_end), t_seconds(rowCol_start, rowCol_end));

    delete[] X;
    delete[] Y;
    delete[] Z;

    return 0;
}
```
- I've used C-style arrays in a flat layout to simplify the memory layout, reduce any overheads which might interfere with what we're trying to measure, and keep things as efficient as possible. For large matrices a time difference should be noticeable, e.g. (on my machine) for N = 1000:
    - Row major by Row major time = 5.497121s, Row major by Column Major time = 3.821844s with optimisations off
    - N = 1000, Row major by Row major time = 0.516904s, Row major by Column Major time = 0.270431s with `-O3`
- The row major by column major version is the more cache efficient one: when the second matrix is stored column by column, the inner loop of the multiplication walks through both matrices contiguously in memory, whereas with two row major matrices the accesses to the second matrix jump by a whole row ($N$ elements) at every step of the inner loop, causing many more cache misses.
- If I use vectors with a flat arrangement I get roughly twice the time with optimisations on.

A cache friendly matrix transpose can be recursive or block-wise iterative.
A simple blockwise version would look something like this:
```
template<typename T>
void cached_transpose(const T &A, T &B, int N, int block_size)
{
    // number of blocks needed to cover the matrix in each direction
    // (the +1 covers partial blocks at the edges; out-of-range indices are skipped below)
    int nBlocks = N/block_size + 1;
    for(int blk_i = 0; blk_i < nBlocks; blk_i++)
    {
        for(int blk_j = 0; blk_j < nBlocks; blk_j++)
        {
            // transpose one block_size x block_size sub-matrix at a time
            for(int i = 0; i < block_size; i++)
            {
                for(int j = 0; j < block_size; j++)
                {
                    int idx_i = blk_i * block_size + i;
                    int idx_j = blk_j * block_size + j;
                    if(idx_i >= N || idx_j >= N)
                    {
                        continue;
                    }
                    B[idx_i*N + idx_j] = A[idx_j*N + idx_i];
                }
            }
        }
    }
}
```
In this case it would be worth experimenting with the effect that the block size has on the performance. The block size should not matter too much unless it becomes so large that the sub-matrices no longer fit into the cache, or so small (e.g. 1) that we lose the benefit of blocking.

Transposition is more efficient than multiplication ($O(n^2)$ vs $O(n^3)$), so you should be able to go to larger sizes if necessary to see a difference. Consider going up to 5000 - 10,000, which should still run quickly enough. On my machine:
- N = 10000, trivial transpose time = 1.286006s, cached transpose time = 0.628225s with optimisations off.
- N = 10000, trivial transpose time = 0.977916s, cached transpose time = 0.220377s with `-O3`

Given how quickly the time taken to multiply matrices rises, the time difference between multiplying a row-major matrix by a row-major matrix and multiplying it by a column-major matrix quickly exceeds the time needed to transpose the second row-major matrix into column-major form. So for large matrices it is more efficient to transpose the second matrix first and then multiply them together, since this reduces the number of cache misses.

A recursive multiplication algorithm can split each matrix into four sub-matrices and express the multiplication as 8 sub-matrix multiplications:
```
|A B|   |A' B'|
|C D| x |C' D'| = M
```
Elements of the product matrix can be calculated block-wise, e.g. an element in the top-left block of $M$ is $M_{00} = \sum_k (A_{0k} A'_{k0} + B_{0k} C'_{k0})$; in other words, the top-left block of $M$ is $A A' + B C'$. This allows us to express the matrix product in terms of the products of sub-matrices, which can themselves be computed recursively. Below a certain matrix size it becomes more efficient to transition to normal matrix multiplication due to the overheads of the additional processing and recursive calls.

(This reflects how typical matrix libraries work. Trivial matrix multiplication is used for small matrices, whereas larger matrices will use a method such as Strassen multiplication, which has better asymptotic performance but high overheads for small $N$. Strassen is also recursive, so when the sub-matrix size becomes small enough, libraries will switch to trivial multiplication to cut off the recursion and save time.)

### Exercise 2: Array of Structs vs Struct of Arrays

This exercise explores two different ways of storing data about a large number of objects. In this case, we'll use particles which have mass, position, and velocity.
- Mass is a scalar, i.e. a single `double`
- Position and velocity are 3D vectors, i.e. three `double` values for $(x, y, z)$ or $(v_x, v_y, v_z)$.

A large set of these particles can be stored in two ways:
- **Array of Structs**: We can define a struct or class which represents a single particle and has fields for mass, position, and velocity. The whole set is then represented as an array of these particle structs.
- **Struct of Arrays**: We can define a struct or class which represents the entire collection of particles, which has fields which are arrays.
Each array then stores a value (for example mass) for every particle, and an individual particle's properties can be accessed by looking up the same index in each field.

Sequential elements of arrays/vectors are contiguous in (virtual) memory. For structs or classes, fields are usually arranged in contiguous (virtual) memory in the order they are declared, one after the other. (In some cases the compiler may also pad the memory for alignment reasons.)

1. Go into the `array_examples` folder and write some code in the `.cpp` file to calculate the total mass of the particles, first using an array of structs (`pm_array`), and then using the struct of arrays approach (`pmc`).
2. Write a function to calculate the centre of mass of the particles, again using both approaches.
    - The centre of mass is a 3D vector like position or velocity. Its components are found by a weighted sum: $x_\text{com} = \frac{1}{M_\text{tot}}\sum_i m_i x_i$, which you can do for each of $x$, $y$, and $z$.
3. Write code to time these processes for both approaches.
4. What will happen if we make our particle description more complicated? Add fields to your structs so that every particle also has angular momentum (an additional vector $(L_x, L_y, L_z)$).
    - For the `PointMass` struct you should add a new `vec` field.
    - For the `PointMassCollective` class you should add three new vectors, one for each dimension of angular momentum.
    - How does this impact the timing for your array of structs?
    - How do these extra vectors impact the timing for your struct of arrays?
    - Why?

The way that we arrange memory can also impact the efficacy of compiler optimisation. Many optimisations are built around "vectorised" functions i.e. code where we take a simple arithmetic operation which applies to a group of variables and apply it to elements of vectors by looping over arrays. For example the following code:
```cpp=
for(int i = 0; i < N; i++)
{
    c[i] = a[i] + b[i];
}
```
applies the simple sum to every element of the arrays `a`, `b`, and `c`. This kind of code can be well optimised.

5. Experiment with compiling your code with different levels of optimisation: `-O1`, `-O2`, `-O3`. How is the speed of each approach affected? Does one speed up more than the other?

### Solution:

You can calculate and time the mass and centre of mass calculations like this (or similar!):
```cpp=
// pm_array is the array of PointMass structs (array of structs) and pmc is the
// PointMassCollective object (struct of arrays) from array_examples; timerClock
// and getTime are a chrono clock and timing helper along the lines of
// myClock / t_seconds above.
double M_tot;

// Calculate total Mass
M_tot = 0;
auto t1 = timerClock.now();
for(long i = 0; i < N; i++)
{
    M_tot += pm_array[i].m;
}
auto t2 = timerClock.now();
cout << "M_tot = " << M_tot << " in " << getTime(t1, t2) << "s" << endl;

M_tot = 0;
auto t3 = timerClock.now();
for(long i = 0; i < N; i++)
{
    M_tot += pmc.m[i];
}
auto t4 = timerClock.now();
cout << "M_tot = " << M_tot << " in " << getTime(t3, t4) << "s" << endl;

cout << (getTime(t1,t2) / getTime(t3,t4) - 1) * 100 << "% speed up on mass!" << endl;

// Calculate C.O.M.
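// Centre of mass: x_com = (1/M_tot) * sum_i m_i * x_i (and likewise for y and z),
// computed first with the array of structs (pm_array) and then with the
// struct of arrays (pmc), timing each approach separately.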
double mx{0}, my{0}, mz{0};
vec r_com;

M_tot = 0;
auto t_com_1 = timerClock.now();
for(long i = 0; i < N; i++)
{
    double m = pm_array[i].m;
    M_tot += m;
    mx += m * pm_array[i].r.x;
    my += m * pm_array[i].r.y;
    mz += m * pm_array[i].r.z;
}
auto t_com_2 = timerClock.now();
r_com.x = mx / M_tot;
r_com.y = my / M_tot;
r_com.z = mz / M_tot;
cout << r_com.x << ", " << r_com.y << ", " << r_com.z << " in " << getTime(t_com_1, t_com_2) << endl;

M_tot = 0;
mx = 0;
my = 0;
mz = 0;
auto t_com_3 = timerClock.now();
for(long i = 0; i < N; i++)
{
    double m = pmc.m[i];
    M_tot += m;
    mx += m * pmc.x[i];
    my += m * pmc.y[i];
    mz += m * pmc.z[i];
}
auto t_com_4 = timerClock.now();
r_com.x = mx / M_tot;
r_com.y = my / M_tot;
r_com.z = mz / M_tot;
cout << r_com.x << ", " << r_com.y << ", " << r_com.z << " in " << getTime(t_com_3, t_com_4) << endl;

cout << (getTime(t_com_1,t_com_2) / getTime(t_com_3,t_com_4) - 1) * 100 << "% speed up on com!" << endl;
```
- With optimisations off I get a clear advantage on summing the masses using the struct of arrays approach: usually around 30-40%. On the centre of mass calculation it's less impactful, usually not more than a 10% speed-up.
- When we use an array of structs which contains extra data that we're not using (in the total mass calculation we don't use position or velocity; in the centre of mass calculation we don't use velocity) we can end up caching things that we don't actually use because they're next to values that we do use, and we end up jumping around more in memory to get past the parts of the object that we're not interested in. Using a struct of arrays means that we only access the parts of the data that we care about, and our accesses to each of these arrays are sequential.
- With optimisations on, I get an even clearer advantage: usually over 50% speed-up on the total mass and over 100% speed-up on the centre of mass calculation.
- When we add more data to our fields, it makes our array of structs approach even worse, because there's now more data getting in the way than before! It doesn't make our struct of arrays approach any worse though, so this is particularly important if you have a large number of data fields.
- This approach applies to many problems in physics, especially when it comes to simulations of particles, or simulations on a grid (such as fluid dynamics) where functions are defined at a large number of points.

## Section 3: Optimisation & Floating Point Arithmetic

### Exercise 1: Summation Methods

1. Look at the `kahan_sum` and `trivial_sum` methods provided in the `sum_examples` folder, which add a list of values together. What does the Kahan sum do differently? What is the purpose of the variable `c`?
2. Write a program which times and checks the result of the summation algorithms with at least a million `float` elements with value `1e-6`. (Use `float` because it will be easier to detect arithmetic errors compared with `double`.) For reliable timing you may need to increase the number of elements substantially!
3. Now try compiling with flags `-O1`, `-O2`, `-O3`, with and without `-ffast-math`. What happens to the timing? What about the accuracy?
4. What has `fast-math` done to the Kahan summation in order to optimise it?

### Solution:

Kahan summation is a **compensated sum** method, which attempts to correct floating point rounding errors when summing lots of small values together. (There are even better ways to do this, but this one is probably the simplest!) The value `c` represents this rounding error.
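For reference, the provided `kahan_sum` will follow something like the standard compensated summation loop sketched below (this is a generic sketch with illustrative names; the version in `sum_examples` takes the length as a separate argument and may differ in detail):
```cpp
#include <vector>

// Standard Kahan (compensated) summation: c carries the rounding error from
// each addition forward so that it can be corrected at the next step.
float kahan_sum_sketch(const std::vector<float> &values)
{
    float sum = 0.0f;
    float c = 0.0f;           // running compensation for lost low-order bits
    for (float x : values)
    {
        float y = x - c;      // remove the error carried over from the previous step
        float t = sum + y;    // big + small: the low-order bits of y can be rounded away here
        c = (t - sum) - y;    // (t - sum) is what was really added; subtracting y isolates the rounding error
        sum = t;
    }
    return sum;
}
```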
Assume that we have an inaccurate summation operator which we'll call $\oplus$, and an inaccurate subtraction we'll call $\ominus$. Then we could say that for a large sum $S$ and small value $x$:

$S \oplus x = S' = (S+x) + \delta$

So our sum has picked up this extra $\delta$, which needs to be subtracted from the next value that we add on. In order to find it we can use our subtraction: $S' \ominus S$ gives the value that actually got added to $S$ once rounding is taken into account. We can use $\ominus$ for this because if $S \gg x$ then $S$ and $S'$ will likely have the same exponent, in which case their difference can be calculated exactly. So

$S' \ominus S \approx x + \delta$, and $(S' \ominus S) \ominus x \approx \delta$.

In this last step we take the difference between the value that was actually added ($S' \ominus S$) and the value we wanted to add ($x$) to recover $\delta$. Keeping track of these rounding errors allows us to sum floating point numbers _much_ more accurately, and these kinds of techniques are vital for high precision numerical methods in general.

Notice though that if we were to approximate our floating point operations ($\oplus$ and $\ominus$) as being equivalent to real arithmetic ($+$ and $-$), then we would always have $\delta = 0$.

We can see the difference in the sums like this:
```
#include "Kahan.h"
#include <iostream>
#include <vector>
#include <cstdio>

int main()
{
    long N = 1000000;
    float val = 1e-6;

    std::vector<float> v(N);
    for(auto &x : v)
    {
        x = val;
    }

    float triv = trivial_sum(v, N);
    float k = Kahan_sum(v, N);

    printf("trivial sum = %f, kahan sum = %f\n", triv, k);

    return 0;
}
```
which would return `trivial sum = 1.009039, kahan sum = 1.000000`.

If we use the flags `-O1 -ffast-math` on the command line compiler, or edit the `CMakeLists.txt` to:
```
cmake_minimum_required(VERSION 3.0)

project(optimisation_exercises)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include_directories(include)

add_executable(sum source/sums.cpp)
target_compile_options(sum PRIVATE -O1 -ffast-math)
```
we should see that they agree: `trivial sum = 1.009039, kahan sum = 1.009039`. The timing of these two algorithms will now also agree, indicating that they are probably being executed in the same way despite the differences in code!

Because fast-math allows floating point operations to be freely re-associated, the compiler can identify the variable `c` in our code as being algebraically zero under real arithmetic (see our discussion of $\delta$ above), and it optimises away what it sees as a pointless calculation. This leads to the Kahan sum being turned into a normal summation, and destroys our attempts to capture the rounding errors associated with floating point arithmetic. Many numerical algorithms use similar methods to keep track of errors and compensate for them, so **be very careful not to use fast-math for numerical libraries.**

### Exercise 2: SIMD Registers

1. Write a function to add two vectors together, and try compiling it using `-fopt-info-vec` with either `-ftree-vectorize` (at any optimisation level above 0) or `-O3` to turn on vectorisation and get information back. Look at the size of the vectors in bytes - how many floats or doubles can fit into a vectorised block on your machine?
    - If your CPU does not have SIMD registers this won't work, so don't worry if this does not happen!
    - If you want to use 256-bit (32-byte) registers on an x86 machine you'll need to enable the compiler option `-mavx`.
SIMD registers are more efficient if your code is **aligned** in memory, which means that the memory address is aligned with some regular spacing. For example, 16-byte alignment will mean that the memory address is divisible by 16, and 32-byte alignment will mean that the memory address is divisible by 32. A single precision `float` is 4 bytes, so 4 floats placed next to each other are 16 bytes. If the floats are contiguous in memory, and the first float is aligned to a 16-byte boundary, then all 4 floats can be loaded into a 16 byte register with one efficient operation. If your floats are not aligned with the boundary, then the operation will be more complex and slower.

To enforce alignment we can use a couple of approaches. (In C++ `alignas` is a keyword, and `std::align_val_t` comes from the `<new>` header.) For a variable in automatic storage (a stack variable) we can use `alignas`:
```cpp=
// stack declared array of 4 floats, aligned to a 16 byte boundary
alignas(16) float f16[4];

// 24 floats aligned to a 32 byte boundary.
alignas(32) float f32[24];
// f32[0], f32[8], and f32[16] are aligned with the 32 byte boundary
// 8 floats (e.g. f32[0] to f32[7]) can be loaded into a 256 bit register
```
For a heap allocated variable you can use `std::align_val_t` with `new`:
```cpp=
float *x = new (std::align_val_t(16)) float[4];
double *y = new (std::align_val_t(32)) double[4];
```

2. Modify your vector addition example to ensure that your memory buffers are properly aligned.
    - Use 16-byte alignment for 128-bit / 16-byte SIMD registers.
    - Use 32-byte alignment for 256-bit / 32-byte SIMD registers.
3. **OPTIONAL** If you have time and are curious, you may want to experiment with using compiler intrinsics to hard code SIMD behaviour without having to rely on the compiler to optimise it for you. This can give you more control **but it is not platform independent**. There are different intrinsics for x86 and ARM processors.

Let's consider how an iteration of our vectorised addition works at a processor level:

1. You load 4 (or 8) floats into a 16 (or 32) byte register from your first array.
2. You load 4 (or 8) floats into a 16 (or 32) byte register from your second array.
3. You perform a vectorised addition operation.
4. You place the resulting 4 (or 8) float buffer into memory.

For x86 we need the following commands / includes:
```cpp
// 128 bit definitions
#include <xmmintrin.h>
// 256 bit definitions
#include <immintrin.h>

// _mm_load_ps takes a pointer to the first float of a pack of four floats
__m128 loaded_floats = _mm_load_ps(address);
// vectorised addition, takes two __m128 arguments
__m128 result = _mm_add_ps(loaded_x, loaded_y);
// store the value at a memory address, takes a pointer and an __m128
_mm_store_ps(address, result);

// 256 bit intrinsics
// load a buffer of 8 floats into a __m256
__m256 loaded_floats_256 = _mm256_load_ps(address);
// vectorised addition of two __m256 values
__m256 result_256 = _mm256_add_ps(loaded_x_256, loaded_y_256);
// store
_mm256_store_ps(address, result_256);
```
For ARM we need the following commands / includes:
```cpp
#include <arm_neon.h>

// Load data: takes a pointer to the first of the four floats
// float32x4_t is for 4 floats i.e. 128 bit
float32x4_t loaded_floats = vld1q_f32(address);
// Vectorised add: takes 2 float32x4_t type arguments
float32x4_t result = vaddq_f32(loaded_x, loaded_y);
// Store in memory: takes a pointer and a float32x4_t argument and stores the result at that address
vst1q_f32(address, result);
```
Try implementing your loop using intrinsics and compile **with optimisations turned off**. Check that your code is still vectorised.
- You should check your code is working correctly by comparing the result with a simple non-vectorised version.
- You can also time it to check that it is indeed speeding up and vectorising properly, or you can inspect the assembly code in the debugger or by using `objdump`.

Because this code is not platform independent, we should generally avoid this approach unless we are sure that we only want to develop for a specific family of CPUs _or_ we can use CMake to detect the architecture and do system-specific compilation, compiling ARM code for ARM CPUs and x86 code for x86 CPUs. This is beyond the scope of this course, but is key to the efficiency of many popular libraries like Eigen!

### Solution:

Your vector addition function can be written (for example) like this:
```cpp
template <typename T>
void add_vec(T &A, T &B, T &C, int N)
{
    for(int i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
You can then use the line `target_compile_options(sum PRIVATE -O3 -fopt-info-vec)` in your `CMakeLists.txt` to turn on vectorisation and output vectorisation information. When you re-build, you should get a line like this: `optimized: loop vectorized using 16 byte vectors`, or likewise for 32 byte vectors. (Remember, you do have to call your function in your `main`, otherwise it will be ignored!) In this case the machine can fit four `float` or two `double` values into a single 16 byte SIMD register.

To properly align our data, let's look at the main:
```cpp=
#include <iostream>
#include <random>
#include <new>       // for std::align_val_t
#include <cstdint>   // for uintptr_t

using namespace std;

template <typename T>
void add_vec(T &A, T &B, T &C, int N)
{
    for(int i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}

int main()
{
    const int N = 128;

    // example of stack allocation
    //alignas(16) float x[N];
    //alignas(16) float y[N];
    //alignas(16) float z[N];

    // heap allocation
    float *x = new (std::align_val_t(16)) float[N];
    float *y = new (std::align_val_t(16)) float[N];
    float *z = new (std::align_val_t(16)) float[N];

    // Check alignment of addresses; each address modulo 16 should be 0
    std::cout << uintptr_t(x) % 16 << " " << uintptr_t(y) % 16 << " " << uintptr_t(z) % 16 << std::endl;

    mt19937_64 rng(42);
    uniform_real_distribution<float> dist;

    for(int i = 0; i < N; i++)
    {
        x[i] = dist(rng);
        y[i] = dist(rng);
    }

    add_vec(x, y, z, N);

    // print out some values to avoid the loop being optimised out
    for(int i = 0; i < 16; i++)
    {
        cout << x[i] << " + " << y[i] << " = " << z[i] << endl;
    }

    // Don't forget to clear up memory, or better yet use RAII to wrap the
    // allocations and put the delete[] in the destructor!!
    delete[] x;
    delete[] y;
    delete[] z;

    return 0;
}
```
- If you have an x86 processor you can also use `-mavx` in your compiler options and use 32 byte alignment instead of 16.

To explicitly code with intrinsics, you can write something like this:
```cpp=
#include <iostream>
#include <random>
#include <new>          // for std::align_val_t
#include <cstdint>      // for uintptr_t
#include <xmmintrin.h>  // for 128 bit x86
//#include <immintrin.h> // for 256 bit x86
//#include <arm_neon.h>  // for ARM

using namespace std;

// Removed templating since we need raw float pointers now
void add_vec(float *A, float *B, float *C, int N)
{
    // Sum the arrays using intrinsics
    // Note i += 4 as we are adding 4 floats at a time (this assumes N is a multiple of 4)
    for (int i = 0; i < N; i += 4)
    {
        // remember A points to the first element in the array;
        // A[0], A[4], A[8] etc. are all memory aligned
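        // _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses
        // (use _mm_loadu_ps / _mm_storeu_ps for unaligned data), which is why
        // the buffers passed in (x, y, z in main below) are allocated with std::align_val_t(16)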
        __m128 loaded_A = _mm_load_ps(A + i);
        __m128 loaded_B = _mm_load_ps(B + i);
        __m128 result = _mm_add_ps(loaded_A, loaded_B);
        _mm_store_ps(C + i, result);

        // ARM version
        // float32x4_t loaded_x = vld1q_f32(A + i);
        // float32x4_t loaded_y = vld1q_f32(B + i);
        // float32x4_t result = vaddq_f32(loaded_x, loaded_y);
        // vst1q_f32(C + i, result);
    }
}

int main()
{
    const int N = 128;

    //alignas(16) float x[N];
    //alignas(16) float y[N];
    //alignas(16) float z[N];

    float *x = new (std::align_val_t(16)) float[N];
    float *y = new (std::align_val_t(16)) float[N];
    float *z = new (std::align_val_t(16)) float[N];
    float *q = new float[N];

    std::cout << uintptr_t(x) % 16 << " " << uintptr_t(y) % 16 << " " << uintptr_t(z) % 16 << std::endl;

    mt19937_64 rng(42);
    uniform_real_distribution<float> dist;

    for(int i = 0; i < N; i++)
    {
        x[i] = dist(rng);
        y[i] = dist(rng);
    }

    add_vec(x, y, z, N);

    // Check your result with a non vectorised loop
    for(int i = 0; i < N; i++)
    {
        q[i] = x[i] + y[i];
    }

    for(int i = 0; i < 16; i++)
    {
        cout << x[i] << " + " << y[i] << " = " << z[i] << "; " << q[i] << endl;
    }

    delete[] x;
    delete[] y;
    delete[] z;
    delete[] q;

    return 0;
}
```
- If you are using ARM you will need to comment out the x86 include and uncomment the ARM include, and do the same for the x86/ARM loop code inside `add_vec`.
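One small quality-of-life improvement (a sketch, not part of the exercises): rather than commenting code in and out by hand, you can let the preprocessor pick the right intrinsics using the compiler's predefined architecture macros. `__ARM_NEON` and `__SSE__` are predefined by GCC/Clang when the corresponding instruction sets are available:
```cpp
// Select the SIMD path at compile time using predefined architecture macros.
#if defined(__ARM_NEON)
    #include <arm_neon.h>
#elif defined(__SSE__)
    #include <xmmintrin.h>
#endif

void add_vec(float *A, float *B, float *C, int N)
{
    // still assumes N is a multiple of 4
    for (int i = 0; i < N; i += 4)
    {
#if defined(__ARM_NEON)
        vst1q_f32(C + i, vaddq_f32(vld1q_f32(A + i), vld1q_f32(B + i)));
#elif defined(__SSE__)
        _mm_store_ps(C + i, _mm_add_ps(_mm_load_ps(A + i), _mm_load_ps(B + i)));
#else
        // plain fallback if neither instruction set is available
        C[i]   = A[i]   + B[i];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
#endif
    }
}
```
This is still a manual approach; proper per-architecture build configuration (e.g. via CMake, as mentioned above) is beyond the scope of this course.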