# PHAS0100 - Week 7 (3rd March 2023)
## Single Core Performance Programming
This week we'll explore complexity, memory structure, and floating point computation to learn more about building fast and accurate code for single processors.
## Timetable for this afternoon
| Start | End | Topic |
| -------- | -------- | -------- |
| 14:00 | 14:50 | Section 1: Complexity |
| 14:50 | 15:00 | Break |
| 15:00 | 15:50 | Section 2: Efficient Memory |
| 15:50 | 16:00 | Break |
| 16:00 | 16:30 | Section 3: Floating Point Computation |
| 16:30 | 16:50 | Discussion: Optimisation |
### Clone the Repository
You should [clone the repository here](https://github.com/UCL-PHAS0100-22-23/week_7_cpp_exercises). As usual, the top level folder will contain the `.devcontainer`, and each of the exercises will exist in a sub-folder with its own `CMakeLists.txt`.
### Adding Optimisation Flags
In these exercises we will explore the use of compiler optimisation. These need to be turned on! You can do this by making changes to the top level `CMakeLists.txt` in each exercise folder.
For example, to use the `-O1` flag for `MyExecutable` we use the line:
```
target_compile_options(MyExecutable PRIVATE -O1)
```
You can list multiple flags to turn on more than one.
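For example, to combine an optimisation level with another flag used later in these exercises:
```
target_compile_options(MyExecutable PRIVATE -O2 -ffast-math)
```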
If you change the flags or add/remove this line in the `CMakeLists.txt` you can just use `cmake --build build` to rebuild your executable and see what effect it has had.
### Timing Functions Using `std::chrono`
You'll need to make use of [`<chrono>`](https://cplusplus.com/reference/chrono/) from the standard library in order to time your function calls. We've provided a bit of basic code to declare a clock and determine the time difference in seconds between two points in time as a `double`:
```cpp=
#include <chrono>

typedef std::chrono::high_resolution_clock myClock;

double t_seconds(std::chrono::time_point<myClock> t1, std::chrono::time_point<myClock> t2)
{
    // duration<double> converts the clock's native tick period to seconds,
    // whatever resolution high_resolution_clock has on your platform
    return std::chrono::duration<double>(t2 - t1).count();
}
```
The `typedef` isn't necessary but it makes code a lot less cluttered given all the namespaces and the long type names! You can get a point in time in your code using:
```cpp=
auto time = myClock::now();
```
To time the execution of a piece of code you should get the time point before and after the code that you want to time, and then you can calculate the difference using `t_seconds`.
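For example, a minimal sketch using `myClock` and `t_seconds` from above (with `<iostream>` included for the printout):
```cpp=
auto t1 = myClock::now();
// ... the code you want to time goes here ...
auto t2 = myClock::now();
std::cout << "Elapsed: " << t_seconds(t1, t2) << " s" << std::endl;
```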
## Section 1: Complexity
### Exercise 1: Algorithm Analysis
In this exercise we'll look at some methods of storing and looking up data as key-value pairs (such as in C++'s `std::map` or Python's `dict`).
1. Go into the `map_examples` folder, and read the code for the three data structures:
- `SortedList.h` describes a simple list data-structure where the elements are sorted by key.
- `Table.h` describes a table data-structure with a fixed number of rows. Key-value pairs are placed into a row based on a "hash function" called `hash`, which you may assume is $O(1)$ time. An ideal hash function converts data into an integer whilst minimising "collisions", i.e. different data should rarely map to the same number. (A minimal sketch of this key-to-row mapping appears at the end of this exercise.)
- `Tree.h` describes a tree data-structure i.e. a graph where each node has (up to) two child nodes.
2. For each storage solution, determine based on the code provided:
- What is the average case complexity of `insert` / `lookup`?
- What is the worst case complexity of `insert` / `lookup`?
- What conditions produce the worst case performance?
- For the `Tree` consider how the ordering of keys added affects the tree structure.
- For the `Table` consider how keys are mapped into rows, and consider what happens in the case where there are **many more rows than key-value pairs (take this as the average case)**, and the case where there are **more key-value pairs than rows (take this as the worst case)**.
- **You may assume that `push_back` is $O(1)$, and can ignore any vector re-sizing. You should also assume `hash` is $O(1)$.**
- **The complexity should be understood as scaling with the number of elements in the data-structure e.g. the complexity of inserting an element into a sorted list which contains $n$ elements already.**
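To make the key-to-row mapping concrete, here is a minimal sketch of how a hash function might assign a key to a row; the names and details are illustrative and the real `Table.h` may differ:
```cpp=
#include <functional>
#include <string>

// Hypothetical sketch: map a key to one of n_rows rows.
std::size_t row_for_key(const std::string& key, std::size_t n_rows)
{
    // std::hash converts the key to an integer (assumed O(1) here);
    // the modulo folds that integer into a valid row index.
    // Two different keys can land in the same row: a "collision".
    return std::hash<std::string>{}(key) % n_rows;
}
```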
### Exercise 2: Sorting
In this exercise we'll explore the merge sort and insertion sort algorithms described in the notes, and create a hybrid algorithm which can switch strategies.
1. In the `sorting_examples` folder, you will find a header which defines function templates for `merge_sort` and `insertion_sort`, and a source file with `main`. These sorting algorithms do _out of place_ sorting, i.e. they return a new sorted vector, leaving the original intact. This means that you can use the same input vector with the different algorithms.
2. Write some code which generates a vector of random integers of a given length (a minimal sketch appears after this list).
3. Time how long it takes to sort random lists of different lengths with `merge_sort` and `insertion_sort`.
- Which algorithm works best for the smaller lists? Which algorithm works best for the largest lists? Why?
- Looking at your timings, do the algorithms scale in the way that you expect them to?
- Find the list length at which the efficiency of these algorithms crosses over, i.e. where the time taken for insertion sort is approximately equal to the time taken for merge sort.
- You may want to find this programmatically using your timing code!
4. Compile your code with optimisations turned on. Does it affect which one is faster, or the cross-over point? Does it affect how they scale?
5. Write a new sorting method which appropriately switches between these methods so that lists below a given size are always sorted by the more efficient method.
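As referenced in step 2, a minimal sketch of random vector generation using `<random>`; the engine, seed, and value range are illustrative choices:
```cpp=
#include <cstddef>
#include <random>
#include <vector>

// Generate `length` random ints in [0, 9999].
std::vector<int> random_vector(std::size_t length)
{
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 9999);
    std::vector<int> v(length);
    for (auto& x : v) x = dist(rng);
    return v;
}
```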
## Section 2: Memory Access Patterns
### Exercise 1: Matrices
In this exercise we'll look at how data layout and access patterns can affect the performance of matrix multiplication and transposition.
In this exercise matrix algorithms have been provided in two forms: one in which the indexing is two dimensional (e.g. `M[i][j]` for a `double**` or `vector<vector<double>>`) and one in which it is one-dimensional (e.g. `M[i*N + j]` for a `double*` or `vector<double>` of size `N*N`). The methods are designed to be usable with both vectors and C-style arrays, so they take the size of the matrices as a parameter `N`. You may want to experiment with the efficiency of these different options of data structures, with optimisations on and off. Just make sure that you don't start leaking memory!
For simplicity, all of these methods are just designed for square matrices so that both matrices have identical dimensions.
**You may find it necessary to turn optimisation on to get good results if you are using `vector<>` or `vector<vector<>>`.** If you use C-style arrays you should check your results with optimisations on and off.
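As a reminder of the one-dimensional convention, a minimal sketch of row-major indexing for an $N \times N$ matrix:
```cpp=
#include <cstddef>
#include <vector>

// Row-major: element (i, j) lives at index i*N + j, so consecutive
// values of j are adjacent in memory.
double get(const std::vector<double>& M, std::size_t N, std::size_t i, std::size_t j)
{
    return M[i * N + j];
}
// A column-major matrix would instead store element (i, j) at index j*N + i.
```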
1. Go into the `matrix_examples` folder and compare the two matrix multiplication algorithms in `matrix_maths.h`. `rowMajor_by_rowMajor` multiplies two row major order matrices (stored row by row), and `rowMajor_by_colMajor` multiplies a row major order matrix and a column major order matrix (stored column by column). Which do you expect to be more cache efficient? Time and compare them for different size matrices (up to N ~ 1000-2000, depending on the time it takes to execute).
2. A trivial matrix transpose function has been provided. Write a new matrix transpose that is more cache efficient by transposing your matrix in smaller sub-matrices instead.
3. Compare the performance with the trivial algorithm for matrices up to the size of a few thousand elements. (You can also experiment with the effect that the size of the sub-matrices has!)
4. Given two row major matrices, is it more efficient to just multiply them together using the naive algorithm, or to transpose one into column major format first? Does it depend on the size of the matrix?
- Compare the timing of your functions.
- Consider the time complexity of transposition and multiplication.
5. **Optional** Come back to this one if you finish the rest with time to spare!
- Use recursion to write a cache oblivious matrix multiplication algorithm. (Hint: first express the matrix product in terms of the products of sub-matrices.)
### Exercise 2: Array of Structs vs Struct of Arrays
This exercise explores two different ways of storing data about a large number of objects. In this case, we'll use particles which have mass, position, and velocity.
- Mass is a scalar, i.e. a single `double`
- Position and velocity are 3D vectors, i.e. three `double` values for $(x, y, z)$ or $(v_x, v_y, v_z)$.
A large set of these particles can be stored in two ways:
- **Array of Structs**: We can define a struct or class which represents a single particle and has fields for mass, position, and velocity. The whole set is then represented as an array of particle structs.
- **Struct of Arrays**: We can define a struct or class which represents the entire collection of particles, which has fields which are arrays. Each array then stores a value (for example mass) for every particle, and an individual particle's properties can be accessed by looking up the same index in each field.
Sequential elements of arrays/vectors are contiguous in (virtual) memory.
For structs or classes, fields are usually arranged in contiguous (virtual) memory in the order they are declared, one after the other. (In some cases the compiler may also pad the memory for alignment reasons.)
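As an illustration, minimal sketches of the two layouts; the names here are hypothetical, and the `PointMass` struct and `PointMassCollective` class in the exercise code follow the same two patterns but may differ in detail:
```cpp=
#include <vector>

// Array of Structs: one struct per particle.
// All the fields of a single particle are adjacent in memory.
struct Particle
{
    double mass;
    double x, y, z;    // position
    double vx, vy, vz; // velocity
};
std::vector<Particle> particles;

// Struct of Arrays: one array per field.
// All the masses are adjacent in memory, which suits a sum over masses.
struct ParticleCollection
{
    std::vector<double> mass;
    std::vector<double> x, y, z;
    std::vector<double> vx, vy, vz;
};
// Particle i is (mass[i], x[i], y[i], z[i], vx[i], vy[i], vz[i]).
```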
1. To calculate the total mass of a set of particles we need to sum the mass for every particle. Given the memory layout of the array of structs and the struct of arrays, discuss which you would expect to be more memory efficient.
2. Go into the `array_examples` folder and write some code in the `.cpp` file to calculate the total mass of the particles, first using an array of structs (`pm_array`), and then using the struct of arrays approach (`pmc`).
3. Write a function to calculate the centre of mass of the particles, again using both approaches.
- The centre of mass is a 3D vector like position or velocity. Its components are found by a weighted sum: $x_\text{com} = \frac{1}{M_\text{tot}}\sum_i m_i x_i$, which you can do for each of $x$, $y$, and $z$.
4. Write code to time these processes for both approaches.
5. What will happen if we make our particle description more complicated? Add fields to your structs so that every particle also has angular momentum (an additional vector $(L_x, L_y, L_z)$).
- For the `PointMass` struct you should add a new `vec` field.
- For the `PointMassCollective` class you should add three new vectors, one for each dimension of angular momentum.
- How does this impact the timing for your array of structs?
- How do these extra vectors impact the timing for your struct of arrays?
- Why?
The way that we arrange memory can also affect how well the compiler can optimise our code. Many optimisations are built around "vectorised" loops, i.e. code where a simple arithmetic operation is applied to a group of values by looping over arrays. For example the following code:
```cpp=
for(int i = 0; i < N; i++)
{
    c[i] = a[i] + b[i];
}
```
applies the simple sum code to every element of arrays `a`, `b`, and `c`. This kind of code can be well optimised.
6. Experiment with compiling your code with different levels of optimisation: `-O1`, `-O2`, `-O3`. How is the speed of each approach affected? Does one speed up more than the other?
## Section 3: Optimisation and Floating Point Arithmetic
### Exercise 1: Summation Methods
1. Look at the `kahan_sum` and `trivial_sum` methods provided in the `sum_examples` folder, which add a list of values together. What does the Kahan sum do differently? What is the purpose of the variable `c`? (A generic sketch of the algorithm appears after this list.)
2. Write a program which times and checks the result of the summation algorithms with at least a million `float` elements with value `1e-6`. (Use `float` because it will be easier to detect arithmetic errors compared with `double`.) For reliable timing you may need to increase the number of elements substantially!
3. Now try compiling with flags `-O1`, `-O2`, `-O3`, with and without `-ffast-math`. What happens to the timing? What about the accuracy?
4. What has `-ffast-math` done to the Kahan summation in order to optimise it? How does that relate to the flags you have used?
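For reference, a generic sketch of compensated (Kahan) summation; the implementation in `sum_examples` may differ in detail:
```cpp=
#include <vector>

// Compensated summation: c carries the low-order bits that are
// lost each time a small element is added to a large running sum.
float kahan_sum(const std::vector<float>& values)
{
    float sum = 0.0f;
    float c = 0.0f; // running compensation for lost precision
    for (float v : values)
    {
        float y = v - c;   // apply the stored correction to the next term
        float t = sum + y; // low-order bits of y may be lost in this add...
        c = (t - sum) - y; // ...but this difference recovers the lost part
        sum = t;           // carry the compensated total forward
    }
    return sum;
}
```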
### Exercise 2: SIMD Registers
1. Write a function to add two vectors together, and try compiling it using `-fopt-info-vec` with either `-ftree-vectorize` (at any optimisation level above `-O0`) or `-O3` to turn on vectorisation and get information back. Look at the size of the vectors in bytes - how many floats or doubles can fit into a vectorised block on your machine?
- If your CPU does not have SIMD registers, this won't work so don't worry if this does not happen!
- If you want to use 256-bit (32-byte) registers on an x86 machine you'll need to enable the compiler option `-mavx`.
SIMD registers are more efficient if your data is **aligned** in memory, which means that memory addresses fall on some regular spacing. For example, 16-byte alignment means that the memory address is divisible by 16, and 32-byte alignment means that the memory address is divisible by 32.
A single precision `float` is 4 bytes, so 4 floats placed next to each other are 16 bytes. If the floats are contiguous in memory, and the first float is aligned to a 16-byte boundary, then all 4 floats can be loaded into a 16-byte register with one efficient operation. If your floats are not aligned with the boundary, then the operation will be more complex and slower.
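As a quick sketch, you can check a pointer's alignment by inspecting its numeric address:
```cpp=
#include <cstdint>

// True if address p is aligned to `alignment` bytes (i.e. divisible by it).
bool is_aligned(const void* p, std::uintptr_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```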
To enforce alignment, we can use a couple of approaches. (`alignas` is a C++ keyword, so it needs no extra header; `std::align_val_t` is defined in `<new>`.)
For a variable in automatic storage (a stack variable) we can use `alignas`:
```cpp=
//stack declared array of 4 floats, aligned to 16 byte boundary
alignas(16) float f16[4];
//24 floats aligned to 32 byte boundary.
alignas(32) float f32[24];
//f32[0], f32[8], and f32[16] are aligned with the 32 byte boundary
//8 floats (e.g. f32[0] to f32[7]) can be loaded into a 256 bit register
```
For a heap-allocated array you can use `std::align_val_t` with `new`:
```cpp=
#include <new>

float *x = new (std::align_val_t(16)) float[4];
double *y = new (std::align_val_t(32)) double[4];
```
2. Modify your vector addition example to ensure that your memory buffers are properly aligned.
- Use 16-byte alignment for 128-bit / 16-byte SIMD registers.
- Use 32-byte alignment for 256-bit / 32-byte SIMD registers.
3. **OPTIONAL** If you have time and are curious, you may want to experiment with using compiler intrinsics to hard-code SIMD behaviour without having to rely on the compiler to optimise it for you. This can give you more control **but it is not platform independent**. There are different intrinsics for x86 and ARM processors.
Let's consider how an iteration of our vectorised addition works at a processor level:
1. You load 4 (or 8) floats into a 16 (or 32) byte register from your first array.
2. You load 4 (or 8) floats into a 16 (or 32) byte register from your second array.
3. You perform a vectorised addition operation.
4. You place the resulting 4 (or 8) float buffer into memory.
For x86 we need the commands / includes:
```cpp
// 128 bit definitions
#include <xmmintrin.h>
// 256 bit definitions
#include <immintrin.h>

// 128 bit intrinsics
// _mm_load_ps takes a pointer to the first float of a pack of four floats
// (the address must be 16-byte aligned)
__m128 loaded_floats = _mm_load_ps(address);
// vectorised addition, takes 2 __m128 arguments
__m128 result = _mm_add_ps(loaded_x, loaded_y);
// store the result at a memory address, takes a pointer and an __m128
_mm_store_ps(address, result);

// 256 bit intrinsics
// load a pack of 8 floats (the address must be 32-byte aligned)
__m256 loaded_floats_256 = _mm256_load_ps(address);
// vectorised addition, takes 2 __m256 arguments
__m256 result_256 = _mm256_add_ps(loaded_x_256, loaded_y_256);
// store the result at a memory address, takes a pointer and an __m256
_mm256_store_ps(address, result_256);
```
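Putting the 128-bit pieces together, a minimal sketch of a vectorised addition loop (assuming `N` is a multiple of 4 and `a`, `b`, and `c` are all 16-byte aligned):
```cpp
#include <cstddef>
#include <xmmintrin.h>

// c = a + b, four floats per iteration.
// Assumes N % 4 == 0 and 16-byte aligned pointers.
void add_vectors_sse(const float* a, const float* b, float* c, std::size_t N)
{
    for (std::size_t i = 0; i < N; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
}
```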
For ARM we need the commands / includes:
```cpp
#include <arm_neon.h>
// Load data: takes a pointer to the first of the four floats
// float32x4_t is for 4 floats i.e. 128 bit
float32x4_t loaded_floats = vld1q_f32(address);
// Vectorised add: takes 2 float32x4_t type arguments
float32x4_t result = vaddq_f32(loaded_x, loaded_y);
// store in memory: takes a pointer and a float32x4_t argument and stores the result at that address
vst1q_f32(address, result);
```
Try implementing your loop using intrinsics and compile **with optimisations turned off**. Check that your code is still vectorised.
- You should check your code is working correctly by comparing the result with a simple non-vectorised version.
- You can also time it to check that it is indeed speeding up and vectorising properly, or you can inspect the assembly code in the debugger or by using `objdump`.
Because this code is not platform independent, we should generally avoid this approach unless we are sure that we only want to develop for a specific family of CPUs _or_ we can use CMake to detect the architecture and do system-specific compilation, compiling ARM code for ARM CPUs and x86 code for x86 CPUs. This is beyond the scope of this course, but is key to the efficiency of many popular libraries like Eigen!