# PHAS0100 - Week 7 (5th March 2021) :::info ## :tada: Welcome to the 7th live-class! ### Today Today we will cover - How to measure performance - Algorithmic complexity - Memory structures and caching - Compiler optimisation ::: ### Timetable for today | Start | End | Topic | Room | | -------- | -------- | -------- | ----- | |2pm|2:20pm |Introduction, timing, and complexity |main| |2:20pm|2:50pm| [Timing & Complexity exercises](#Breakout-session-1)| Breakouts| |2:50pm|3pm| -- 10 minute break -- | | |3pm|3:20pm| Memory and Caching | main | |3:20pm|3:50pm |[Cache optimisation exercises](#Breakout-session-2)| Breakouts| |3:50pm|4pm| -- 10 minute break -- | | |4pm |4:20pm| Compiler Optimisation| main | | 4:20pm|4:50pm| [Optimisation exercises](#Breakout-session-3) | Breakouts| |4:50pm|4:55pm| - 5 minute closeout | main| ## Breakout session 1 ## Class Exercise 30 ### Timing algorithms Clone the timing exercises from https://github.com/UCL-RITS/PHAS0100_Sorting as shown below. ```bash= git clone https://github.com/UCL-RITS/PHAS0100_Sorting.git cd PHAS0100_Sorting mkdir build ``` This will give you access to two sorting algorithms in a header file in the include folder: an insertion sort and a merge sort. Each one takes an unsorted vector and returns a sorted vector (rather than updating in place). 1. Write a program to generate a vector of random numbers of sizes between 1 and 1000. 2. Sort these vectors using `insertion_sort` and `merge_sort`. 3. Time the algorithms on different length lists using the `std::chrono` library. Do the times scale the way that you expect from the algorithms complexity? When you would use insertion sort, and when would you use merge sort? 4. Clone the profiling example repo from https://github.com/UCL-RITS/PHAS0100_Profiling 5. Check out the `CMakeLists.txt` file and notice the use of the `-pg` flag: this is what tells the compiler to insert the bookkeeping instructions into you program! 6. Compile and run your program as normal. When it's done, you should have find `gmon.out` has been created! 7. Run the following command: `gprof ProfileTest gmon.out > analysis.txt`. (The `> analysis.txt` part redirects the output into a text file.) 8. Open `analysis.txt` and take a look at the text output. The output will be divided into two main parts: the flat profiler (first) and the call graph (second). The flat profiler will order your functions by the fraction of time they took not including subroutine calls. The call graph will order your functions by the fraction of time they took _including_ subroutine calls. Compare the profiling analysis to the program and make sure you can understand it. 9. Use gprof to profile usage of the sorting algorithms on large or multiple lists (> 10000 items, or until the sorting takes >> 0.01 seconds) and compare with internal timing code. (Remember to add `SET(CMAKE_CXX_FLAGS -pg)` in your `CMakeLists.txt`.) Then compile and run your program as normal. Once it has run, use `gprof Sort gmon.out > sortAnalysis.txt` to run the profiler and store the results. Are the results what you expect? 10. Look at `merge_sort` in the flat profile and the call graph and explain the differences in what you see. ## Breakout session 2 ## Class Exercise 31 ### Cache optimisation Clone the repository from https://github.com/UCL-RITS/PHAS0100_Caching and create a build folder. 1. Compare the two matrix multiplication algorithms in `matrix_maths.h`. `rowMajor_by_rowMajor` multiplies two row major order matrices (stored row by row), and `rowMajor_by_colMajor` multiplies a row major order matrix and a column order matrix (stored column by column). Which do you expect to be more cache efficient? Time and compare them for different size matrices (up to N ~ 1000-2000 depending on the time it takes to execute). 2. A trivial matrix tranpose function has been provided. Write a new matrix transpose that is more cache efficient by transposing your matrix is smaller sub-blocks instead. 3. Compare the performance with the trivial algorithm for matrices up to the size of a few thousand elements. 4. How does the multiplication time compare with the time to tranpose the matrix? Given two row major matrices, is it more efficient to just multiply them together using the naive algorithm, or to tranpose one into column major format first? Does it depend on the size of the matrix? 5. Discuss how to write a cache oblivious algorithm for matrix multiplication using recursion. ## Breakout session 3 ## Class Exercise 32 ### Compiler optimisation Clone the repo at https://github.com/UCL-RITS/PHAS0100_Optimisation 1. Look at the `kahan_sum` and `trivial_sum` methods provided, which adds a single value `val` to a total `N` times. (We use the same value each time so that the result is easily predictable!) 2. Write a program which times and checks the result of the summation algorithms with at least a million `float` elements. 3. Now try compiling with flags `-O1`, `-O2`, `-O3`, with and without `-ffast-math`. What happens to the timing? What about the accuracy? 4. Write a function to add two arrays, and try compiling it using `-fopt-info-vec` with either `-ftree-vectorize` (at any level > 0) or `-O3` when to get details of the vectorisation. Look at the size of the vectors in bytes - how many floats or doubles can fit into a vectorised block on your machine? 5. What other optimisations might apply to this array addition function? (Refer to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html to see what is available at each level). 6. Explore the optimisation options in the documentation! Experiment with using them with the other code you've used this week, or with new functions. ## Homework Exercise 1 Complete the recursive matrix multiplication algorithm. Compare its performance with taking the transpose and performing a row-major by column-major multiplication. Instead of recursing down to the base case of a 1x1 matrix, try switching to standard matrix multiplication once the matrices reach some size threshold. How does this affect performance? ## Homework Exercise 2 Try measuring times using the external `date` library: https://github.com/HowardHinnant/date Compare usage to `std::chrono`. # Questions Here you can post any question you have while we are going through this document. A summary of the questions and the answers will be posted in moodle after each session. Please, use a new bullet point for each question and sub-bullet points for their answers. For example writing like this: ``` - Example question - [name=student_a] Example answer - [name=TA_1] Example answer ``` produces the following result: - Example question - [name=student_a] Example answer - [name=TA_1] Example answer Write new questions below :point_down: - [] - [name=TA_1] Example answer ###### tags: `phas0100` `teaching` `class`