# Performance programming training material

## Planning

Karl-Filip Faxen

The workshop is interactive in the sense that students read text, write programs, and run and measure programs. We assume Linux, since that is what almost all supercomputers run. C is the primary language, with as much material as possible duplicated in Fortran. I will add the Fortran code once the C code has come along a bit, since the C code will set the structure.

There is an issue with which machines to use. If the students run on their local machines while the workshop is given, the videoconferencing software (Teams, Zoom, ...) will affect measurements quite dramatically.

We will use a minimum of tools, with a focus on tools that are readily available in Linux, like the bash command `time` as well as the `perf` tool.

The learning experience will be structured around program examples falling into three themes: algorithms and complexity, compilers, and hardware. These themes also need some explanatory text, which we will try to divide into small parts between the exercises.

### List of topics

- Basic measurement
  - Sources of error: noise, overhead, other things going on, frequency scaling
  - Possibly: CPU time, elapsed time, user time, system time
- Complexity: Big-Oh notation (O(n) and so on)
  - Basic polynomial classes n, n^2^, n^3^
  - Logarithmic complexity log(n)
  - Maybe NP and exponential if we talk about search
  - The program structures that tend to lead to different complexity
    - Nested loops for polynomial complexity
    - Divide and conquer for n*log(n)
  - Amortized complexity
  - What is input size?
  - Practical experiment: plot execution time as a function of input size
- Algorithms: strategies for improving complexity
  - Memoization
  - Avoiding redundant computation
  - Divide and conquer
  - Sparseness
  - Spatial trees (quadtrees, octrees, Barnes-Hut and minimal spanning trees)
  - Practical experiment: checking for repetitions, various variants
- Architecture: adapting to the machine
  - Caches
    - Measuring the characteristics of your cache (size, bandwidth, ...)
    - Spatial locality: use the cache lines (traversal order, matrix transpose)
    - Temporal locality: blocking (matrix multiply, filters, convolutions in machine learning)
    - Prefetching: use more of the available bandwidth through parallelism (but just using the scheduling window is an alternative, for the L1 at least)
  - Branch prediction: avoid hard-to-predict branches, the rest are ok
    - Branchless code (selecting, using the sign bit, etc)
    - Conditional moves
    - One branch for several conditions
    - Loop fusion (fewer total iterations)
    - Loop nesting: high trip count deepest (not always possible)
  - I/O performance
    - Binary vs text (formatted) I/O
    - Buffering
  - Vectorization
    - Easy vectorization (vector add, matrix add)
    - Difficult (matrix transpose)
- Compilers: what can they do?
  - Inlining
  - Constant folding
  - Vectorization (what amount of vectorization from just the compiler?)
  - Use the compiler's optimizer to allow for more readable code
  - Completely unrolled loops and small arrays give guaranteed register promotion
  - Some assembly code, simple examples, do type-n-talk annotation (maybe)
  - Which instruction set (variant of x86) the compiler targets, and how to control it

### List of possible program examples and the ideas they illuminate

- Checking for repetitions
  - Preprocessing your data for future reference
  - Hash tables
- Matrix-matrix multiply
  - I/O and init time vs compute time (asymptotic complexity)
  - Blocking (of course)
  - Traversal order (by transposing the second argument)
  - Preprocessing (transposing the second argument)
  - Vectorization
- Matrix transpose
  - Spatial locality (large caches will do it for you if the matrix fits)
  - Tricky vectorization (useful if L1 bandwidth is a major concern)
- Fibonacci
  - Exponential complexity
  - Memoization
- Quicksort
  - n*log(n) complexity
  - Sensitivity to input data order unless the pivot is randomized

### More comments

Avoid expensive operations, e.g. shift instead of multiplying or dividing by powers of two.

## Somewhat old material

The workshop is divided into four major parts:

### Algorithms and complexity (is there a better way?)

- Basic algorithm complexity (constant, logarithmic, linear, quadratic, ...)

Example: are the elements of an array unique?

```C
int unique1(int a[], int n)
{
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++)
            if(i != j && a[i] == a[j])
                return 0;
    return 1;
}
```

Complexity is quadratic (n^2^) due to the nested loops. Some of the computation is clearly redundant: each pair (x,y) of distinct indices is checked twice, once when `i==x` and `j==y` and once when `i==y` and `j==x`.
So the first improvement is to remove this simple redundancy:

```C
int unique2(int a[], int n)
{
    for(int i = 0; i < n-1; i++)
        for(int j = i+1; j < n; j++)
            if(a[i] == a[j])
                return 0;
    return 1;
}
```

Here we insist that `i<j` at all times, so we check each (x,y) pair only once, and at the same time we avoid the test ensuring that `i != j`.

Can we do better? Yes, by sorting the array first. Sorting can be accomplished in time `n*log(n)`, which is much faster than `n*n`. After sorting, any duplicates are adjacent, so a single linear scan finds them; a sketch using the C library's `qsort` from `<stdlib.h>` (note that this sorts `a` in place):

```C
static int cmp_int(const void *p, const void *q)
{
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

int unique3(int a[], int n)
{
    qsort(a, n, sizeof(int), cmp_int);   /* n*log(n) */
    for(int i = 1; i < n; i++)           /* duplicates are now adjacent */
        if(a[i-1] == a[i])
            return 0;
    return 1;
}
```

- Amortized complexity
- Patterns
- Driven by examples (fib, removing duplicates, Barnes-Hut)

### Computer architecture and microarchitecture (work *with* the machine)

- Caches
- Pipelines
- Vector programming

### Compiler optimizations (that you do not need to do yourself)

- Using the optimizer as a poor person's partial evaluator
  - Constant propagation
  - Resolving conditionals statically
  - Inlining
- Checking that you got what you expected with `-S`

### Observing performance (where is my program spending its time)

- The perf tool
- Instrumenting your program (performance analysis for printf() debuggers)
- Other tools as available
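One way to instrument a program by hand, as suggested in the last section, is to read `clock_gettime(CLOCK_MONOTONIC, ...)` around the region of interest. A minimal sketch; the helper name `elapsed_seconds` is ours, not part of any library:

```c
#include <stdio.h>
#include <time.h>

/* Difference between two timespecs, in seconds. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) * 1e-9;
}

/* Typical use around a region of interest:
 *
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   work();
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   printf("work: %.6f s\n", elapsed_seconds(t0, t1));
 */
```

`CLOCK_MONOTONIC` measures elapsed (wall-clock) time unaffected by clock adjustments; for CPU time, `CLOCK_PROCESS_CPUTIME_ID` is an alternative.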
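The Fibonacci example in the topic list above pairs exponential complexity with memoization as the fix. A minimal sketch, assuming 64-bit results (names like `fib_memo` and the table size are illustrative only):

```c
#include <stdint.h>

#define FIB_MAX 94  /* fib(93) is the largest Fibonacci number fitting in uint64_t */

static uint64_t memo[FIB_MAX];  /* 0 means "not computed yet"; fib(n) > 0 for n >= 1 */

/* The naive recursion recomputes the same subproblems exponentially often;
 * with the memo table each value is computed once, so the cost drops to O(n). */
static uint64_t fib_memo(int n)
{
    if (n < 2)
        return (uint64_t)n;
    if (memo[n] == 0)
        memo[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return memo[n];
}
```

Timing `fib_memo` against the naive version for n around 40 makes the exponential-versus-linear difference directly visible.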
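For the branch-prediction topic ("branchless code, using the sign bit"), one classic trick is selecting a minimum without a branch. A sketch, assuming 32-bit integers, that `a - b` does not overflow, and that `>>` on a negative value is an arithmetic shift (true on mainstream compilers, but implementation-defined in standard C):

```c
#include <stdint.h>

/* Branch-free minimum: (diff >> 31) is all ones when a < b and all zeros
 * otherwise, so the mask selects either diff or 0 to add to b.
 * Note that compilers often emit a conditional move for the plain
 * ternary (a < b ? a : b) anyway, which is also branch-free. */
static int32_t min_branchless(int32_t a, int32_t b)
{
    int32_t diff = a - b;
    return b + (diff & (diff >> 31));
}
```

Such rewrites only pay off when the branch is genuinely hard to predict; well-predicted branches are essentially free.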