# Optimization course planning

## Course description

### Node-Level Performance Optimization

This course covers advanced topics in code optimization for x86 platforms (Intel and AMD CPUs). We discuss different techniques for analyzing and maximizing both single-core and multi-core performance within a single node. The topics include instruction-level parallelism, vectorization, and efficient utilization of cache and memory. The course consists of lectures and hands-on exercises.

Learning outcome

- Awareness of features and internal workings of x86 CPUs
- Ability to analyze and assess single-node performance
- Ability to vectorize computations
- Ability to optimize cache and memory access

Prerequisites

- Good knowledge of C/C++ or Fortran
- Good knowledge of threading using OpenMP
- Basic knowledge of modern CPU architectures

Agenda

Day 1

- Overview of performance engineering
- General overview of modern multicore CPUs
- Main memory performance
- Performance analysis tools

Day 2

- Deeper dive into caches
- Detailed look into Intel and AMD CPUs
- Advanced vectorization
- Additional optimization topics

## Detailed agenda

Time: Thu-Fri 23.5. - 24.5.

## Day 1

- 09:00 - 09:10 Welcome to the course, introductions
- 09:10 - 09:40 Overview of performance engineering (Martti)
  - principles, sampling, tracing
  - hardware counters, PAPI
  - roofline model
- 09:40 - 09:50 Break
- 09:50 - 10:30 General overview of modern multicore CPUs (Jussi)
  - front-end, back-end
  - fetch-decode-execute, pipelining
  - vectorization
  - short overview of cache hierarchy and NUMA
- 10:30 - 11:15 Exercise
  - peak flops, effects of pipelining, registers
  - vectorization
  - HW counters with perf?
- 11:15 - 12:00 Main memory performance (Jussi)
  - main memory bandwidth
  - brief mention of the cache hierarchy
  - memory controllers
  - NUMA
  - single vs. multiple threads
  - first touch
  - thread affinity
- 12:00 - 13:00 Lunch break
- 13:00 - 13:45 Exercise
  - STREAM with different numbers of threads / affinities
  - first touch effects
- 13:45 - 14:15 AMD uProf? (Igor)
- 14:15 - 14:30 Coffee
- 14:30 - 15:15 Intel VTune and Advisor (Mikko)
- 15:15 - 16:00 Exercises
  - Playing around with tools
- 16:00 - 16:15 Summary / Q&A

## Day 2

- 09:00 - 10:00 Deeper dive into caches (Martti)
  - cache line, alignment, associativity
  - basic optimization ideas
  - interactive demos with HW counters
- 10:00 - 11:00 Exercises
- 11:00 - 12:00 Deeper dive into Intel and AMD CPUs (Mikko + Igor)
- 12:00 - 13:00 Lunch
- 13:00 - 13:45 Deeper dive into vectorization (Mikko)
- 13:45 - 14:30 Exercises
- 14:30 - 14:45 Coffee break
- 14:45 - 15:15 Other optimization topics (Jussi)
  - loop transformations etc.
- 15:15 - 16:00 Exercises
- 16:00 - 16:15 Summary / Q&A

## Potentially useful links

- https://github.com/google/benchmark
- https://crd.lbl.gov/departments/computer-science/par/research/roofline/
- https://github.com/Kobzol/hardware-effects
- https://gcc.godbolt.org