# 6 Shared Memory Performance Issues

## Performance Analysis

### Tool Types

* Data acquisition
    * Event-based data: triggered by explicit events
    * **`Sampling`**-based data: triggered by external events like timers
* **`Instrumentation`**
    * Source code instrumentation
    * Compiler instrumentation
    * Binary instrumentation
    * Link-level instrumentation
* **Tradeoff**: invasiveness vs. overhead vs. ability to correlate
* **Big question**: **`granularity`**
* Aggregation
    * Aggregation over time and space: **simplified profile**

## Sequential Performance

### Starting Point: Flat Profile

* Profiles show **`computationally intensive`** code regions
    * First views: `Time` spent per function or per statement
* Identify **`bottleneck`** components
    * View the `profile aggregated` by shared objects
    * Correct/expected modules?
    * Impact of support and runtime libraries
* Typical implementation: **`sampling`**
    * Take a sample: `address` inside a function
    * Record sample and build histograms of `all PCs seen`
    * Over time: `statistically` cover all of the code
* `perf`: on-CPU time

### Adding Context through `Stack Traces`

```mermaid
graph LR;
id1[Function A]-->id2[Function B];
id1[Function A]-->id3[Function C];
id3[Function C]-->id4[Function D];
id4[Function D]-->id5[Function E];
id2[Function B]-->id5[Function E];
```

* Missing information in `flat profiles`
    * Distinguish routines called from `multiple callers`
    * Understand the call `invocation history`
    * `Context` for performance data
* Critical technique: **`Stack traces`**
    * Gather a stack trace for each performance sample
    * Aggregate only samples with an equal trace

### Inclusive vs. Exclusive Timing

```mermaid
graph LR;
id1[Function A]-->id2[Function B];
id1[Function A]-->id3[Function C];
subgraph Inclusive Time for C;
id3[Function C]-->id4[Function D];
id4[Function D]-->id5[Function E];
end
subgraph Exclusive Time for B;
id2[Function B]-->id5[Function E];
end
```

* Implementation similar to flat profiles
    * Sample PC information
    * Additionally collect `call stack information` at every sample
* **`Tradeoffs`**
    * Pro: obtain `additional context information`
    * Con: `higher overhead`/`lower sampling rate`
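As a rough sketch of how such samples turn into inclusive and exclusive numbers, the toy C program below (hypothetical, hard-coded stacks rather than real PC samples) counts, for each function, the samples in which it appears anywhere on the stack (inclusive) and those in which it is the innermost frame (exclusive):

```c
#include <stdio.h>
#include <string.h>

/* Toy sketch: each sample is one call stack, innermost function last.
 * Hypothetical data; real tools record PC addresses and resolve them
 * to functions via symbol tables. */
#define MAX_DEPTH 4
static const char *samples[][MAX_DEPTH] = {
    { "A", "B", "E" },       /* sample hit E, called via A->B    */
    { "A", "C", "D", "E" },  /* sample hit E, called via A->C->D */
    { "A", "B" },            /* sample hit B itself              */
    { "A", "C", "D" },       /* sample hit D itself              */
};
#define NSAMPLES (sizeof samples / sizeof samples[0])

static const char *funcs[] = { "A", "B", "C", "D", "E" };
#define NFUNCS (sizeof funcs / sizeof funcs[0])

int main(void) {
    for (size_t f = 0; f < NFUNCS; f++) {
        int inclusive = 0, exclusive = 0;
        for (size_t s = 0; s < NSAMPLES; s++) {
            int depth = 0;
            while (depth < MAX_DEPTH && samples[s][depth]) depth++;
            for (int d = 0; d < depth; d++) {
                if (strcmp(samples[s][d], funcs[f]) == 0) {
                    inclusive++;                     /* anywhere on the stack  */
                    if (d == depth - 1) exclusive++; /* on top: running itself */
                    break;
                }
            }
        }
        /* With sampling, counts approximate time: time ~= count * interval. */
        printf("%s: inclusive %d samples, exclusive %d samples\n",
               funcs[f], inclusive, exclusive);
    }
    return 0;
}
```

In the toy data, function A has only inclusive samples and no exclusive ones, which is exactly the "inclusive time >> exclusive time" case discussed next: its time is spent in the children.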
### Interpreting Call Context Data

#### Inclusive versus exclusive times

* `inclusive time ~= exclusive time`
    * Not useful to profile below this layer
* **`inclusive time >> exclusive time`**
    * Focus attention on the execution times of the **`children`**

#### Hotpath analysis

* Which path takes the most time?
* Path time might be OK/expected, but could point to a problem

#### Butterfly analysis (best known from `gprof`)

* Should be done on “suspicious” functions
    * Functions with `large execution time`
    * Functions with a `large difference` between `inclusive` and `exclusive` time
    * Functions of interest
    * Functions that “take unexpectedly long”
    * ...
* Shows the `split of time` between callees and callers
* Butterfly view (O|SS, similar in other tools)

![](https://i.imgur.com/U0D0LuE.png)

## Performance Counters

* Timing information doesn’t tell you why it is slow
* Next: investigate hardware/application interaction

### Hardware Performance Counters

* Architectural features
    * Typically/mostly `packaged inside the CPU`
    * Count hardware events `transparently without overhead`
* Newer platforms also provide system counters
    * Network cards and switches
    * Environmental sensors
* Drawbacks
    * Availability `differs` between platforms & processors
    * Slight semantic differences between platforms
    * In some cases: `requires` `privileged access` & `kernel patches`
* Recommended: access through **`PAPI`** (alternative: `likwid`)
    * `API` for tools + simple runtime tools
    * Abstractions for system-specific layers
    * More information: http://icl.cs.utk.edu/papi/

### Examples of Typical Counters

![](https://i.imgur.com/MhwrUQL.png)

### Useful Metric Ratio: IPC

> Ratios derived from a combination of hardware events can sometimes provide more useful information than raw metrics

```fortran
! IPC = PAPI_TOT_INS / PAPI_TOT_CYC
do i = 1, n1
  do j = 1, n3
    do k = 1, n2
      a(i,j) = a(i,j) + b(i,k) * c(k,j)
    end do
  end do
end do
```

![](https://i.imgur.com/srR5rYE.png)

## Shared Memory Performance

### What to Look For: Shared Memory

* Threading overhead
    * Limit starting/destroying threads
    * Runtimes can help
    * Limited work per thread
    * Scheduling can cause overheads
    * Programmer’s responsibility
* Synchronization overhead
    * Next section
* Cache behavior and locality
    * Next section
* Thread and data locality / mapping
    * Next section

## Synchronization Overhead

* Only use locks when needed
* Avoid `deadlocks`
* If critical sections have to be substantial, try `overlapping`

### Lock Contention

![](https://i.imgur.com/Aeq62ti.png)

### Lock-Free Data Structures

![](https://i.imgur.com/kIHGG6t.png)

* Hardware support essential
    * Compare and swap
* ABA problem
    * A tries to remove an element
        * Has to wait
    * B removes the element
        * Frees it
        * Reallocates it (new element, same address)
        * B inserts the element again
    * A compares against the old/new element
* ABA can happen when elements are `pointers`
    * With just values it doesn’t matter
* Ensure that data isn’t reused
    * Use languages with implicit memory management
    * Use `generational counters`
* [Lock-free programming topics - HackMD](https://hackmd.io/@sysprog/lock-free-prog#2-Compare-And-Swap-Loops)

### Wait-Free Programming

* Every operation finishes in a finite number of steps
* Wait-free is not the same as lock-free
    * Wait-free requires lock-free
* Needed for hard real-time scenarios
* Much harder to achieve, in some cases impossible

## Cache Behavior and Locality

* More threads need bandwidth
* More pressure on caches
* Contention on main memory

### Problem of False Sharing

![](https://i.imgur.com/m6YCp49.png)

### Example Impact of False Sharing

![](https://i.imgur.com/SkNyVOd.png)

![](https://i.imgur.com/Ad950FZ.png)
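A minimal OpenMP sketch of the pattern behind such measurements (my own illustration, not the code from the slides; a 64-byte cache line, at most 64 threads, and `gcc -O2 -fopenmp` are assumed): each thread increments its own element of a shared array, so the unpadded version keeps bouncing one cache line between cores, while the padded version gives each thread its own line.

```c
#include <omp.h>
#include <stdio.h>

#define N 100000000L  /* increments per thread (arbitrary example value) */

/* Unpadded: neighbouring per-thread counters share a cache line. */
volatile long counters[64];

/* Padded: each counter occupies its own 64-byte cache line (assumed size). */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded padded_counters[64];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < N; i++)
            counters[id]++;               /* shared line: constant invalidations */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < N; i++)
            padded_counters[id].value++;  /* each thread owns its line */
    }
    double t2 = omp_get_wtime();

    printf("unpadded: %.3f s   padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
```

On a typical multicore machine the padded loop is noticeably faster, in line with the impact shown in the slides above.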
### Example: Improving Data Locality

![](https://i.imgur.com/3L5zJgQ.png)

![](https://i.imgur.com/NlxJID6.png)

![](https://i.imgur.com/m42HQQ7.png)

## Thread and Data Locality / Mapping

* NUMA systems are dominating
    * All memory **`globally addressable`**
    * Different data access latencies
        * Local accesses: short access latency
        * Remote accesses: long access latency
* Location of data and threads, and their mapping to each other, is important
* Coordinate threads and memory
    * At least a logical association
    * Know where your data is and where it is used

### Placing Memory

* Typical policy: first touch
    * The first access allocates the physical memory
    * Important if using large pages
* Example: stencil code

![](https://i.imgur.com/PSwBsNF.png)

![](https://i.imgur.com/hEP5CGA.png)

### Thread Locations in OpenMP

* Controlled by the ICV `OMP_PROC_BIND`
    * `true`: threads are bound to cores/hw-threads
    * `false`: threads are not bound and can be migrated
    * `master`: new threads are always co-located with the master
    * `close`: new threads are located “close” to their master threads
    * `spread`: new threads are spread out as much as possible
* Also `OMP_PLACES`
* Example

![](https://i.imgur.com/4T4u1N5.png)

### NUMA Management via libnuma

* `man numa`
* Examples (a short usage sketch follows at the end of this chapter)
    * Memory allocation
        * `void *numa_alloc_onnode(size_t size, int node);`
        * `void *numa_alloc_local(size_t size);`
    * Thread binding
        * `void numa_bind(struct bitmask *nodemask);`
    * Query routines
        * `int numa_num_configured_nodes();`

## Scaling Systems

### Scaling Cache-Coherent Shared Memory Machines

* UMA
    * Central memory system becomes the main `bottleneck`
    * `Memory consistency` rules are necessary, but expensive
    * Must sustain `bandwidth` for ALL CPUs
* NUMA
    * Memory system becomes distributed
        * More bandwidth, reduced bottleneck
    * `CC protocol` becomes a `huge overhead`
        * Each memory access potentially invalidates data in any cache
    * `Cannot scale`

### Alternative: Distributed Memory

![](https://i.imgur.com/aaV92nG.png)

* Separate compute nodes
    * Independent “computers” / “Shared Nothing Architecture”
    * Separate processors
    * Separate memories
* Connected through an explicit network
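To make the libnuma routines listed under “NUMA Management via libnuma” concrete, here is a minimal allocation sketch (node 0 and the 64 MiB size are arbitrary example values; compile with `-lnuma`):

```c
#include <numa.h>    /* libnuma; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {           /* check library/kernel support */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    size_t size = 64UL * 1024 * 1024;     /* 64 MiB, arbitrary example size */

    /* Allocate on an explicit node (node 0 here) ... */
    double *a = numa_alloc_onnode(size, 0);
    /* ... or on whatever node the calling thread currently runs on. */
    double *b = numa_alloc_local(size);
    if (!a || !b) return 1;

    /* Keep this thread on node 0 so its accesses to 'a' stay local. */
    numa_run_on_node(0);
    memset(a, 0, size);                   /* touch the memory locally */

    numa_free(a, size);
    numa_free(b, size);
    return 0;
}
```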