# 6 Shared Memory Performance Issues
## Performance Analysis
### Tool Types
* Data acquisition
* Event-based data: triggered by explicit events
* **`Sampling`**-based data: triggered by external events like timers
* **`Instrumentation`**
* Source code instrumentation
* Compiler instrumentation
* Binary instrumentation
* Link-level instrumentation
* **Tradeoff**: invasiveness vs. overhead vs. ability to correlate
* **Big question**: **`granularity`**
* Aggregation
* Aggregation over time and space: **simplified profile**
## Sequential Performance
### Starting Point: Flat Profile
* Profiles show **`computationally intensive`** code regions
* First views: `Time` spent per function or per statement
* Identify **`bottleneck`** components
* View the `profile aggregated` by shared objects
* Correct/expected modules?
* Impact of support and runtime libraries
* Typical implementation: **`sampling`**
* Take a sample: the `address` (PC) inside the currently executing function
* Record the sample and build a histogram of `all PCs seen`
* Over time: `statistically` cover all of the code
* Linux `perf`: on-CPU time
### Adding Context through `Stack Traces`
```mermaid
graph LR;
id1[Function A]-->id2[Function B];
id1[Function A]-->id3[Function C];
id3[Function C]-->id4[Function D];
id4[Function D]-->id5[Function E];
id2[Function B]-->id5[Function E];
```
* Missing information in `flat profiles`
* Distinguish routines called from `multiple callers`
* Understand the call `invocation history`
* `Context` for performance data
* Critical technique: **`Stack traces`**
* Gather a stack trace for each performance sample (see the sketch below)
* Aggregate only samples with an identical trace
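A minimal sketch of how such a sampling profiler can collect call-stack context on Linux: a profiling timer delivers `SIGPROF` based on consumed CPU time, and the handler captures the current stack. The names `on_sample` and `work` and the dummy loop are made up for illustration; a real tool would use async-signal-safe unwinding and aggregate equal traces into a histogram.
```c
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

#define MAX_DEPTH 32

static volatile long samples = 0;

/* One sample: capture the current call stack.  A real profiler would
 * hash the trace and bump a per-trace counter; backtrace() is used
 * here only to keep the sketch short (it is not async-signal-safe). */
static void on_sample(int sig) {
    (void)sig;
    void *frames[MAX_DEPTH];
    backtrace(frames, MAX_DEPTH);
    samples++;
}

static void work(void) {
    volatile double s = 0.0;
    for (long i = 0; i < 200000000L; ++i) s += i * 1e-9;
}

int main(void) {
    signal(SIGPROF, on_sample);
    /* roughly 1000 samples per second of CPU time */
    struct itimerval tv = { {0, 1000}, {0, 1000} };
    setitimer(ITIMER_PROF, &tv, NULL);
    work();
    printf("collected %ld samples\n", samples);
    return 0;
}
```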
### Inclusive vs. Exclusive Timing
```mermaid
graph LR;
id1[Function A]-->id2[Function B];
id1[Function A]-->id3[Function C];
subgraph Inclusive Time for C;
id3[Function C]-->id4[Function D];
id4[Function D]-->id5[Function E];
end
subgraph Exclusive Time for B;
id2[Function B]-->id5[Function E];
end
```
* Implementation similar to flat profiles
* Sample PC information
* Additionally collect `call stack information` at every sample
* **`Tradeoffs`**
* Pro: Obtain `additional context information`
* Con: `Higher overhead`/`lower sampling rate`
### Interpreting Call Context Data
#### Inclusive versus exclusive times
* `inclusive time ~= exclusive time`
* not useful to profile below this layer
* **`inclusive time >> exclusive time`**
* Focus attention on the execution times of the **`children`**
#### Hotpath analysis
* Which path takes the most time?
* The path's time might be OK/expected, but it could also point to a problem
#### Butterfly analysis (best known from `gprof`)
* Should be done on “suspicious” functions
* Functions with `large execution time`
* Functions with `large difference` between `inclusive` and `exclusive` time
* Functions of interest
* Functions that “take unexpectedly long”
* ...
* Shows `split of time` in callees and callers
* Butterfly view (Open|SpeedShop (O|SS); similar in other tools)

## Performance Counters
* Timing information alone doesn’t tell you why the code is slow
* Next: Investigate hardware/application interaction
### Hardware Performance Counters
* Architectural Features
* Typically implemented `inside the CPU`
* Count hardware events `transparently, with virtually no overhead`
* Newer platforms also provide system counters
* Network cards and switches
* Environmental sensors
* Drawbacks
* Availability `differs` between platforms & processors
* Slight semantic differences between platforms
* In some cases: `requires` `privileged access` & `kernel patches`
* Recommended: Access through **`PAPI`** (alternative: `likwid`); see the sketch after this list
* `API` for tools + simple runtime tools
* Abstractions for system specific layers
* More information: http://icl.cs.utk.edu/papi/
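A minimal sketch of counting two events with PAPI's low-level API and deriving IPC from them (link with `-lpapi`); the dummy loop merely stands in for the region of interest:
```c
#include <papi.h>
#include <stdio.h>

int main(void) {
    /* Initialize PAPI and set up an event set with two preset counters */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

    int evset = PAPI_NULL;
    long long values[2];
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);  /* retired instructions */
    PAPI_add_event(evset, PAPI_TOT_CYC);  /* total cycles */

    PAPI_start(evset);
    volatile double s = 0.0;              /* region of interest (dummy) */
    for (long i = 0; i < 10000000L; ++i) s += i * 0.5;
    PAPI_stop(evset, values);

    printf("IPC = %.2f\n", (double)values[0] / (double)values[1]);
    return 0;
}
```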
### Examples of Typical Counters

### Useful Metric-Ratio: IPC
> Ratios derived from a combination of hardware events can sometimes provide more useful information than raw metrics
```fortran
! IPC = PAPI_TOT_INS / PAPI_TOT_CYC
do i = 1, n1
  do j = 1, n3
    do k = 1, n2
      a(i,j) = a(i,j) + b(i,k) * c(k,j)
    end do
  end do
end do
```

## Shared Memory Performance
### What to Look For: Shared Memory
* Threading overhead
* Limit starting/destroying threads
* Runtimes can help
* Limited work per thread
* Scheduling can cause overheads
* Programmer’s responsibility (see the sketch after this list)
* Synchronization Overhead
* Next section
* Cache Behavior and Locality
* Next section
* Thread and Data Locality / Mapping
* Next section
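A sketch of the threading-overhead point above (sizes and names are illustrative): keeping one OpenMP thread team alive across all time steps avoids paying the fork/join and region-startup cost in every iteration, while the worksharing loops still split the work.
```c
#include <omp.h>

#define N 1000000
#define STEPS 100

static double a[N], b[N];

int main(void) {
    /* One parallel region for all time steps: the sequential t-loop is
     * replicated in every thread, while the omp for constructs split
     * the actual work; the implicit barriers at the end of each omp for
     * keep the two sweeps ordered. */
    #pragma omp parallel
    for (int t = 0; t < STEPS; ++t) {
        #pragma omp for schedule(static)
        for (int i = 1; i < N - 1; ++i)
            b[i] = 0.5 * (a[i - 1] + a[i + 1]);
        #pragma omp for schedule(static)
        for (int i = 1; i < N - 1; ++i)
            a[i] = b[i];
    }
    return 0;
}
```
Most OpenMP runtimes keep worker threads in a pool anyway, so the saving is mainly region startup and synchronization rather than actual thread creation; the structure also keeps the static work distribution, and thus data affinity, stable across time steps.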
## Synchronization Overhead
* Only use locks when needed
* Avoid `deadlocks`
* If critical sections have to be substantial, try `overlapping` them with independent work
### Lock Contention
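A small sketch of how contention on a single lock can dominate (names and iteration counts are arbitrary): every increment fighting for one critical section versus accumulating privately and entering the critical section once per thread.
```c
#include <omp.h>
#include <stdio.h>

#define N 10000000L

int main(void) {
    long contended = 0, uncontended = 0;

    double t0 = omp_get_wtime();
    /* Contended: every increment competes for the same lock */
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) {
        #pragma omp critical
        contended++;
    }
    double t1 = omp_get_wtime();

    /* Low contention: accumulate privately, combine once per thread */
    #pragma omp parallel
    {
        long local = 0;
        #pragma omp for
        for (long i = 0; i < N; ++i)
            local++;
        #pragma omp critical
        uncontended += local;
    }
    double t2 = omp_get_wtime();

    printf("contended: %ld in %.3fs, uncontended: %ld in %.3fs\n",
           contended, t1 - t0, uncontended, t2 - t1);
    return 0;
}
```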

### Lock-Free Data Structures

* Hardware support essential
* Compare and swap (CAS); see the CAS-loop sketch after this list
* ABA Problem
* Thread A reads the top element and prepares to remove it with a CAS
* A is delayed before the CAS executes
* Thread B removes that element
* B frees it
* The memory is reallocated (new element, same address)
* B inserts the new element
* A’s CAS compares against the address it read earlier, sees the “same” value, and succeeds although the structure has changed
* ABA can happen when the compared values are `pointers`
* With plain values it doesn’t matter
* Ensure that memory isn’t reused while another thread may still compare against it
* Use languages with implicit memory management (garbage collection)
* Use `generational counters` (version tags alongside the pointer)
* [Lock-free programming issues - HackMD](https://hackmd.io/@sysprog/lock-free-prog#2-Compare-And-Swap-Loops)
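A minimal sketch of a compare-and-swap loop in C11 atomics, here the push of a Treiber-style lock-free stack (the names `node` and `push` are made up): the CAS retries until the head that was read is still the head at the moment of the swap. A `pop` written the same way is exactly where the ABA problem above strikes, which is why real implementations add generation counters, hazard pointers, or rely on garbage collection.
```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Top of a Treiber-style lock-free stack */
static _Atomic(struct node *) head = NULL;

/* Push via a CAS loop: if another thread changed head between our load
 * and the CAS, the CAS fails, reloads the current head into n->next,
 * and we simply try again. */
void push(int value) {
    struct node *n = malloc(sizeof *n);
    if (!n) return;
    n->value = value;
    n->next = atomic_load(&head);
    while (!atomic_compare_exchange_weak(&head, &n->next, n)) {
        /* retry with the freshly observed head */
    }
}

int main(void) {
    for (int i = 0; i < 4; ++i)
        push(i);
    return 0;
}
```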
### Wait-Free Programming
* Every operation finishes in a finite number of steps, regardless of what other threads do
* Wait-free is not the same as lock-free
* Wait-free is the stronger guarantee: every wait-free algorithm is also lock-free
* Needed for hard real time scenarios
* Much harder to achieve, in some cases impossible
## Cache Behavior and Locality
* More threads need more bandwidth
* More pressure on caches
* Contention on main memory
### Problem of False Sharing
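False sharing: threads update *different* variables that happen to live in the same cache line, so the coherence protocol keeps bouncing the line between cores even though no datum is logically shared. A rough sketch of the effect (a 64-byte cache line is assumed; thread and iteration counts are arbitrary):
```c
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 100000000L

/* Unpadded: all counters fit into one cache line -> false sharing */
static volatile long counters[NTHREADS];

/* Padded/aligned: each counter owns its own (assumed) 64-byte line */
struct padded_counter { _Alignas(64) volatile long v; };
static struct padded_counter padded[NTHREADS];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; ++i) counters[id]++;
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; ++i) padded[id].v++;
    }
    double t2 = omp_get_wtime();
    printf("same cache line: %.2fs, padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}
```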

### Example Impact of False Sharing


### Example: Improving Data Locality
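One classic locality improvement, sketched below with arbitrary sizes: traverse a C (row-major) array along its rows, so consecutive iterations touch consecutive addresses and every loaded cache line is fully used, instead of striding down columns.
```c
#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];

int main(void) {
    clock_t t0 = clock();
    /* Column-major traversal: stride of N doubles, poor cache-line reuse */
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            a[i][j] += 1.0;
    clock_t t1 = clock();
    /* Row-major traversal: unit stride, each cache line fully used */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] += 1.0;
    clock_t t2 = clock();
    printf("column order: %.2fs, row order: %.2fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```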



## Thread and Data Locality / Mapping
* NUMA systems are dominating
* All memory **`globally addressable`**
* Different data access latencies
* Local accesses: Short access latency
* Remote accesses: Long access latency
* The locations of data and threads relative to each other are important
* Coordinate threads and memory
* At least logical association
* Know where your data is and where it is used
### Placing Memory
* Typical policy: first touch
* The first access allocates the physical memory on the NUMA node of the touching thread
* Important if using large pages
* Example: Stencil Code
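A sketch of first-touch placement for a simple 1-D stencil (array names and sizes are arbitrary): the arrays are initialized in parallel with the same static distribution as the compute loop, so each page is first written, and therefore physically allocated, on the NUMA node of the thread that will later use it.
```c
#include <omp.h>
#include <stdlib.h>

#define N 100000000L

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    /* First touch: this parallel initialization decides on which NUMA
     * node each page lands, so it must mirror the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 0.0; }

    /* Stencil sweep: threads now mostly access node-local pages */
    #pragma omp parallel for schedule(static)
    for (long i = 1; i < N - 1; ++i)
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;

    free(a);
    free(b);
    return 0;
}
```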


### Thread Locations in OpenMP
* Controlled by the `OMP_PROC_BIND` ICV / environment variable
* `true`: threads are bound to cores/hw-threads
* `false`: threads are not bound and can be migrated
* `master` (called `primary` in newer OpenMP versions): new threads are co-located with the master thread
* `close`: new threads are located “close” to their master threads
* `spread`: new threads are spread out as much as possible
* Also `OMP_PLACES`
* Example
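A small sketch of inspecting the resulting placement from inside the program using OpenMP 4.5 runtime routines; with, say, `OMP_PLACES=cores` and `OMP_PROC_BIND=spread` set in the environment, each thread reports the place it was bound to:
```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* Each thread reports its place within the OMP_PLACES list */
        printf("thread %d of %d runs on place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}
```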

### NUMA Management via libnuma
* `man numa`
* Examples
* Memory allocation
* `void *numa_alloc_onnode(size_t size, int node);`
* `void *numa_alloc_local(size_t size);`
* Thread binding
* `void numa_bind(struct bitmask *nodemask);`
* Query routines
* `int numa_num_configured_nodes();`
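A minimal usage sketch of the routines listed above (the target node 0 and the 64 MiB size are arbitrary; link with `-lnuma`):
```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA API not available\n");
        return 1;
    }
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    /* Allocate 64 MiB of memory placed on node 0 */
    size_t sz = 64UL << 20;
    double *buf = numa_alloc_onnode(sz, 0);
    if (!buf) return 1;

    /* ... use buf from threads running on or near node 0 ... */

    numa_free(buf, sz);   /* numa_alloc_* memory is released with numa_free */
    return 0;
}
```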
## Scaling Systems
### Scaling Cache Coherent Shared Memory Machines
* UMA
* Central memory system becomes the main `bottleneck`
* `Memory consistency` rules are necessary, but expensive
* Must sustain `bandwidth` for ALL CPUs
* NUMA
* Memory system becomes distributed
* More bandwidth, reduced bottleneck
* The cache-coherence (`CC`) protocol becomes a `huge overhead`
* Each write potentially invalidates copies of the data in other caches
* `Cannot scale` to large core counts
### Alternative: Distributed Memory

* Separate compute nodes
* Independent “computers” / “Shared Nothing Architecture”
* Separate processors
* Separate memories
* Connected through an explicit network