# Profiling
[The Basics of Profiling - Mathieu Ropert - CppCon 2021](https://www.youtube.com/watch?v=dToaepIXW4s)
---
### Before talking profiling
* The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times.
* Premature optimization is the root of all evil (or at least most of it) in programming. - Donald Knuth
* [premature optimization](https://www.youtube.com/watch?v=tKbV6BpH-C8)
---
### Why profiling
1. Figuring out why a program is slow is hard
2. Reading the code can easily mislead
3. Modern CPUs are quite complex
4. Measure!
---
### Profiling vs Optimization
1. Profilers are one of the tools that can be used during an optimization iteration cycle.
2. Best used to investigate where to optimize
3. Can be used to measure if an optimization was effective, within limits

---
### Profiler usage
1. Identify hotspots & bottlenecks
2. Visualize execution timeline
3. Collect & compute metrics
---
### Types of profiler
1. Sampling profiling
2. Instrumentation profiling
---
### Sampling profiling
1. Attach to program, periodically interrupt and record the stack trace
2. Sampling frequency is customizable
3. Results are statistical averages
4. Example tools: VTune, Oracle collect
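
To make "results are statistical averages" concrete, here is a toy simulation rather than a real profiler. It takes a made-up timeline of which function is running when, "interrupts" it at a fixed period, and counts which function each sample lands in. The `Segment` type and the `sample_timeline` function are hypothetical names introduced only for this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical timeline entry: `name` runs during [start_ms, end_ms).
struct Segment {
    std::string name;
    int start_ms;
    int end_ms;
};

// Take one sample every `period_ms` and count which function was running.
std::map<std::string, int> sample_timeline(const std::vector<Segment>& timeline,
                                           int total_ms, int period_ms) {
    std::map<std::string, int> hits;
    for (int t = 0; t < total_ms; t += period_ms) {
        for (const Segment& s : timeline) {
            if (s.start_ms <= t && t < s.end_ms) {
                ++hits[s.name];  // this sample is attributed to s
                break;
            }
        }
    }
    return hits;
}
```

With a timeline of `parse` over [0, 71), `log` over [71, 75) and `render` over [75, 100), sampling every 10 ms attributes 8 samples to `parse`, 2 to `render`, and none to `log`: a short function can fall between samples, which is why a higher sampling frequency or a longer run gives a more trustworthy picture.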
---
### Sampling profiling
* Pros
    1. Only needs to be able to read the stack trace
    2. Minimal debug info is enough
    3. Works out of the box on any executable
* Cons
    1. Inlined functions are usually invisible
---
### Instrumentation Profiling
* Add code hooks to explicitly record metrics
* Can provide both averages and exact breakdown by execution frame
* Not affected by inlining or statistical anomalies
* Example tool: Optick
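
As a rough idea of what a code hook can look like, the sketch below uses an RAII timer that records a call count and total time per named scope. It is not Optick's API: `ScopeStats`, `registry`, `ScopedTimer` and `sum_to` are hypothetical names, and the registry is deliberately not thread-safe to keep the example short.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-scope statistics; a real tool records far more.
struct ScopeStats {
    long long calls = 0;
    std::chrono::nanoseconds total{0};
};

// Registry keyed by scope name (not thread-safe; illustration only).
inline std::map<std::string, ScopeStats>& registry() {
    static std::map<std::string, ScopeStats> stats;
    return stats;
}

// RAII hook: starts timing on construction, records on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        ScopeStats& s = registry()[name_];
        ++s.calls;
        s.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Example of an instrumented function.
inline int sum_to(int n) {
    ScopedTimer timer{"sum_to"};  // the hook: one line per scope of interest
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

Because the hook runs on every call, the counts are exact rather than sampled, and the timer itself adds a small cost to each instrumented scope.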
---
### Instrumentation Profiling
* Requires programmers to add collection macros in tactical places in the code
* Supports adding extra business metadata
* Can fall back on sampling
* Build implications
* The hooks may affect the metrics
---
### Sampling vs Instrumentation
* Sampling
    * Lower execution overhead
    * Smaller output database
    * Statistical results are less precise
    * Suitable for initial analysis
* Instrumentation
    * Records every instrumented call exactly
    * Slower execution
    * Larger output database
    * Suitable for fine-grained analysis
---
### Using the right tool
* Instrumentation (+ some sampling) is the recommended way to go
* Sampling alone is cheaper to start with
* Consider adding instrumentation as an investment
* Recent VTune versions also support an instrumentation library.
---
### Setting up goals
1. Set up a reproducible scenario
2. Measure its performance
3. Define an objective
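
One possible way to turn these three steps into code is a tiny harness: run the same scenario several times, take the median, and compare it with a target. The names `time_runs`, `median` and `meets_objective` are hypothetical, and the target value is whatever objective the team agrees on.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `scenario` `runs` times and return each wall-clock duration in ms.
template <typename F>
std::vector<double> time_runs(F&& scenario, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        scenario();
        auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return samples;
}

// Median of a non-empty sample set; less sensitive to outliers than the mean.
inline double median(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    std::size_t n = samples.size();
    return n % 2 == 1 ? samples[n / 2]
                      : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

// Hypothetical objective check: is the median within the target budget?
inline bool meets_objective(const std::vector<double>& samples, double target_ms) {
    return !samples.empty() && median(samples) <= target_ms;
}
```

A call such as `meets_objective(time_runs(scenario, 10), 50.0)` then gives a yes/no answer before and after each optimization attempt, as long as the scenario itself stays the same.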

---
### Finding the needle
* Looking at a profile for the first time can be overwhelming
* Look at what sticks out
* Domain knowledge is key
---
### Know the program
* A profiler can tell what takes the most time
* It can explain why
* But it cannot tell if it should
* Know the difference between what takes time and what should take time.
---
### Profiling metrics
* CPU time
* Wait time
* System time
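
The difference between CPU time and wait time can be seen with two clocks. The sketch below assumes a POSIX-like system, where `std::clock` reports processor time used by the process, and compares it with wall-clock time from `std::chrono::steady_clock`. `Timing`, `measure` and `mostly_waiting` are hypothetical names for this illustration.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time charged to this process
};

// Measure a callable with both clocks.
template <typename F>
Timing measure(F&& work) {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();
    work();
    std::clock_t cpu_end = std::clock();
    auto wall_end = std::chrono::steady_clock::now();
    Timing t;
    t.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// A task that mostly waits: the thread sleeps instead of computing.
inline void mostly_waiting() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

For `measure(mostly_waiting)` the wall time is at least the sleep duration while the CPU time stays close to zero; a compute-bound loop would show the two values close together instead.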
---
### High CPU Time
* Inefficient algorithms or data structures
* Spin locks
* Single threaded code
* Branch misprediction, cache misses
---
### High Wait time
* Disk I/O
* Network calls
* Locks
* Synchronization
---
### Inefficient algorithm
* Time spent in loops and recursive calls
* Check the Big O
* Can some computations be cached and reused?
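
As a small illustration of caching and reusing computations, the sketch below compares a naive recursive Fibonacci with a memoized one and counts the calls each makes. The function names and the `calls` counter are hypothetical and exist only to make the difference measurable.

```cpp
#include <cstdint>
#include <unordered_map>

// Naive recursion: recomputes the same subproblems exponentially often.
inline std::uint64_t fib_naive(int n, long long& calls) {
    ++calls;
    if (n < 2) return static_cast<std::uint64_t>(n);
    std::uint64_t a = fib_naive(n - 1, calls);
    std::uint64_t b = fib_naive(n - 2, calls);
    return a + b;
}

// Same recurrence with a cache: each value is computed only once.
inline std::uint64_t fib_cached(int n, std::unordered_map<int, std::uint64_t>& cache,
                                long long& calls) {
    ++calls;
    if (n < 2) return static_cast<std::uint64_t>(n);
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;  // reuse the earlier result
    std::uint64_t a = fib_cached(n - 1, cache, calls);
    std::uint64_t b = fib_cached(n - 2, cache, calls);
    cache.emplace(n, a + b);
    return a + b;
}
```

For `n = 20` both return 6765, but the naive version makes 21891 calls while the cached one makes 39. In a profile, the naive version shows up as time spent in deep recursive call chains.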
---
unordered_set: emplace vs insert
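```cpp=
#include <unordered_set>
#include <vector>

// Fill a set with either insert() or emplace() so the two paths
// show up separately in a profile.
void insert(const std::vector<int>& v, bool use_insert) {
    std::unordered_set<int> s;
    if (use_insert) {
        for (auto value : v) {
            s.insert(value);
        }
    } else {
        for (auto value : v) {
            s.emplace(value);
        }
    }
}
```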
---
Profiler:

* No inline function information
---
### Inefficient Data structures
* Time spent in **find**, **insert**, or **operator[]**
* Easier to spot without inlining
* Know your data structures' strengths and weaknesses
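
A minimal sketch of the kind of change such a hotspot can suggest: the same membership check done with a linear `std::find` over a vector and with a hash set. The function names are hypothetical, and which version wins in practice depends on the sizes involved, so it is still worth measuring.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n) per lookup: std::find scans the vector.
inline std::size_t count_present_linear(const std::vector<int>& haystack,
                                        const std::vector<int>& needles) {
    std::size_t found = 0;
    for (int needle : needles) {
        if (std::find(haystack.begin(), haystack.end(), needle) != haystack.end()) {
            ++found;
        }
    }
    return found;
}

// O(1) average per lookup, after a one-time O(n) build of the hash set.
inline std::size_t count_present_hashed(const std::vector<int>& haystack,
                                        const std::vector<int>& needles) {
    const std::unordered_set<int> index(haystack.begin(), haystack.end());
    std::size_t found = 0;
    for (int needle : needles) {
        if (index.count(needle) != 0) ++found;
    }
    return found;
}
```

Both functions return the same count; in a profile the first one shows its time under `std::find`, which is the kind of signal this slide describes.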
---
std::vector vs tbb::concurrent_vector
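```cpp=
#include <cstddef>

// Container: std::vector<int> or tbb::concurrent_vector<int>
template <typename Container>
void set(Container& vec, std::size_t index, int value) {
    if (vec.size() <= index) {
        vec.resize(index + 1);
    }
    vec[index] = value;
}

template <typename Container>
void fill(Container& vec, std::size_t n) {  // the measured loop
    for (std::size_t i = 1; i <= n; ++i) set(vec, i, static_cast<int>(i));
}
```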

---
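Spin lock vs no lock
```cpp=
#include <cstddef>
#include <tbb/spin_mutex.h>

tbb::spin_mutex mu;  // shared by every caller of set()

template <typename Container>
void set(Container& vec, std::size_t index, int value) {
    tbb::spin_mutex::scoped_lock lock{mu};  // busy-waits until the lock is free
    if (vec.size() <= index) {
        vec.resize(index + 1);
    }
    vec[index] = value;
}

template <typename Container>
void fill(Container& vec, std::size_t n) {  // the measured loop
    for (std::size_t i = 1; i <= n; ++i) set(vec, i, static_cast<int>(i));
}
```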

---
### spin and overhead time
* [Reference](https://www.isus.jp/file/VTuneHelp2018/GUID-AB714EB2-BD4C-11E2-927D-E02A82CB6E13.html)
* Overhead time is the time the system takes to deliver a shared resource from a releasing owner to an acquiring owner. Ideally, the Overhead time should be close to zero because it means the resource is not being wasted through idleness.
---
### spin and overhead time
* [Reference](https://www.isus.jp/file/VTuneHelp2018/GUID-AB714EB2-BD4C-11E2-927D-E02A82CB6E13.html)
* Spin time is the Wait time during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting.
---
### spin and overhead time
* [Reference](https://www.isus.jp/file/VTuneHelp2018/GUID-AB714EB2-BD4C-11E2-927D-E02A82CB6E13.html)
* VTune Amplifier provides the combined ==Overhead and Spin Time== metric in the grid and Timeline view of the Hotspots by CPU Usage, Hotspots by Thread Concurrency, and Hotspots viewpoints. This metric represents the sum of the Overhead and Spin time values calculated as ==CPU Time where Call Site Type is Overhead + CPU Time where Call Site Type is Synchronization==.
---
### wait time on mutex
* High wait time on synchronization functions
* It shouldn't be called a mutex; it should be called a bottleneck
* Consider changing the concurrency model
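
One example of changing the concurrency model, sketched under simple assumptions: instead of every thread taking the same lock for each element, each thread accumulates privately and the results are combined once at the end. Both function names are hypothetical, and the striped work split is just one possible choice.

```cpp
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Every addition takes the same lock, so threads mostly wait on each other.
inline long long sum_with_shared_lock(const std::vector<int>& data, unsigned workers) {
    long long total = 0;
    std::mutex mu;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            for (std::size_t i = w; i < data.size(); i += workers) {
                std::lock_guard<std::mutex> lock(mu);
                total += data[i];
            }
        });
    }
    for (std::thread& t : pool) t.join();
    return total;
}

// Each thread accumulates privately; results are combined once after join.
inline long long sum_with_partials(const std::vector<int>& data, unsigned workers) {
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            long long local = 0;
            for (std::size_t i = w; i < data.size(); i += workers) local += data[i];
            partial[w] = local;  // one write per thread, no lock needed
        });
    }
    for (std::thread& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Both versions return the same sum. The first is the pattern that shows up as high wait time on the mutex; the second removes the shared lock from the hot loop entirely.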
---
### Synchronization API
* I/O
* Lock
* Memory allocation, for example `operator new`
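```cpp=
#include <cstddef>

// Container: e.g. std::vector<int*>; resize() fills new slots with nullptr.
template <typename Container>
void set(Container& vec, std::size_t index, int value) {
    if (vec.size() <= index) {
        vec.resize(index + 1);
    }
    delete vec[index];            // no-op when the slot is still nullptr
    vec[index] = new int{value};  // heap allocation on every call
}

template <typename Container>
void fill(Container& vec, std::size_t n) {  // the measured loop
    for (std::size_t i = 1; i <= n; ++i) set(vec, i, static_cast<int>(i));
}
```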
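
If the allocator does turn out to be the hotspot in the `set` example above, one possible direction is to avoid the per-element `new` altogether. The sketch below is an assumption about what such a change could look like, not something measured in this deck: it stores the values directly and sizes the vector once. `fill_by_value` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Stores values directly: a single allocation for the whole vector
// instead of one new/delete pair per element.
inline std::vector<int> fill_by_value(std::size_t n) {
    std::vector<int> vec(n + 1);  // slot 0 stays 0, the loop starts at index 1
    for (std::size_t index = 1; index <= n; ++index) {
        vec[index] = static_cast<int>(index);
    }
    return vec;
}
```

This only applies when the elements do not need to live on the heap individually; when they do, a pool or arena allocator is the usual alternative to look into.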
---
### Oracle collect and VTune
0. Binaries
    * `collect` and `analyzer`
        * `/depotbld/RHEL5.5/SUNWspro-SS12.5/solarisstudio12.5/bin`
    * `vtune` and `vtune-gui`
        * `/depotbld/RHEL7.0/intel/inspector-v-2022.2.0.262/oneapi/vtune/2022.2.0/bin64/`
---
### Oracle collect and VTune
1. Basic usage
    * `collect <command>`
        * Writes the results to the `test.1.er` folder
    * `vtune -collect hotspots <command>`
        * Writes the results to the `r000hs` folder
---
### Oracle collect and VTune
2. Open the analyzer GUI
    * `analyzer test.1.er`
    * `vtune-gui r000hs`
3. Specify the output folder name to replace the default `*.er` and `r000hs`
    * `collect -o <folder_name>.er`
    * `vtune -r <folder_name>`
4. Attach to a PID
    * `vtune -collect hotspots -target-pid=<PID>`
    * `collect -P <PID>`
    * Useful for a hung program (demo)
---
### Profiler features
* Features in both Oracle collect and VTune
    * Bottom-up
    * Top-down
    * Timeline
    * Filter
* More in VTune
    * Better summary report
    * `vtune -collect threading`
        * More information on spin and overhead time
---
### Conclusion
* Premature optimization is evil
    * Consider design-level performance while developing, but leave fine-tuning for later.
* Choose a suitable profiler
    * Sampling for hotspots.
    * Instrumentation for fine-grained profiling.
---
# Thanks