# Does running the CIS benchmark in shell scripts affect enforcer CPU usage?

## Problem

![top output (2)](https://hackmd.io/_uploads/r1PGHMi3p.png)

## Conclusion

1. The high CPU usage may be linked to the shell-script version of the CIS benchmark.
2. Rewriting the benchmark in Go could offer the following estimated benefits:
    1. Average CPU usage reduction (compared to base)
        * go: 20-30%
        * goOpt: 30-50%
    2. sysblock time drops from 24s to less than 1 sec
    3. Benchmark execution time drops from 50s to 1 sec

## Experiment setup

- A self-managed k8s cluster with 2 masters and 1 worker.
- Kubernetes version 1.27

## Analysis

### kube metric-server

Running `kubectl top pod -n neuvector | grep enforcer` against the kube metric-server gives the following (pod name, CPU in millicores, memory):

```bash=
neuvector-enforcer-pod-556m5   1170m   42Mi
neuvector-enforcer-pod-w6fkz   1189m   48Mi
neuvector-enforcer-pod-xn8zd   1665m   45Mi
```

Performance is compared across four modes:

1. noneMode: the enforcer does not run the benchmark at all; this is the baseline.
2. yamlMode: the current implementation; the shell scripts read the target yaml folder and then run the tests.
3. scriptsMode: back to the initial implementation (I wrote a script that auto-generates the shell code from the yaml).
4. sleepMode: just sleep for 5 sec (yamlMode runs for 5 sec for kube_master.sh).

Comparing the three nodes shows that running the CIS benchmark drives the CPU usage peaks, but scriptsMode shortens the period of high CPU usage by approximately 10 seconds.

![Overview_node1](https://hackmd.io/_uploads/Hyo35YKh6.png)
![Overview_node2](https://hackmd.io/_uploads/Byo2qFY2a.png)
![Overview_node3](https://hackmd.io/_uploads/SJs2qtK3T.png)

### pprof && trace

The pprof goroutine profile shows low CPU usage, yet the trace reveals long waits in blocking syscalls. This suggests that work launched through `exec.Command` may not be fully captured in pprof metrics.
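One way to see where that time goes is to ask the OS how much CPU a child process used, since a Go CPU profile only samples the Go process itself. The following is a minimal sketch, not the enforcer's code: `childCPU` and the shell loop in `main` are hypothetical names and workloads chosen for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// childCPU (hypothetical helper) runs a command and returns the user+system
// CPU time of the child, as reported by the OS when the child exits.
func childCPU(name string, args ...string) (time.Duration, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Run(); err != nil {
		return 0, err
	}
	st := cmd.ProcessState
	return st.UserTime() + st.SystemTime(), nil
}

func main() {
	start := time.Now()
	// Assumed workload: a shell loop that starts one grep per iteration,
	// loosely imitating how the benchmark scripts call external tools.
	script := "i=0; while [ $i -lt 200 ]; do echo x | grep x >/dev/null; i=$((i+1)); done"
	cpu, err := childCPU("sh", "-c", script)
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	// While the script runs, the calling goroutine is only waiting on the child.
	fmt.Printf("wall=%v child_cpu=%v\n", time.Since(start), cpu)
}
```

In this sketch the Go side spends the whole run blocked waiting for the child, which is consistent with a trace full of syscall waits next to a nearly idle pprof profile.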
![Screen Shot 2024-02-22 at 11.40.47 PM](https://hackmd.io/_uploads/rJ-oxcF3p.png)

### bpftrace

The suspicion is that the high CPU usage comes from commands in the shell scripts forking excessively. Tracing three experimental setups confirms it: the master scripts trigger over 1000 forks (1760) each time the CIS benchmark runs, and the worker scripts trigger 386.

```bash=
bpftrace -e '
// Count exec calls per (command, binary) inside the enforcer cgroup.
tracepoint:syscalls:sys_enter_execve /cgroup == 18443/ {
    @exec_calls[comm, str(args->filename)] = count();
}
// Count total exec calls per command.
tracepoint:syscalls:sys_enter_execve /cgroup == 18443/ {
    @total_exec_calls[comm] = count();
}
// Count process creation through fork, vfork and clone.
tracepoint:syscalls:sys_enter_fork /cgroup == 18443/ {
    @fork_calls[comm] = count();
}
tracepoint:syscalls:sys_enter_vfork /cgroup == 18443/ {
    @vfork_calls[comm] = count();
}
kprobe:__x64_sys_clone /cgroup == 18443/ {
    @clone_calls[comm] = count();
}
'
```

1. Run CIS benchmark for master

    ```bash=
    @exec_calls[sh, /bin/sh]: 1
    @exec_calls[sh, /usr/bin/paste]: 1
    @exec_calls[sh, /usr/bin/find]: 5
    @exec_calls[sh, /bin/ps]: 10
    @exec_calls[sh, /bin/stat]: 19
    @exec_calls[sh, /usr/bin/pgrep]: 55
    @exec_calls[sh, /usr/bin/tr]: 55
    @exec_calls[sh, /bin/sed]: 102
    @exec_calls[sh, /usr/bin/cut]: 107
    @exec_calls[sh, /bin/grep]: 588

    @fork_calls[sh]: 1760

    @total_exec_calls[sh]: 943
    ```

2. Run CIS benchmark for worker

    ```bash=
    @exec_calls[sh, /bin/sh]: 1
    @exec_calls[sh, /bin/stat]: 6
    @exec_calls[sh, /usr/bin/tr]: 20
    @exec_calls[sh, /usr/bin/pgrep]: 20
    @exec_calls[sh, /usr/bin/cut]: 24
    @exec_calls[sh, /bin/sed]: 24
    @exec_calls[sh, /bin/grep]: 117

    @fork_calls[sh]: 386

    @total_exec_calls[sh]: 212
    ```

3. Run sleep commands for 5 sec

    ```bash=
    @exec_calls[sh, /bin/sh]: 1
    @exec_calls[sh, /bin/sleep]: 1

    @fork_calls[sh]: 2

    @total_exec_calls[sh]: 2
    ```

## POC for Golang version

### Experiments

1. base: run the benchmark in shell.
2. go: run the benchmark in the Go version without a cache.
3. goOpt: version 2 with a cache. Most of the benchmark test items check the settings of a kube command, so we can build a cache to improve performance.

### Performance

1. kube metric-server for 200 seconds

    ![Overview_node1](https://hackmd.io/_uploads/HkYymfsnT.png)
    ![Overview_node2](https://hackmd.io/_uploads/H1Y17Gjha.png)
    ![Overview_node3](https://hackmd.io/_uploads/SkFy7Gj2a.png)

2. Total CPU usage in 200 seconds

    ```bash=
    node 1: total_cpu_usage: {'base': 482570, 'go': 300580, 'goOpt': 272180})
    node 2: total_cpu_usage: {'base': 621619, 'go': 301207, 'goOpt': 227524})
    node 3: total_cpu_usage: {'base': 736397, 'go': 384894, 'goOpt': 289292})
    ```

3. Run time to execute the benchmark

    |                | base   | go     | goOpt |
    | -------------- | ------ | ------ | ----- |
    | master scripts | 50.83s | 15s    | 20ms  |
    | worker scripts | 41.85s | 10.82s | 21ms  |

4. trace sysblock time
    1. base
       ![Screen Shot 2024-02-27 at 3.14.19 PM](https://hackmd.io/_uploads/BySC4Mohp.png)
    2. go
       ![Screen Shot 2024-02-27 at 2.58.55 PM](https://hackmd.io/_uploads/SkBJrGi2T.png)
    3. goOpt
       ![Screen Shot 2024-02-27 at 2.50.12 PM](https://hackmd.io/_uploads/rJh1HMs2T.png)

### Benchmark

```bash=
# Repeat check 1.2.6 500 times.
# get_argument_value, warn and pass are helpers defined elsewhere in the benchmark scripts.
i=1
while [ "$i" -le 500 ]
do
    id="1.2.6"
    description="Ensure that the --authorization-mode argument is not set to AlwaysAllow (Automated)"
    check="$id - $description"

    if get_argument_value "$CIS_APISERVER_CMD" '--authorization-mode' | grep 'AlwaysAllow' >/dev/null 2>&1; then
        warn "$check"
    else
        pass "$check"
    fi
    i=$((i + 1))
done
```
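For comparison, the same loop can be sketched in Go with the command line parsed once and reused, which is the idea behind the goOpt cache. This is only an illustration under stated assumptions, not the POC code: `parseFlags`, `checkAuthorizationMode` and the sample command line are hypothetical, and only `--flag=value` and bare `--flag` forms are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFlags (hypothetical helper) turns a process command line into a
// flag -> value map. It handles only "--flag=value" and bare "--flag".
func parseFlags(cmdline string) map[string]string {
	flags := make(map[string]string)
	for _, tok := range strings.Fields(cmdline) {
		if !strings.HasPrefix(tok, "--") {
			continue
		}
		name, value, _ := strings.Cut(tok, "=")
		flags[name] = value
	}
	return flags
}

// checkAuthorizationMode mirrors check 1.2.6: warn if the value of
// --authorization-mode contains AlwaysAllow (substring match, like grep).
func checkAuthorizationMode(flags map[string]string) string {
	if strings.Contains(flags["--authorization-mode"], "AlwaysAllow") {
		return "WARN"
	}
	return "PASS"
}

func main() {
	// Assumed sample command line; a real run would read it once from the
	// kube-apiserver process.
	cmdline := "kube-apiserver --authorization-mode=Node,RBAC --secure-port=6443"
	flags := parseFlags(cmdline) // parsed once, acts as the cache

	result := ""
	for i := 0; i < 500; i++ {
		result = checkAuthorizationMode(flags) // map lookup, no fork or exec
	}
	fmt.Println("1.2.6 -", result)
}
```

Each iteration here is an in-memory map lookup, whereas each iteration of the shell loop starts external processes such as `grep`.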