Kernel Tuning
===

- File system: XFS or BTRFS
- I/O scheduler: switch to deadline or noop (on kernels that use blk-mq, the equivalents are mq-deadline and none)
    - `echo deadline > /sys/block/sda/queue/scheduler`
    - or: `GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet elevator=noop"`

https://shape.host/resources/kernel-configuration-for-high-performance-computing-on-linux

---

Power management

`GRUB_CMDLINE_LINUX="intel_idle.max_cstate=0 processor.max_cstate=0"`

and

`cpupower -c all frequency-set --governor performance`

or

`cpufreq-set -r -g performance`

https://blog.csdn.net/y33988979/article/details/107361365

---

tickless

http://kernel-tour.org/time/tickless.html

---

Disable CPU vulnerability mitigations

`GRUB_CMDLINE_LINUX="(other options omitted) spectre_v2=off nopti spec_store_bypass_disable=off"`

https://ssorc.tw/8223/google-performance-tuning-implementation-for-linux/

---

Transparent Huge Pages

`echo always > /sys/kernel/mm/transparent_hugepage/enabled`

https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/49283082/Linux+Configuration+for+Performance+Tuning

The following configuration options can serve as a reference for all memory modes:

- Disable swap. This applies especially to "HBM-only" mode, where capacity is limited. Swapping severely degrades performance. If running an application causes swapping, consider freeing memory (for example, by clearing the file system cache) or scaling out to more nodes.
- Enable "zone_reclaim_mode" to reduce NUMA misses. This mode is well suited to small NUMA nodes (for example, the "SNC4" clustering mode). When it is enabled, the Linux page allocator first reclaims easily reusable pages on the requested NUMA node before taking memory from other NUMA nodes. This cuts unnecessary cross-NUMA traffic and the performance loss that comes with it, although the reclaim activity itself may cause small performance fluctuations. Enable the option with the command below. The setting must be reapplied after every reboot, so set it automatically from an init script.

  ```
  echo 2 > /proc/sys/vm/zone_reclaim_mode
  ```

- Before each run, if the content cached by the previous run is of no further use, clear the file system cache and compact memory with the commands below. These commands need root privileges, so system administrators should consider adding them to the batch system's job prologue or providing them as a setuid binary.

  ```
  sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
  ```

- Enable Transparent Huge Pages (THP). Most scientific computing applications benefit from THP. Creating huge pages can incur overhead because memory has to be compacted, but administrators can reduce it by compacting memory before each run as described above.
- Avoid storing files in /dev/shm (tmpfs), because doing so reduces available memory. System administrators should add clearing /dev/shm to the job prologue to reduce interference between jobs.
- Use the latest stable Linux kernel (5.15 at the time of writing).

---

### 100G Tuning (data from 2016, may be outdated)

```
# add to /etc/sysctl.conf
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
```

Don't Forget about NUMA Issues

- Up to 2x performance difference if you use the wrong core.
- If you have a 2 CPU socket NUMA host, be sure to:
    - Turn off irqbalance
    - Figure out what socket your NIC is connected to: `cat /sys/class/net/ethN/device/numa_node`
    - Run Mellanox IRQ script: `/usr/sbin/set_irq_affinity_bynode.sh 1 ethN`
    - Bind your program to the same CPU socket as the NIC: `numactl -N 1 program_name`
- Which cores belong to a NUMA socket?
    - `cat /sys/devices/system/node/node0/cpulist`
    - (note: on some Dell servers, that might be: 0,2,4,6,..)

https://www.es.net/assets/Uploads/100G-Tuning-TechEx2016.tierney.pdf

### Another 100G Tuning

CPU settings matter:

`intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll intel_pstate=disable`

https://forum.proxmox.com/threads/mellanox-connect-x-6-100g-is-limited-to-bitrate-34gbits-s.96378/
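
Before changing any of the settings in this note, it can help to see what a host is currently using. The following is a minimal read-only shell sketch and does not come from the linked sources. The function names `show` and `audit`, the `SYSFS_ROOT` and `PROC_ROOT` overrides, and the default device names `sda` and `eth0` are assumptions made for illustration. Only the file paths come from the commands above, plus the standard cpufreq `scaling_governor` file.

```shell
#!/bin/sh
# Read-only sketch: print the current value of several knobs from this note.
# Nothing is written to /sys or /proc.

# Print "label: value", or "label: n/a" when the file is missing or unreadable.
show() {
    label=$1
    file=$2
    if [ -r "$file" ]; then
        printf '%s: %s\n' "$label" "$(cat "$file")"
    else
        printf '%s: n/a\n' "$label"
    fi
}

# audit [disk] [nic]
# SYSFS_ROOT / PROC_ROOT are hypothetical overrides so the function can be
# pointed at a copy of the tree; they default to the real /sys and /proc.
audit() {
    disk=${1:-sda}    # placeholder block device name
    nic=${2:-eth0}    # placeholder network interface name
    sys=${SYSFS_ROOT:-/sys}
    proc=${PROC_ROOT:-/proc}

    show "io scheduler ($disk)" "$sys/block/$disk/queue/scheduler"
    show "cpu0 governor" "$sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
    show "thp" "$sys/kernel/mm/transparent_hugepage/enabled"
    show "zone_reclaim_mode" "$proc/sys/vm/zone_reclaim_mode"
    show "numa node ($nic)" "$sys/class/net/$nic/device/numa_node"
}
```

After sourcing the file, `audit sda eth0` prints one line per setting. An `n/a` means the file does not exist or is not readable on that kernel. For the scheduler and THP files, the active choice is the one shown in square brackets.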