### [Peilin Ye's blog](https://hackmd.io/@ypl/Sk8YAobw9)
# About `nproc`
Adding `-j$(nproc)` to `make` has become a habit, so today let's look at how `nproc` works on (my) Linux x86_64 machine. `nproc` is part of Coreutils (GNU Core Utilities); I'm running `8.30-3` on Debian 10:
```shell
$ dpkg -S $(which nproc)
coreutils: /usr/bin/nproc
$ apt-cache show coreutils | grep -i version
Version: 8.30-3
$ nproc --version | head -1
nproc (GNU coreutils) 8.30
```
`man 1 nproc` says:
```
NAME
nproc - print the number of processing units available
```
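> From here on, "CPU" always means "logical CPU".

Without `--all`, `nproc` prints the number of CPUs available to the current thread; with `--all`, it reports the total number of CPUs on the machine. For example, I start a Bash pinned to CPU #9: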
```shell
$ nproc
64
$ nproc --all
64
$ taskset -c 9 bash
$ nproc
1
$ nproc --all
64
```
`nproc` reports `1`, while `nproc --all` still reports `64`.
## `nproc`
Let's start with the case without `--all`. Just `strace` it:
```shell!
$ strace nproc 2>&1 | grep sched
sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]) = 64
```
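> Side note: `= 64` is `strace` reporting the return value of the raw `sched_getaffinity()` system call, which on success is the number of bytes written into the mask, not a CPU count. The glibc wrapper is what returns `0` on success; `man 2 sched_getaffinity` covers this under "C library/kernel differences".

`nproc` uses the `sched_getaffinity()` system call to fetch the current thread's CPU affinity mask ("Hello kernel, which CPU(s) can this thread run on?"), then counts the CPUs in that mask.
If you're curious, read the implementation of `num_processors_via_affinity_mask()` in `lib/nproc.c` of Gnulib (the GNU portability library) ([Gnulib stable-202301 GitWeb link](http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/nproc.c;h=2740c458c11f0dc2e22c7e38e835d6d5a3af6ed4;hb=refs/heads/stable-202301#l69)).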
## `nproc --all`
With `--all`, the behavior depends on the glibc version. I'm on 2.28; again, `strace` first:
```shell
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(3, /* 83 entries */, 32768) = 2560
getdents64(3, /* 0 entries */, 32768) = 0
```
…what is this doing? Look at the function `__get_nprocs_conf()` in glibc's `sysdeps/unix/sysv/linux/getsysstats.c` ([glibc-2.28 GitWeb link](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getsysstats.c;h=05533bcc3bebb1e7a79b3322c2418a18c478d23e;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b#l233)):
```c=240
/* Try to use the sysfs filesystem. It has actual information about
online processors. */
DIR *dir = __opendir ("/sys/devices/system/cpu");
if (dir != NULL)
{
int count = 0;
struct dirent64 *d;
while ((d = __readdir64 (dir)) != NULL)
/* NB: the sysfs has d_type support. */
if (d->d_type == DT_DIR && strncmp (d->d_name, "cpu", 3) == 0)
{
char *endp;
unsigned long int nr = strtoul (d->d_name + 3, &endp, 10);
if (nr != ULONG_MAX && endp != d->d_name + 3 && *endp == '\0')
++count;
```
It opens `/sys/devices/system/cpu/` and counts the directories named `cpu` followed by a number:
```shell
$ ls /sys/devices/system/cpu/
cpu0 cpu12 cpu16 cpu2 cpu23 cpu27 cpu30 cpu34 cpu38 cpu41 cpu45 cpu49 cpu52 cpu56 cpu6 cpu63 cpufreq isolated nohz_full power vulnerabilities
cpu1 cpu13 cpu17 cpu20 cpu24 cpu28 cpu31 cpu35 cpu39 cpu42 cpu46 cpu5 cpu53 cpu57 cpu60 cpu7 cpuidle kernel_max offline present
cpu10 cpu14 cpu18 cpu21 cpu25 cpu29 cpu32 cpu36 cpu4 cpu43 cpu47 cpu50 cpu54 cpu58 cpu61 cpu8 hotplug microcode online smt
cpu11 cpu15 cpu19 cpu22 cpu26 cpu3 cpu33 cpu37 cpu40 cpu44 cpu48 cpu51 cpu55 cpu59 cpu62 cpu9 intel_pstate modalias possible uevent
```
Uh…
Newer glibc does a bit better and reads the file `/sys/devices/system/cpu/possible` instead ([glibc-2.37 GitWeb link](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getsysstats.c;h=b0b6c154acab6ff635399835459401387700778c;hb=a704fd9a133bfb10510e18702f48a6a9c88dbbd5#l226)):
```c=228
int
__get_nprocs_conf (void)
{
int result = read_sysfs_file ("/sys/devices/system/cpu/possible");
if (result != 0)
return result;
/* Fall back to /proc/stat and sched_getaffinity. */
return get_nprocs_fallback ();
}
```
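## Reading `nr_cpu_ids` from `/proc/kcore`
In fact, once the Linux kernel has finished initializing, its global variable `nr_cpu_ids` already holds what we want from `nproc --all`: the total number of CPUs in the system (strictly speaking, the highest possible CPU number plus one).
> To preempt nitpicks: `nr_cpu_ids` can also be a macro, see `include/linux/cpumask.h` ([Linux v6.2 GitWeb link](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/cpumask.h?h=v6.2#n38)). On an SMP machine without `CONFIG_FORCE_NR_CPUS` this shouldn't matter.

`procfs` has a file named `kcore`; see `man 5 proc`: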
```
/proc/kcore
This file represents the physical memory of the system and
is stored in the ELF core file format. With this pseudo-
file, and an unstripped kernel (/usr/src/linux/vmlinux)
binary, GDB can be used to examine the current state of
any kernel data structures.
```
As the man page says, with the running kernel's vmlinux (with debug info) and `/proc/kcore`, we can print `nr_cpu_ids` directly from GDB:
```gdb
$ gdb -q /usr/lib/debug/lib/modules/$(uname -r)/vmlinux -c /proc/kcore
Reading symbols from /usr/lib/debug/lib/modules/.../vmlinux...
[New process 1]
Core was generated by ...
#0 0x0000000000000000 in fixed_percpu_data ()
(gdb) print nr_cpu_ids
$1 = 512
```
Awkward. Wasn't it `64`? Why is it `512` now…
### KASLR
It's [KASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Kernel_address_space_layout_randomization). I have KASLR enabled (`CONFIG_RANDOMIZE_BASE`, kernel address space layout randomization):
```shell
$ grep CONFIG_RANDOMIZE_BASE /boot/config-$(uname -r)
CONFIG_RANDOMIZE_BASE=y
```
vmlinux expects its text, starting at the `_text` symbol, to be loaded at `0xffffffff81000000`:
```shell
$ readelf -l /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000
0x000000000151ed44 0x000000000151ed44 R E 0x200000
...
```
```shell
$ nm /usr/lib/debug/lib/modules/$(uname -r)/vmlinux | grep " _text$"
ffffffff81000000 T _text
```
KASLR, however, loaded it at `0xffffffffb2800000`:
```shell
$ grep " _text$" /proc/kallsyms
ffffffffb2800000 T _text
```
GDB doesn't know that KASLR is on, so it read from the wrong address. I can use the `-o` option of GDB's `add-symbol-file` command to apply an offset: `0xffffffffb2800000` minus `0xffffffff81000000` is `0x31800000`:
```gdb
$ gdb -q -c /proc/kcore
[New process 1]
Core was generated by ...
#0 0x0000000000000000 in ?? ()
(gdb) add-symbol-file /usr/lib/debug/lib/modules/.../vmlinux -o 0x31800000
add symbol table from file "/usr/lib/debug/lib/modules/.../vmlinux" with all sections offset by 0x31800000
(y or n) y
Reading symbols from /usr/lib/debug/lib/modules/.../vmlinux...done.
(gdb) print nr_cpu_ids
$1 = 64
```
Now the value is correct.
### LibElf
Let's drop GDB, since it reads the wrong place anyway, and hand-write a C program that does the same thing! kcore is an ELF core file (so I'll use [LibElf](https://sourceware.org/elfutils/)):
```shell
$ readelf -l /proc/kcore
Elf file type is CORE (Core file)
Entry point 0x0
There are 10 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x0000000000000270 0x0000000000000000 0x0000000000000000
0x0000000000001de8 0x0000000000000000 0x0
LOAD 0x00007fffb2803000 0xffffffffb2800000 0x0000001e75200000
0x000000000262c000 0x000000000262c000 RWE 0x1000
...
```
```shell
$ grep " _text$" /proc/kallsyms
ffffffffb2800000 T _text
```
The segment whose `VirtAddr` equals `0xffffffffb2800000` (`_text`) is the one that corresponds to our vmlinux; see the function `proc_kcore_text_init()` in the kernel's `fs/proc/kcore.c` ([Linux v6.2 GitWeb link](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/kcore.c?h=v6.2#n635)):
```c=641
static void __init proc_kcore_text_init(void)
{
kclist_add(&kcore_text, _text, _end - _text, KCORE_TEXT);
}
```
On x86_64, this segment has existed at least since Linux v2.6 commit [9492587cf35d ("kcore: register text area in generic way")](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9492587cf35d370db33ef4b38375dfb35a105b61). We find its `Offset` in kcore, add the distance from `_text` to `nr_cpu_ids`, and read from there:
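```c=
#define _LARGEFILE64_SOURCE
#include <fcntl.h>
#include <libelf.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

const unsigned long long vaddr_text = 0xffffffffb2800000;       /* $ grep " _text$" /proc/kallsyms */
const unsigned long long vaddr_nr_cpu_ids = 0xffffffffb3fb09c4; /* $ grep " nr_cpu_ids" /proc/kallsyms */

int main(void)
{
    /* Reading /proc/kcore requires root. */
    int fd = open("/proc/kcore", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* libelf requires elf_version() before any other call. */
    elf_version(EV_CURRENT);
    Elf *elf = elf_begin(fd, ELF_C_READ, NULL);

    Elf64_Ehdr ehdr; /* ELF Executable Header */
    lseek64(fd, 0, SEEK_SET);
    if (read(fd, &ehdr, sizeof(ehdr)) != sizeof(ehdr))
        return 1;

    /* Find the segment whose virtual address is _text. */
    Elf64_Phdr phdr; /* ELF Program Header(s) */
    int found = 0;
    for (int i = 0; i < ehdr.e_phnum && !found; i++) {
        lseek64(fd, ehdr.e_phoff + (i * sizeof(phdr)), SEEK_SET);
        if (read(fd, &phdr, sizeof(phdr)) != sizeof(phdr))
            return 1;
        found = (phdr.p_vaddr == vaddr_text);
    }
    if (!found) {
        fprintf(stderr, "no segment at _text\n");
        return 1;
    }

    /* nr_cpu_ids sits (vaddr_nr_cpu_ids - vaddr_text) bytes into that segment. */
    int nr_cpu_ids;
    lseek64(fd, phdr.p_offset + vaddr_nr_cpu_ids - vaddr_text, SEEK_SET);
    if (read(fd, &nr_cpu_ids, sizeof(nr_cpu_ids)) != sizeof(nr_cpu_ids))
        return 1;
    printf("%d\n", nr_cpu_ids);

    elf_end(elf);
    close(fd);
    return 0;
}
```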
```shell
$ gcc -o nr_cpu_ids nr_cpu_ids.c -lelf
$ ./nr_cpu_ids
64
```
I still had to grep kallsyms, though. I don't know how [crash](https://crash-utility.github.io/) and [drgn](https://drgn.readthedocs.io/en/latest/) deal with KASLR.
## References
1. [debugging - Linux Kernel symbol addresses don't match between /proc/kcore and /proc/kallsyms - Stack Overflow](https://stackoverflow.com/questions/55583655/linux-kernel-symbol-addresses-dont-match-between-proc-kcore-and-proc-kallsyms)