
About nproc

Adding -j$(nproc) to make has long been a habit of mine, so today let's look at how nproc actually works on (my) Linux x86_64 machine. nproc is part of Coreutils (GNU Core Utilities); I'm using 8.30-3 from Debian 10:

$ dpkg -S $(which nproc)
coreutils: /usr/bin/nproc
$ apt-cache show coreutils | grep -i version
Version: 8.30-3
$ nproc --version | head -1
nproc (GNU coreutils) 8.30

man 1 nproc says:

NAME
       nproc - print the number of processing units available

Below, whenever I say "CPU" I mean "logical CPU".

Without --all, nproc prints the number of CPUs available to the current thread; with --all, it reports the total number of CPUs on this machine. For example, start a Bash and pin it to CPU #9:

$ nproc
64
$ nproc --all
64
$ taskset -c 9 bash
$ nproc
1
$ nproc --all
64

Now nproc reports 1, while nproc --all still reports 64.

nproc

First, the case without --all. Let's just strace it:

$ strace nproc 2>&1 | grep sched
sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]) = 64

As an aside, the = 64 should be strace saying that sched_getaffinity() returned 64, yet the man page says sched_getaffinity() should return 0 on success, uh… (presumably strace is showing the raw system call, which returns the number of bytes copied into the mask buffer, while the 0-on-success behavior belongs to the glibc wrapper).

nproc uses the sched_getaffinity() system call to get the current thread's CPU affinity mask ("Hello kernel, which CPU(s) is this thread allowed to run on?"), then counts how many CPUs are in that mask.

If you're interested, take a look at the implementation of num_processors_via_affinity_mask() in Gnulib's (the GNU portability library) lib/nproc.c (Gnulib stable-202301 GitWeb link).
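To get a concrete feel for it, here's a minimal standalone sketch (my own, not Gnulib's num_processors_via_affinity_mask()) doing the same thing with glibc's sched_getaffinity() wrapper and the CPU_COUNT() macro:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        /* pid 0 means "the calling thread"; the glibc wrapper returns 0 on success */
        if (sched_getaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_getaffinity");
                return 1;
        }

        /* count how many CPUs are set in the affinity mask */
        printf("%d\n", CPU_COUNT(&set));
        return 0;
}

Run it under taskset -c 9 and it should print 1, just like nproc did above.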

nproc --all

With --all, it depends on the glibc version. I'm on 2.28; again, strace first:

openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(3, /* 83 entries */, 32768)  = 2560
getdents64(3, /* 0 entries */, 32768)   = 0

…what is this doing? Look at the function called __get_nprocs_conf() in glibc's sysdeps/unix/sysv/linux/getsysstats.c (glibc-2.28 GitWeb link):

  /* Try to use the sysfs filesystem.  It has actual information about
     online processors.  */
  DIR *dir = __opendir ("/sys/devices/system/cpu");
  if (dir != NULL)
    {
      int count = 0;
      struct dirent64 *d;

      while ((d = __readdir64 (dir)) != NULL)
        /* NB: the sysfs has d_type support.  */
        if (d->d_type == DT_DIR && strncmp (d->d_name, "cpu", 3) == 0)
          {
            char *endp;
            unsigned long int nr = strtoul (d->d_name + 3, &endp, 10);
            if (nr != ULONG_MAX && endp != d->d_name + 3 && *endp == '\0')
              ++count;

It opens /sys/devices/system/cpu/ and counts how many directories named cpu plus a number are in there:

$ ls /sys/devices/system/cpu/
cpu0   cpu12  cpu16  cpu2   cpu23  cpu27  cpu30  cpu34  cpu38  cpu41  cpu45  cpu49  cpu52  cpu56  cpu6   cpu63  cpufreq       isolated    nohz_full  power    vulnerabilities
cpu1   cpu13  cpu17  cpu20  cpu24  cpu28  cpu31  cpu35  cpu39  cpu42  cpu46  cpu5   cpu53  cpu57  cpu60  cpu7   cpuidle       kernel_max  offline    present
cpu10  cpu14  cpu18  cpu21  cpu25  cpu29  cpu32  cpu36  cpu4   cpu43  cpu47  cpu50  cpu54  cpu58  cpu61  cpu8   hotplug       microcode   online     smt
cpu11  cpu15  cpu19  cpu22  cpu26  cpu3   cpu33  cpu37  cpu40  cpu44  cpu48  cpu51  cpu55  cpu59  cpu62  cpu9   intel_pstate  modalias    possible   uevent
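Out of curiosity, here's a runnable standalone sketch (not glibc's code, and ignoring the truncated tail of the excerpt above) that mirrors that directory-counting logic:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        DIR *dir = opendir("/sys/devices/system/cpu");
        struct dirent *d;
        int count = 0;

        if (!dir)
                return 1;

        while ((d = readdir(dir)) != NULL) {
                /* only count directories named "cpu" followed by digits,
                 * e.g. "cpu0", "cpu63"; skip "cpufreq", "cpuidle", etc. */
                if (d->d_type == DT_DIR && strncmp(d->d_name, "cpu", 3) == 0) {
                        char *endp;
                        unsigned long nr = strtoul(d->d_name + 3, &endp, 10);

                        if (nr != ULONG_MAX && endp != d->d_name + 3 && *endp == '\0')
                                count++;
                }
        }

        closedir(dir);
        printf("%d\n", count);
        return 0;
}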

Uh…

Newer glibc does better; it reads the /sys/devices/system/cpu/possible file instead (glibc-2.37 GitWeb link):

int
__get_nprocs_conf (void)
{
  int result = read_sysfs_file ("/sys/devices/system/cpu/possible");
  if (result != 0)
    return result;

  /* Fall back to /proc/stat and sched_getaffinity.  */
  return get_nprocs_fallback ();
}
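For illustration, a rough standalone sketch of that idea (a simplification, not glibc's actual read_sysfs_file(); it assumes the file holds a range like 0-63, which is the typical format):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
        char buf[64];
        char *dash;

        if (!f)
                return 1;
        if (!fgets(buf, sizeof(buf), f))
                return 1;
        fclose(f);

        /* "0-63" -> take the number after the last '-' and add 1;
         * a lone "0" means a single CPU */
        dash = strrchr(buf, '-');
        printf("%ld\n", strtol(dash ? dash + 1 : buf, NULL, 10) + 1);
        return 0;
}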

/proc/kcore and nr_cpu_ids

Actually, the Linux kernel (once initialization is done) already keeps the information we want from nproc --all, i.e. the total number of CPUs in the system, in the global variable nr_cpu_ids.

Before anyone objects: nr_cpu_ids can also be a macro, see include/linux/cpumask.h (Linux v6.2 GitWeb link). On SMP machines without CONFIG_FORCE_NR_CPUS this shouldn't matter.

procfs has a file called kcore; see man 5 proc:

       /proc/kcore
              This file represents the physical memory of the system and
              is stored in the ELF core file format.  With this pseudo-
              file, and an unstripped kernel (/usr/src/linux/vmlinux)
              binary, GDB can be used to examine the current state of
              any kernel data structures.

As the man page says, given the vmlinux of the running kernel (with debug info) and /proc/kcore, we can print nr_cpu_ids directly with GDB:

$ gdb -q /usr/lib/debug/lib/modules/$(uname -r)/vmlinux -c /proc/kcore
Reading symbols from /usr/lib/debug/lib/modules/.../vmlinux...
[New process 1]
Core was generated by ...
#0  0x0000000000000000 in fixed_percpu_data ()
(gdb) print nr_cpu_ids
$1 = 512

Awkward. Shouldn't it be 64? How did it become 512…

KASLR

KASLR. I have KASLR enabled (CONFIG_RANDOMIZE_BASE, Kernel Address Space Layout Randomization):

$ grep CONFIG_RANDOMIZE_BASE /boot/config-$(uname -r)
CONFIG_RANDOMIZE_BASE=y

vmlinux thought its _text section would be loaded at 0xffffffff81000000:

$ readelf -l /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x000000000151ed44 0x000000000151ed44  R E    0x200000
...
$ nm /usr/lib/debug/lib/modules/$(uname -r)/vmlinux | grep " \_text"
ffffffff81000000 T _text

but KASLR loaded it at 0xffffffffb2800000 instead:

$ grep " \_text" /proc/kallsyms
ffffffffb2800000 T _text

GDB was never told that I have KASLR on, so it read from the wrong addresses. I can add an offset with the -o option of GDB's add-symbol-file command: 0xffffffffb2800000 - 0xffffffff81000000 = 0x31800000:

$ gdb -q -c /proc/kcore
[New process 1]
Core was generated by ...
#0  0x0000000000000000 in ?? ()
(gdb) add-symbol-file /usr/lib/debug/lib/modules/.../vmlinux -o 0x31800000
add symbol table from file "/usr/lib/debug/lib/modules/.../vmlinux" with all sections offset by 0x31800000
(y or n) y
Reading symbols from /usr/lib/debug/lib/modules/.../vmlinux...done.
(gdb) print nr_cpu_ids
$1 = 64

Now it's right.

LibElf

Forget GDB, it reads things from the wrong place anyway; let's hand-write a C program that does the same thing! kcore is an ELF core file (so I'll use LibElf):

$ readelf -l /proc/kcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 10 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000270 0x0000000000000000 0x0000000000000000
                 0x0000000000001de8 0x0000000000000000         0x0
  LOAD           0x00007fffb2803000 0xffffffffb2800000 0x0000001e75200000
                 0x000000000262c000 0x000000000262c000  RWE    0x1000
...
$ grep " \_text" /proc/kallsyms
ffffffffb2800000 T _text

The segment whose VirtAddr equals 0xffffffffb2800000 (_text) is the one corresponding to our vmlinux; see the function proc_kcore_text_init() in the kernel's fs/proc/kcore.c (Linux v6.2 GitWeb link):

static void __init proc_kcore_text_init(void)
{
        kclist_add(&kcore_text, _text, _end - _text, KCORE_TEXT);
}

On x86_64, this segment has been there at least since Linux v2.6 commit 9492587cf35d ("kcore: register text area in generic way"). We find its Offset within kcore, add the offset from _text to nr_cpu_ids, and read from there:

#define _LARGEFILE64_SOURCE
#include <fcntl.h>
#include <libelf.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

const unsigned long long vaddr_text = 0xffffffffb2800000;          /* $ grep " \_text" /proc/kallsyms */
const unsigned long long vaddr_nr_cpu_ids = 0xffffffffb3fb09c4;    /* $ grep " nr_cpu_ids" /proc/kallsyms */

int main(void)
{
        int fd = open("/proc/kcore", O_RDONLY);
        Elf *elf = elf_begin(fd, ELF_C_READ, NULL);

        Elf64_Ehdr ehdr;        /* ELF Executable Header */
        lseek64(fd, 0, SEEK_SET);
        read(fd, &ehdr, sizeof(ehdr));

        Elf64_Phdr phdr;        /* ELF Program Header(s) */
        for (int i = 0; i < ehdr.e_phnum; i++) {
                lseek64(fd, ehdr.e_phoff + (i * sizeof(phdr)), SEEK_SET);
                read(fd, &phdr, sizeof(phdr));
                if (phdr.p_vaddr == vaddr_text)
                        break;
        }

        int nr_cpu_ids;
        lseek64(fd, phdr.p_offset + vaddr_nr_cpu_ids - vaddr_text, SEEK_SET);
        read(fd, &nr_cpu_ids, sizeof(int));
        printf("%d\n", nr_cpu_ids);

        elf_end(elf);
        close(fd);
        return 0;
}
$ gcc -lelf -o nr_cpu_ids nr_cpu_ids.c
$ ./nr_cpu_ids
64

That said, I still grepped kallsyms for the addresses; I wonder how crash and drgn deal with KASLR.

References

  1. debugging - Linux Kernel symbol addresses don't match between /proc/kcore and /proc/kallsyms - Stack Overflow