Peilin Ye's blog

Understanding "invisible" /proc/[tid] subdirectories

Linux: 5.18-rc5, commit 1728c0567f70 ("net: phy: smsc: add LAN8742 phy support.")

ypl@home:~$ cat /proc/$$/stat | cut -d' ' -f2
(bash)
ypl@home:~$ cat /proc/self/stat | cut -d' ' -f2
(cat)
ypl@home:~$ cut -d' ' -f2 < /proc/self/stat
(cut)

This post briefly explains why ls doesn't show /proc/[tid] subdirectories for child threads.

procfs

According to man proc(5):

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc.

ypl@home:~$ ls /proc
1    177  34   45   58   686  acpi           irq             net
10   18   35   450  59   69   buddyinfo      kallsyms        pagetypeinfo
11   183  36   46   60   690  bus            kcore           partitions
114  184  37   466  61   7    cgroups        key-users       schedstat
116  19   38   469  615  70   cmdline        keys            self
...

See those numbers? Each of these so-called [pid] subdirectories corresponds to a process or, more precisely, to a thread group leader (TGL). However, ls /proc doesn't show [tid] subdirectories for the other threads in a group. For example, imagine an application with two threads:

ypl@home:~$ ls /proc/662/task
662  663

Here, 662 is the thread group leader, and 663 is a child thread. ls /proc only shows 662:

ypl@home:~$ ls /proc | grep 662
662
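The leader/child distinction can also be checked from user space. Here is a minimal sketch, assuming the Tgid: and Pid: lines that proc(5) documents for /proc/[pid]/status. The helper names status_field() and is_group_leader() are made up for this post, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read an integer field such as "Tgid:" from /proc/<id>/status.
 * `id` may be a number given as a string, or "self".
 * Returns the value, or -1 if the file or the field cannot be read.
 */
static long status_field(const char *id, const char *field)
{
	char path[64], line[256];
	size_t flen = strlen(field);
	long value = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/status", id);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Match at the start of the line, so "Pid:" never matches "PPid:". */
		if (strncmp(line, field, flen) == 0) {
			if (sscanf(line + flen, "%ld", &value) != 1)
				value = -1;
			break;
		}
	}
	fclose(f);
	return value;
}

/* 1 if <id> is a thread group leader, 0 if it is a child thread, -1 on error. */
static int is_group_leader(const char *id)
{
	long tgid = status_field(id, "Tgid:");
	long pid = status_field(id, "Pid:");

	if (tgid < 0 || pid < 0)
		return -1;
	return tgid == pid;
}
```

For the two-thread application above, is_group_leader("662") should report a leader and is_group_leader("663") a child thread, since both share Tgid 662.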

The 663 subdirectory is not shown, but somehow you can cd into it:

ypl@home:~$ ls /proc | grep 663
ypl@home:~$ cd /proc/663
ypl@home:/proc/663$ ls
arch_status      environ    mountinfo      personality   statm
attr             exe        mounts         projid_map    status
autogroup        fd         mountstats     root          syscall
...

It's there, just "invisible" to ls, as also documented in man proc(5):

The /proc/[tid] subdirectories are not visible when iterating through /proc with getdents(2) (and thus are not visible when one uses ls(1) to view the contents of /proc).

I found this behavior very interesting. How is it implemented?
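Before digging into the kernel, the behavior is easy to reproduce without hunting for a multi-threaded application. The sketch below is my own illustration, not code from any existing tool: it spawns one helper thread, asks for its TID with the gettid system call, and then checks whether /proc/[tid] is listed by readdir(3) and whether it is reachable by path. All function names are hypothetical, and it assumes /proc is mounted and the kernel is unpatched.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* State shared between the calling thread and the helper thread. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static pid_t child_tid;	/* 0 until the helper has reported its TID */
static bool done;	/* set once the helper may exit */

static void *helper(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	child_tid = (pid_t)syscall(SYS_gettid);
	pthread_cond_broadcast(&cond);
	while (!done)		/* stay alive while /proc is inspected */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* True if iterating over /proc yields an entry called `name`. */
static bool listed_in_proc(const char *name)
{
	DIR *dir = opendir("/proc");
	struct dirent *de;
	bool found = false;

	if (!dir)
		return false;
	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, name) == 0) {
			found = true;
			break;
		}
	}
	closedir(dir);
	return found;
}

/* True if /proc/<name> can be looked up by path and is a directory. */
static bool reachable_in_proc(const char *name)
{
	char path[64];
	struct stat st;

	snprintf(path, sizeof(path), "/proc/%s", name);
	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Spawn a helper thread and probe its /proc/[tid]. Returns 0, or -1 on failure. */
static int probe_child_tid(bool *listed, bool *reachable)
{
	pthread_t thread;
	char name[32];

	child_tid = 0;
	done = false;
	if (pthread_create(&thread, NULL, helper, NULL) != 0)
		return -1;

	pthread_mutex_lock(&lock);
	while (child_tid == 0)	/* wait for the helper to report its TID */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	snprintf(name, sizeof(name), "%d", (int)child_tid);
	*listed = listed_in_proc(name);
	*reachable = reachable_in_proc(name);

	pthread_mutex_lock(&lock);
	done = true;		/* let the helper exit */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(thread, NULL);
	return 0;
}
```

Call probe_child_tid() from a small main() and print the two flags. On an unpatched kernel I would expect the TID to be reachable but not listed. Depending on the glibc version, you may need to build with -pthread.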

TL;DR

(disclaimer: for recreational purposes only! :-)

Apply this to your kernel:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c1031843cc6a..579ee323b797 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3420,7 +3420,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
        pid = find_ge_pid(iter.tgid, ns);
        if (pid) {
                iter.tgid = pid_nr_ns(pid, ns);
-               iter.task = pid_task(pid, PIDTYPE_TGID);
+               iter.task = pid_task(pid, PIDTYPE_PID);
                if (!iter.task) {
                        iter.tgid += 1;
                        goto retry;

Now ls /proc shows both [pid] and [tid] directories. Yay!

ypl@home:~$ ls /proc/662/task
662  663
ypl@home:~$ ls /proc | grep 662
662
ypl@home:~$ ls /proc | grep 663
663

It will probably break a lot of tools that rely on procfs, though.

Walk-through

(My) ls uses the getdents64(2) system call to read directory entries from /proc:

stat("/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat(AT_FDCWD, "/proc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
getdents64(3, /* 184 entries */, 32768) = 4808
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0
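The same sequence of calls can be issued by hand. The following is a rough sketch in the spirit of the example in the getdents(2) man page, calling the system call through syscall(2). The record layout follows struct linux_dirent64 as documented there; struct dirent64_rec, is_numeric() and count_entries() are names I made up for illustration.

```c
#define _GNU_SOURCE
#include <ctype.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Userspace copy of the record layout documented in getdents(2). */
struct dirent64_rec {
	uint64_t       d_ino;	/* inode number */
	int64_t        d_off;	/* offset to the next record */
	unsigned short d_reclen;	/* length of this record */
	unsigned char  d_type;	/* file type */
	char           d_name[];	/* NUL-terminated name */
};

/* True if `s` is a non-empty string of decimal digits, like a [pid] entry. */
static int is_numeric(const char *s)
{
	if (*s == '\0')
		return 0;
	for (; *s; s++)
		if (!isdigit((unsigned char)*s))
			return 0;
	return 1;
}

/*
 * Count the entries of directory `path` using raw getdents64 calls.
 * If numeric_only is non-zero, only all-digit names are counted.
 * Returns the count ("." and ".." included), or -1 on error.
 */
static long count_entries(const char *path, int numeric_only)
{
	_Alignas(8) char buf[32768];	/* same buffer size ls used above */
	long total = 0;
	int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (fd < 0)
		return -1;

	for (;;) {
		long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));

		if (n < 0) {
			close(fd);
			return -1;
		}
		if (n == 0)	/* end of directory, like the second call above */
			break;

		/* Walk the variable-length records packed into buf. */
		for (long off = 0; off < n; ) {
			struct dirent64_rec *d = (struct dirent64_rec *)(buf + off);

			if (!numeric_only || is_numeric(d->d_name))
				total++;
			off += d->d_reclen;
		}
	}
	close(fd);
	return total;
}
```

count_entries("/proc", 1) counts only the numeric subdirectories, which, as the rest of this walk-through shows, are the thread group leaders visible to the caller.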

getdents64(2) is defined in fs/readdir.c:

SYSCALL_DEFINE3(getdents64, unsigned int, fd,
		struct linux_dirent64 __user *, dirent, unsigned int, count)
{
	struct fd f;
	struct getdents_callback64 buf = {
		.ctx.actor = filldir64,
		.count = count,
		.current_dir = dirent
	};
	int error;

	f = fdget_pos(fd);
	if (!f.file)
		return -EBADF;

	error = iterate_dir(f.file, &buf.ctx);
...

It calls iterate_dir(), which first checks if /proc is actually a directory:

int iterate_dir(struct file *file, struct dir_context *ctx)
{
	struct inode *inode = file_inode(file);
	bool shared = false;
	int res = -ENOTDIR;
	if (file->f_op->iterate_shared)
		shared = true;
	else if (!file->f_op->iterate)
		goto out;
...

If neither .iterate_shared nor .iterate is implemented, iterate_dir() returns -ENOTDIR. In our case though, it then calls /proc's own .iterate_shared implementation, proc_root_readdir():

static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
	if (ctx->pos < FIRST_PROCESS_ENTRY) {
		int error = proc_readdir(file, ctx);
		if (unlikely(error <= 0))
			return error;
		ctx->pos = FIRST_PROCESS_ENTRY;
	}

	return proc_pid_readdir(file, ctx);
}

Here, proc_pid_readdir() uses next_tgid() to take care of those [pid] subdirectories in a loop:

...
	for (iter = next_tgid(ns, iter);
	     iter.task;
	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
		char name[10 + 1];
		unsigned int len;

		cond_resched();
		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
			continue;

		len = snprintf(name, sizeof(name), "%u", iter.tgid);
		ctx->pos = iter.tgid + TGID_OFFSET;
		if (!proc_fill_cache(file, ctx, name, len,
				     proc_pid_instantiate, iter.task, NULL)) {
			put_task_struct(iter.task);
			return 0;
		}
	}
...

Yep! This is where our TL;DR diff comes into play. Take another look at next_tgid():

...
retry:
	iter.task = NULL;
	pid = find_ge_pid(iter.tgid, ns);
	if (pid) {
		iter.tgid = pid_nr_ns(pid, ns);
		iter.task = pid_task(pid, PIDTYPE_TGID);
		if (!iter.task) {
			iter.tgid += 1;
			goto retry;
...

It skips pid unless pid_task() finds a task attached to it as PIDTYPE_TGID (thread group ID), which is only the case for a thread group leader. In other words, proc_pid_readdir() only reports thread group leaders. This is exactly why ls /proc doesn't show [tid] subdirectories!
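To make the effect of that one-word change concrete, here is a toy userspace model of the loop. It is a deliberately simplified sketch, not kernel code: the task table, the model_* names and the "pid equals tgid" leader test are all my own stand-ins for find_ge_pid(), pid_task() and the real struct pid bookkeeping.

```c
#include <stddef.h>

/* Stand-ins for the two pid types that matter here. */
enum model_pid_type { MODEL_PIDTYPE_PID, MODEL_PIDTYPE_TGID };

struct model_task {
	int pid;	/* this task's own ID (what user space calls the TID) */
	int tgid;	/* ID of its thread group leader */
};

/* Pretend system, sorted by pid: 662 leads a group with child thread 663. */
static const struct model_task model_tasks[] = {
	{ 1, 1 }, { 662, 662 }, { 663, 662 }, { 700, 700 },
};

/* Model of find_ge_pid(): the first task whose ID is >= nr, or NULL. */
static const struct model_task *model_find_ge_pid(int nr)
{
	size_t count = sizeof(model_tasks) / sizeof(model_tasks[0]);

	for (size_t i = 0; i < count; i++)
		if (model_tasks[i].pid >= nr)
			return &model_tasks[i];
	return NULL;
}

/*
 * Model of pid_task(): a PIDTYPE_PID lookup finds any task, while a
 * PIDTYPE_TGID lookup only finds a task that leads its thread group.
 */
static const struct model_task *
model_pid_task(const struct model_task *t, enum model_pid_type type)
{
	if (type == MODEL_PIDTYPE_TGID && t->pid != t->tgid)
		return NULL;
	return t;
}

/*
 * Model of the proc_pid_readdir() / next_tgid() loop. Stores the IDs that
 * would be listed in out[] (at most cap of them) and returns how many.
 */
static size_t model_list_proc(enum model_pid_type type, int *out, size_t cap)
{
	size_t n = 0;
	int nr = 0;

	while (n < cap) {
		const struct model_task *t = model_find_ge_pid(nr);

		if (!t)
			break;
		nr = t->pid + 1;	/* resume after this ID next time */
		if (!model_pid_task(t, type))
			continue;	/* the "goto retry" path: skip it */
		out[n++] = t->pid;
	}
	return n;
}
```

With MODEL_PIDTYPE_TGID the model lists 1, 662 and 700. With MODEL_PIDTYPE_PID it lists 663 as well, which mirrors what the TL;DR patch does to ls /proc.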

Appendix A: Call Tree

fs/readdir.c:SYSCALL_DEFINE3(getdents64)
              :iterate_dir()    /* file->f_op->iterate_shared() */
  fs/proc/root.c:proc_root_readdir()
    fs/proc/base.c:proc_pid_readdir()
                    :next_tgid()