The Curious Case of Thread Group Identifiers

Rodrigo Gomes

Tech Lead of Managed Service at SingleStore

At SingleStore, we are out to build awesome software and we’re always trying to solve hard problems. A few days ago, I uncovered a cool Linux mystery with some colleagues and fixed it. We thought sharing that experience might benefit others.

The scene of the crime
While developing an internal tool to get stack traces, we decided to use the `SYS_tgkill` Linux system call to send signals to specific threads. The tgkill syscall sends a signal to a specific thread based on its “thread group” identifier (tgid) and thread identifier (tid). We store the thread identifier for every thread, so that was simple to obtain, but the “thread group” identifier was a new concept to me.
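To make the call concrete, here is a minimal sketch of invoking tgkill through `syscall(2)`. The helper names `send_signal_to_thread` and `current_tid` are hypothetical and chosen for illustration; they are not taken from our internal tool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes syscall() in <unistd.h> on glibc */
#endif
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send `sig` to the thread `tid` that belongs to the
 * thread group `tgid`. Returns 0 on success, or -1 with errno set. */
static int send_signal_to_thread(pid_t tgid, pid_t tid, int sig)
{
    return (int)syscall(SYS_tgkill, tgid, tid, sig);
}

/* Hypothetical helper: kernel thread id of the calling thread. */
static pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

Passing 0 as the signal performs only the existence and permission checks, which is a convenient way to try the call without delivering anything.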

A Google search suggested that a simple way to get the tgid is to read it from the Linux pseudo-file /proc/self/status, which has some information about the process:

cat /proc/self/status
Name:   cat
State:  R (running)
Tgid:   26473
Ngid:   0
Pid:    26473
PPid:   26378
... <snip> ...

The first prototype of this internal tool used the `SYS_getpid` Linux system call to obtain the process identifier, find the correct status pseudo-file, and read directly from that file in a rudimentary way. The prototype assumed that the tgid was always in the third line.

This worked for most developers at SingleStore, but some developers were running environments with newer Linux distributions and it didn’t seem to work in those instances.
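A reader that depends on line position is sensitive to any change in the file's layout. One way to reduce that sensitivity, shown here only as a sketch, is to look a field up by its label. The function name `read_status_field` is hypothetical, and the 256-byte line buffer is an assumption that comfortably fits the short fields near the top of the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the line that starts with `label` (for example
 * "Tgid:") in the status file at `path` and parse the integer after it.
 * Returns 0 on success, -1 if the file cannot be opened or no line with
 * that label carries an integer. */
static int read_status_field(const char *path, const char *label, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[256]; /* assumed large enough for the fields of interest */
    size_t label_len = strlen(label);
    int rc = -1;

    while (fgets(line, sizeof line, f) != NULL) {
        /* Match on the label, not on the line number. */
        if (strncmp(line, label, label_len) == 0 &&
            sscanf(line + label_len, "%ld", out) == 1) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

For example, `read_status_field("/proc/self/status", "Tgid:", &tgid)` would fill `tgid` regardless of which line the field appears on.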

A recent Linux commit added a new field to /proc/self/status before Tgid, which broke the prototype:

commit 3e42979e65dace1f9268dd5440e5ab096b8dee59
Author: Richard W.M. Jones <rjones@redhat.com>
Date:   Fri May 20 17:00:05 2016 -0700
   procfs: expose umask in /proc/<PID>/status
... <snip> ...
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
       ngid = task_numa_group_id(p);
       cred = get_task_cred(p);
+       umask = get_task_umask(p);
+       if (umask >= 0)
+               seq_printf(m, "Umask:\t%#04o\n", umask);
+
       task_lock(p);
       if (p->files)
               max_fds = files_fdtable(p->files)->max_fds;
       task_unlock(p);
       rcu_read_unlock();
       seq_printf(m,
               "State:\t%s\n"
               "Tgid:\t%d\n"
               "Ngid:\t%d\n"

The motive
To stabilize the tool, we decided to learn more about thread group identifiers and find a more robust way to read them. My colleagues noticed that the tgid always seemed to match the pid of the process (that is, the thread id of the process's initial thread), so we started looking at the relationship between the two. Indeed, the Linux reference on thread groups states:

Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
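The statement above can be observed from user space. The following sketch records `getpid()` and the kernel thread id in the calling thread and in one newly created thread; the struct and function names are hypothetical, and POSIX threads are assumed to be available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes syscall() in <unistd.h> on glibc */
#endif
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Identifiers as seen from inside one thread. */
struct thread_ids {
    pid_t pid; /* getpid(): the TGID, shared by the whole thread group */
    pid_t tid; /* gettid: distinct for each thread */
};

/* Thread body: store the caller's identifiers in the struct passed in. */
static void *record_ids(void *arg)
{
    struct thread_ids *ids = arg;
    ids->pid = getpid();
    ids->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Hypothetical helper: record the identifiers of the calling thread and of
 * one freshly created thread. Returns 0 on success, -1 on failure. */
static int compare_thread_ids(struct thread_ids *caller,
                              struct thread_ids *worker)
{
    pthread_t t;

    record_ids(caller);
    if (pthread_create(&t, NULL, record_ids, worker) != 0)
        return -1;
    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

After a successful call, both structs should hold the same `pid` but different `tid` values. Some toolchains need `-pthread` when building this.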

We were left wondering, why did /proc/self/status report both Tgid and Pid?

The culprit
I looked at the implementation of /proc/self/status in fs/proc/array.c to understand the difference between Tgid and Pid:

tgid = task_tgid_nr_ns(p, ns);
... <snip> ...
seq_put_decimal_ull(m, "\nTgid:\t", tgid);
seq_put_decimal_ull(m, "\nNgid:\t", ngid);
seq_put_decimal_ull(m, "\nPid:\t", pid_nr_ns(pid, ns));

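Looking at the implementation of task_tgid_nr_ns inside kernel/pid.c, I saw that the Tgid is simply the pid of the thread group leader, so for the main thread that /proc/self/status describes, the Pid and the Tgid are in fact the same: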

pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
{
        return pid_nr_ns(task_tgid(tsk), ns);
}
EXPORT_SYMBOL(task_tgid_nr_ns);

task_tgid in include/linux/sched.h does exactly what I’d expect, merely reading the pid of the thread group leader:

static inline struct pid *task_tgid(struct task_struct *task)
{
        return task->group_leader->pids[PIDTYPE_PID].pid;
}

Elementary, my dear Watson
It turned out that we didn’t need to read the status pseudo-file at all, and could instead use getpid directly. That change made it work on all environments we tested, and simplified the code significantly.

At SingleStore I’ve had the chance to investigate little systems mysteries like this one, and also design and work on state-of-the-art systems. If that sounds like something you would enjoy, we are currently looking for engineers. Apply and join the team [http://www.singlestore.com/careers/jobs/](/careers/jobs/).

