A similar issue in Go, that I've encountered in real code: https://github.com/golang/go/issues/27505#issuecomment-71370...
In a nutshell, if you want to use the death signal, which is very handy and useful, you also need to lock an OS thread so that it can't be destroyed. Fortunately I'm only spawning one process so I don't need to jump through hoops, I can just dedicate a thread to it, but it would be inconvenient to want to spawn lots of processes that way.
Speaking more generally, a lot of things that I learned in the 200xs apply to "processes", and things I just osmosed over the years as applying to "processes", were changed to apply to "threads" over the decades and a lot of people have not noticed that, even now. Even though I know this, my mental model of what is associated to a thread and what is associated to a process is quite weak, since I've not yet needed to acquire a deep understanding. In general I would suggest to people that if you are dealing with this sort of system programming that you at least keep this general idea in your head so that the thought pops up that if you're having trouble, it may be related to your internal beliefs that things related to "processes" are actually related to "threads" and in fact just because you did something like set a UID or something somewhere in your code doesn't necessarily mean that that UID will be in effect somewhere else.
I may be mistaken, but I believe the bug still exists, but in a more esoteric manner; and a future change might cause the bug to exist again. The author might want to warn against usage of `tokio::task::block_in_place`, if the underlying issue can't be fixed.
The reason the current approach works is that it runs on tokio's worker threads, which last the lifetime of the tokio runtime. However, if `tokio::task::block_in_place` is called, the current worker thread is demoted to the blocking thread pool, and a new worker thread is spawned in its place.
There can be a situation when the stars align that:
1. Thread A spawns Process X.
2. N minutes/hours/days pass, and Thread A hits a section of code that calls `tokio::task::block_in_place`
3. Thread A goes into the blocking pool.
4. After some idle time, Thread A dies, prematurely killing Process X, causing the same bug again.
You can imagine that this would be much harder to reproduce and debug, because the thread lifetime would be completely divorced from when you spawned the process. It's actually pretty lucky that the author reached for `spawn_blocking` instead of `block_in_place`, since when benchmarking it's a bit more tempting to use `block_in_place`. Had they used `block_in_place`, this bug might have been much harder to catch.
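To make the distinction concrete, here's a minimal sketch (assuming tokio's multi-threaded runtime; the closure bodies are just placeholders) of the two calls being compared:

```rust
use std::time::Duration;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    // spawn_blocking: the closure runs on a blocking-pool thread, which tokio
    // may retire after its keep-alive timeout (10 seconds by default).
    let msg = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_millis(100));
        "ran on a blocking-pool thread"
    })
    .await
    .unwrap();
    println!("{msg}");

    // block_in_place: the *current worker thread* is handed over to run the
    // blocking closure, and a fresh worker thread takes its place. Anything
    // tied to this OS thread (e.g. PDEATHSIG-armed children) now shares the
    // fate of what has effectively become a blocking-pool thread.
    tokio::task::block_in_place(|| {
        std::thread::sleep(Duration::from_millis(100));
    });
}
```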
My knowledge isn't very good here, but I assumed since they're using the single thread executor, everything was being spawned on the main thread. The only time new (temporary) threads were created was when calling `spawn_blocking`. And the main thread can't be moved because it's part of the `main()` call stack? Maybe...
That's a very good point! But yeah, we use the single threaded runtime, so this shouldn't be a concern.
> It is called PR_SET_DEATHSIG, and we configure it when spawning tasks using the prctl syscall like this
PDEATHSIG was to my knowledge (85% confidence) created for the original Linux userspace pthreads implementation (LinuxThreads¹, before NPTL), back when threads were implemented via kernel processes (the kernel had no concept of threads yet). This is AFAIK also why it behaves oddly with regard to the later-added kernel-level threads. I have a flag for "don't use this, it's highly fragile" in my head but don't remember where that's from.
If the receiving side can be controlled, there's always the option of opening a pipe; if the other end dies that's always detectable. Doesn't work with arbitrary processes though (random other code won't care if some fd ≥3 is suddenly closed…)
¹ https://en.wikipedia.org/wiki/LinuxThreads
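For the cooperative case, here's a minimal sketch of that pipe trick (using the child's stdin as the lifeline; the shell command is just a stand-in for a cooperating child):

```rust
use std::process::{Command, Stdio};
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // The child inherits the read end of a pipe as its stdin; we keep the
    // write end for our entire lifetime. If this process dies for any reason
    // (including SIGKILL), the kernel closes the write end and the child sees
    // EOF, which it can treat as "my parent is gone".
    let child = Command::new("sh")
        .arg("-c")
        .arg("cat > /dev/null; echo 'parent went away, cleaning up'")
        .stdin(Stdio::piped())
        .spawn()?;
    let _lifeline = child.stdin; // hold the write end open until we exit

    std::thread::sleep(Duration::from_secs(1)); // stand-in for real work
    Ok(()) // dropping _lifeline on exit is what signals the child
}
```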
There's been no fundamental change in the kernel-level representation of pthreads: they are still clone()d processes, just with some sharing flags set differently that e.g. affect how PIDs work.
More precisely, distinguishing a process and a thread is a pointless overspecification. Unfortunately POSIX mandates it and glibc accepts it.
If you want to register per-thread signal handlers you're forced to step outside the bounds of glibc and pthreads which I think is quite unfortunate.
Digressing a little, but Glibc’s pthreads implementation is painful, because they don’t provide any public API to map a pthread_t to the kernel TID, except for the horrendously awful thread_db. Of course, for the current thread, you can just call gettid() - but if you want to map pthread_t to TID for another thread, the thread_db abomination is the only supported way. Bionic supplies a nice simple pthread_gettid_np() for this, macOS has that too (albeit sadly with an incompatible prototype).
Now, pthread_t is actually a pointer to an undocumented structure, and the TID is stored at a certain offset in it… so it is easy to pull the TID from there. Until some day the glibc developers change the layout of the structure and suddenly that code breaks.
There’s an entry in glibc’s bug tracker for this - https://sourceware.org/bugzilla/show_bug.cgi?id=27880 - but it doesn’t look like it will be implemented any time soon
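One workaround (a sketch using the libc crate; the naming scheme is made up) is to sidestep the pthread_t-to-TID mapping entirely by having each thread record its own kernel TID at startup:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Map from our own thread name to the kernel TID, filled in by each thread.
    let tids: Arc<Mutex<HashMap<String, libc::pid_t>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let tids = Arc::clone(&tids);
            thread::spawn(move || {
                // gettid() only works for the calling thread, so call it here.
                let tid = unsafe { libc::syscall(libc::SYS_gettid) } as libc::pid_t;
                tids.lock().unwrap().insert(format!("worker-{i}"), tid);
                // ... real work ...
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:?}", tids.lock().unwrap());
}
```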
Huh I never really thought about that before. Seems like a glaring oversight but then again do any POSIX APIs even involve threads? Which itself illustrates the absurdity because what modern OS doesn't support multiple scheduling entities per virtual address space? Or should I have said per thread group? What was a process supposed to be again? (And what was the point of the thing?)
Digressing the conversation further, I notice in the docs that CLONE_SIGHAND requires CLONE_VM and CLONE_THREAD requires CLONE_SIGHAND. Any idea if there's a technical reason for that or is it just POSIX constraints needlessly infecting the kernel?
It's particularly confusing that the TID is the real identifier but the documentation generally refers to scheduling entities as processes. So you use a TID to refer to a process and a PID to refer to a thread group ... right. Very straightforward.
> do any POSIX APIs even involve threads?
pthreads (POSIX threads) is itself a POSIX API
I guess one reason why it doesn’t have any TID concept, is although Linux nowadays uses 1:1 threading (one kernel thread per user-space thread), historically many Unix thread libraries were designed to use 1:N threading (a single kernel thread runs multiple user space threads) or M:N threading (a pool of kernel threads runs a pool of user space threads where the two pools differ in size). Plus, while Linux went with the model that processes and threads are basically two slightly different variants of the same thing, in other POSIX implementations they are completely distinct object types. Since pthreads are designed to support such a wide variety of implementation strategies, they can’t assume threads have any kernel-maintained unique ID, because in some of those implementation strategies there might not be one.
> I notice in the docs that CLONE_SIGHAND requires CLONE_VM
I think this is necessary? If it weren't, the child might load new code (dlopen or JIT) and then install a signal handler pointing to it. With CLONE_SIGHAND, it shares signal handlers with the parent. But without CLONE_VM, the memory mapping containing the new code wouldn't exist in the parent, meaning an instant segfault as soon as the signal is delivered.
> and CLONE_THREAD requires CLONE_SIGHAND. Any idea if there's a technical reason for that or is it just POSIX constraints needlessly infecting the kernel?
Well, this one is more POSIX (and historical Unix before it). Signals are primarily a process-level construct in Unix/POSIX, not thread-level, since back when signals were invented, threads hadn't been invented yet on Unix (PL/I running under OS/360 MVT already had multithreading, which it called 'multitasking', in 1968, whereas Unix development didn't start until 1969). Although we've now got per-thread signal masks and thread-directed signals, the actual handlers are still per-process.
Also, I think there is another reason for disallowing certain combinations of clone() flags: obscure combinations can expose bugs, possibly even security vulnerabilities; if there is no great demand for a specific combination, the kernel devs may conclude it is safest to disallow it.
Linux isn't the only kernel in the world. Posix needs to be kernel agnostic. Also need common abstractions to have unique names.
Can you actually substantiate your 85% confident claim? Because it doesn't ring the slightest bell here, and I don't see any mention of "deathsig" in glibc's LinuxThreads fork of Xavier's found @ https://ftp.gnu.org/gnu/libc/glibc-linuxthreads-2.5.tar.bz2
I used LinuxThreads back in the 90s, and its main problem ISTR was hijacking SIGUSR[12]. My interests back then involved demo programming using SVGAlib, and mixing LinuxThreads with SVGAlib was a mess due to both wanting to use SIGUSR1. Endless corrupt consoles...
> Can you actually substantiate your 85% confident claim?
I unfortunately can't, it was apparently added in 2.1.57, which would be somewhere around 1998~1999. I started working with Linux around 2001~2002, and this association of PDEATHSIG with LinuxThreads has at some point embedded itself into my brain… I can't reconstruct when or why. And I can't seem to find the specific patch that added PDEATHSIG, and can't find a versioned history of LinuxThreads either…
Probably best to treat my comment as "grandpa tells weird stories that may or may not be true" :'(
[… I'm not even that old T_T]
Best reference I can find is in MAINTAINERS:

    N: Richard E. Gooch
    E: rgooch@atnf.csiro.au
    D: parent process death signal to children
    D: prctl() syscall
    S: CSIRO Australia Telescope National Facility
    S: P.O. Box 76, Epping
    S: N.S.W., 2121
    S: Australia
[ed.] wait! https://man7.org/conf/piter2019/once_upon_an_API-Linux-Piter...: "(Of course, there was no explanation of why the feature was needed)"
If you want to be able to spawn processes that fast then `fork()` is NOT your friend. You want either `vfork()` (or its `clone()` equivalent) or `posix_spawn()`.
`fork()` is inherently very slow due to the need to either copy the VM of the parent, arrange to copy pages on write, or copy the resident set of the parent (and then copy further pages as page-in events happen) -- all three of these options are very expensive.
Also, what I might recommend here is to create a `posix_spawn()`-like API that gives you asynchronous notification of the exec starting or failing, so that you don't block even for that. I'd use a `pipe()` whose write end is set to close on exec, and which will therefore close when the exec starts; if the exec fails, I'd write the `errno` value into the pipe. That way, EOF on the pipe implies the exec started, while a successful read from the pipe implies the exec failed, and you can read the error number out of it.
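A rough sketch of that close-on-exec pipe protocol (using the libc crate and a plain fork for brevity; `/bin/true` is an arbitrary target and error handling is minimal):

```rust
use std::ffi::CString;

fn main() {
    let prog = CString::new("/bin/true").unwrap();
    let argv = [prog.as_ptr(), std::ptr::null()];

    unsafe {
        let mut fds = [0i32; 2];
        assert_eq!(libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC), 0);
        let (read_fd, write_fd) = (fds[0], fds[1]);

        let pid = libc::fork();
        if pid == 0 {
            // Child: the write end closes by itself if exec succeeds (O_CLOEXEC).
            libc::close(read_fd);
            libc::execv(prog.as_ptr(), argv.as_ptr());
            // Only reached if exec failed: report errno to the parent.
            let err: i32 = *libc::__errno_location();
            libc::write(
                write_fd,
                &err as *const i32 as *const libc::c_void,
                std::mem::size_of::<i32>(),
            );
            libc::_exit(127);
        }

        // Parent: EOF => exec started; an i32 payload => errno of the failure.
        libc::close(write_fd);
        let mut err: i32 = 0;
        let n = libc::read(
            read_fd,
            &mut err as *mut i32 as *mut libc::c_void,
            std::mem::size_of::<i32>(),
        );
        libc::close(read_fd);
        if n == 0 {
            println!("exec started");
        } else if n > 0 {
            println!("exec failed, errno = {err}");
        }
        libc::waitpid(pid, std::ptr::null_mut(), 0);
    }
}
```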
The stdlib already mostly does all of that :)
Check out https://kobzol.github.io/rust/2024/01/28/process-spawning-pe....
Copying the page table isn't free but it isn't particularly expensive either. At least unless the parent is a real behemoth.
Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)
The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?
CoW is extremely expensive for threaded processes on multi-processor systems since you need TLB shootdowns.
> The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?
Yes, so have the `vfork()` happen in worker threads.
> Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)
I have proof that `vfork()` can be used safely: the several `posix_spawn()` implementations that use it.
Seeing the first mention of 10 seconds, I thought (jokingly) - Why not grep the source for the 10 second value.
Jokingly, because I thought it would be an emergent property, not a literal value.
Turns out it was a literal value after all and grepping would have helped!
There was no mention of a ten second interval anywhere in the code, only in the tests they wrote while debugging.
As mentioned in the article, there is a 10 second value in Tokio - the default thread timeout.
Normally I'd stay away from the POSIX job control APIs - but since HyperQueue is a job control system, it might be appropriate if the worker were a session leader. If it dies, then all its subprocesses would receive SIGHUP - which is fatal by default.
Generally you'd use this functionality to implement something like sshd or an interactive shell. HQ seems roughly analogous.
https://notes.shichao.io/apue/ch9/#sessions
I do use setsid when spawning the children (I omitted it from the post, but I set it in the same pre_exec call where I configure DEATHSIG) but they don't receive any signal, IIRC. Or if they do, it does not seem to be propagated to their children.
Yeah I suspect there may be a solution involving setsid
> This makes sense, of course, because there’s not exactly an asynchronous version of fork.
IORING_OP_CLONE and IORING_OP_EXEC have been proposed: https://lwn.net/Articles/1002371/
I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
They would need one special "main thread" that deals with reaping and that isn't subject to tokio's runtime cleaning it up, but presumably they already have that, or else the fix they did apply wouldn't have worked.
Alternatively, if they want they could integrate with systemd, even just by wrapping the children all in 'systemd-run', which would reliably allow cleaning up of children (via cgroups).
> PR_SET_CHILD_SUBREAPER
I wrote a tool that does just this: https://github.com/timmmm/anakin
If you run `anakin <some command>` it will kill any orphan processes that <some command> makes.
However, it still isn't the true "orphans of this process must automatically die" option that everyone writing job control software wants - if `anakin` itself somehow crashes then the orphans can live on.
Still it was the best I could come up with that didn't need root.
The name of the tool is on point.
> I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants
As far as I’m aware, PR_SET_CHILD_SUBREAPER doesn’t do anything if you die. Assuming you yourself don’t crash, it can be used to help clean up orphaned descendant processes, by ensuring they reparent to you instead of init; but in the event you do crash, it doesn’t do anything to help.
PID namespaces do exactly what you want - if their init process dies it automatically kills all its descendants. However, they require privilege - unless you use an unprivileged user namespace - but those are frequently disabled, and even when enabled, using them potentially introduces a whole host of other issues
> Alternatively, if they want they could integrate with systemd
The problem is a lot of code runs in environments without systemd - e.g. code running in containers (Docker, K8s, etc.); most containers don't contain systemd. So any systemd-centric solution is only going to work for some people.
Really, it would be great if Linux added some new process grouping construct which included the “kill all members of this group if its leader dies” semantic of PID namespaces without any of its other semantics. It is those other semantics (especially the new PID number semantics) which are the primary source of the security concerns, so a construct which offered only the “kill-if-leader-dies” semantic should be safe to allow for unprivileged access. (The one complexity is setuid/setgid/file capabilities - allowing an unprivileged process to effectively kill a privileged process at an arbitrary point in its execution is a security risk. Plausible solutions include refusing to execute any setuid/setgid/caps executable, or else allowing them to run but removing the process from this grouping when it executes one.)
> PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants
It indirectly does: unless you unset it, the child dying will trigger another round of PDEATHSIG on the grandchildren, and so on. (The setting is retained across forks, as shown in the original article.)
It is sadly not propagated to grandchildren.
I tried the subreaper approach, but it doesn't help. The children are reparented to the worker, but when the worker dies, they are then just reparented to init, as usual.
You also need to specifically have the subreaper process call the "wait" syscall, and wait for all children, otherwise of course they'll end up reparented to init.
If you want to write a process manager, one of the process manager's responsibilities is waiting on its children.
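A rough sketch of that pattern (libc crate; note that nothing here kills descendants when the manager dies, it only lets the manager observe their exits):

```rust
fn main() {
    unsafe {
        // Mark this process as a subreaper: orphaned descendants get
        // reparented to us instead of to init.
        assert_eq!(libc::prctl(libc::PR_SET_CHILD_SUBREAPER, 1 as libc::c_ulong), 0);
    }

    // ... spawn workers / child processes here ...

    // Reap everything that gets reparented to us, as a process manager should.
    loop {
        let mut status: libc::c_int = 0;
        let pid = unsafe { libc::waitpid(-1, &mut status, 0) };
        if pid == -1 {
            break; // no children left to wait for (ECHILD)
        }
        println!("reaped descendant {pid}, status {status}");
    }
}
```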
Just a nitpick: They don’t get reparented to init regardless of whether you call wait or not, so long as the parent process exists. They’ll be in a zombie state waiting to be reaped via a parent call to wait. Only if the parent dies/exits without reaping will they be reparented to init.
> The setting is retained across forks, as shown in the original article
That’s not what the man page says:
> The parent-death signal setting is cleared for the child of a fork(2).
https://man7.org/linux/man-pages/man2/pr_set_pdeathsig.2cons...
Unless the man page is wrong?
I wonder if this is difference between libc fork (which calls clone syscall) and kernel fork syscall.
No, it isn’t. Neither glibc fork nor kernel fork syscall provide any special handling for PDEATHSIG beyond what clone syscall does.
Yeah, I misremembered/misread and didn't check. Bleh.
(The article sets it after forking.)
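For reference, a minimal sketch of that order of operations (pre_exec runs in the child, after fork and before exec, so the "cleared on fork" behaviour doesn't get in the way; SIGTERM here is an arbitrary choice):

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

fn spawn_with_pdeathsig(program: &str) -> std::io::Result<std::process::Child> {
    let mut cmd = Command::new(program);
    unsafe {
        cmd.pre_exec(|| {
            // Deliver SIGTERM to this child when the thread that spawned it dies.
            if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGTERM as libc::c_ulong) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}
```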
> when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status
Seems like you don’t need a dedicated “always alive” thread if the signal is delivered to the process: tokio automatically does masking for threads, so you register for signals using its asynchronous mechanisms and don’t have issues around signal safety, which it abstracts away for you (i.e. as long as you’re handling the SIGCHLD signal somewhere, or even just ignoring it, as I don’t think they actually care?).
That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grandchildren to be killed when the reaper process dies, which is the effect they’re looking for here (not the reverse, where you reap forked children as they die). So you may need to spawn a dedicated reaper process rather than a thread to manage the lifetime of children, which is much more complicated.
> That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grandchildren to be killed when the reaper process dies
CHILD_SUBREAPER kills neither children nor grandchildren. Its effect is in the other direction, intended for sub-service-managers that want to keep track of all their children. If the subreaper dies, children are reparented to the next subreaper up (or init).
Yeah, I was assuming they have something calling `wait` somewhere since they say "HyperQueue is essentially a process manager", and to me "process manager" implies pretty strongly "spawns and waits for processes".
> In particular, it is not always possible for HQ to ensure that when a process that spawns tasks (called worker) quits unexpectedly (e.g. when it receives SIGKILL), its spawned tasks will be cleaned up.
I have no clue about the HyperQueue architecture, but is this really a problem? Normally, if the main process of a service exits, then systemd will clean up the children. Why does HyperQueue need to implement its own cleanup?
Although we will need to add one additional unsafe block once we migrate to the 2024 edition because we use std::env::set_var in main :laughing:
This is rightfully unsafe and should not be used without checking. It's just undefined behavior when using threads, as it's not thread-safe. https://ttimo.typepad.com/blog/2024/11/the-steam-client-upda...
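For reference, under the 2024 edition the call site has to spell that hazard out explicitly (a sketch; the variable name is arbitrary):

```rust
fn main() {
    // SAFETY: we set the variable before spawning any threads, so nothing can
    // be concurrently reading the environment (e.g. via getenv in C code).
    unsafe {
        std::env::set_var("RUST_LOG", "debug");
    }
    // ... start the async runtime / spawn threads only after this point ...
}
```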
> I don’t know how to tell it to only send the signal when the parent process (not thread) dies
What about an `atexit` handler and maintaining a table of child processes to kill? If you need something more robust in the face of adverse termination you could instead spawn an independent process, call `wait` on the primary process, and then handle any remaining cleanup.
Presumably the author would like something that's handled by the kernel and not user space, so even a SIGKILL-ed parent process would trigger the reaping of the children.
That would be a pid namespace but was passed over for compatibility reasons I guess.
Addressing a SIGKILL'd parent specifically, what about daemonizing the cleanup process but instead of fork use clone with the CLONE_VM flag?
> Edit: Someone on Reddit sent me a link to a method that can override the thread keep-alive duration. Its description makes it clear why the tasks were failing after exactly 10 seconds
> Yeah, testing if a task can run for 20 seconds isn’t great, but hey, at least it’s something
Well a reasonable thing to me is then to use the override within the test to shorten it (e.g. to 1s & use a 2s timeout).
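Something along these lines, i.e. a test-only runtime builder (a sketch; the 1-second value is arbitrary):

```rust
use std::time::Duration;

// Shrink the blocking-pool keep-alive so the "thread retired while a spawned
// process was still tied to it" scenario reproduces in a couple of seconds
// instead of the default 10.
fn test_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_keep_alive(Duration::from_secs(1))
        .enable_all()
        .build()
        .expect("failed to build test runtime")
}
```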
Could be done, yeah, but 20s isn't that much, and I'd like to avoid adding more test-only magic environment variables to configure this (our end-to-end tests are in Python and they use HQ as a binary).
> In particular, it is not always possible for HQ to ensure that when a process that spawns tasks (called worker) quits unexpectedly (e.g. when it receives SIGKILL), its spawned tasks will be cleaned up. Sadly, Linux does not seem to provide any way of implementing perfect structured process management in user space. In other words, when a parent process dies, it is possible for its (grand)children to continue executing.
Uh, it does? It's called pid-namesp—
> There is a solution for this called PID namespaces,
I think maybe you've got the wrong idea about what "in user space" means? — processes running as root are still "in user space". The opposite of "user space" is "in the kernel".
> but it requires elevated privileges
I think that's only technically true. I believe you can unshare the PID namespace if you first unshare the user namespace — which causes the thing doing the unsharing of the user namespace to become "root" within that new namespace, and from there is permitted to unshare the pid namespace. I think: https://unix.stackexchange.com/a/672462/6013
I have no idea why that hoop has to be jumped through / I don't know what is being protected against by preventing unprivileged processes from making pid namespaces.
Whether or not that fits well with HQ's design … you'd have to be the judge of that.
There's also prctl(PR_SET_CHILD_SUBREAPER, ...)
Yeah by user space I just meant without root, sorry. HQ runs on supercomputers where the environment is heavily locked up, even Docker doesn't work. I think that PID namespaces aren't really possible, but I haven't tried it yet.
Subreaper doesn't help, because if the worker dies, the children aren't killed; even though they are children of the worker, they will just be reparented to init.
> I think that PID namespaces aren't really possible
Depends on the cluster. If they're using nix or guix then they presumably enabled user namespaces but a few years ago guix had an article about (generally shitty) workarounds for people running in environments where those were disabled.
Edit: Maybe you should have two code paths. A fast namespaced one and the slower old one as a fallback.
> Sadly, Linux does not seem to provide any way of implementing perfect structured process management in user space. In other words, when a parent process dies, it is possible for its (grand)children to continue executing. There is a solution for this called PID namespaces, but it requires elevated privileges, and also seems a bit too heavyweight for HyperQueue.
Yeah Linux process management is a bit of a shit show. I didn't know about this sigdeath thing though. That sounds maybe useful. Is it transitive though?
Async sucks, but threads suck more, but processes suck most
n:m preemptively scheduled green threads (goroutines, erlang processes) suck much less!
Leaving PDEATHSIG enabled would make it harder for me to sleep at night, but I understand why the alternatives probably aren't appealing. Seems like a future bug waiting to happen. At least the author knows what to expect now.
Good writeup of yet another bug different from all the other bugs.
The Linux kernel isn't really bothered by the difference between threads and processes. Threads are just processes that happen to share an address space, file descriptor table, and thread group ID (what most tools call a PID). I think there are some subtle things related to the thread group ID, but they're subtle. The rest is implemented in glibc.
The distinction isn't quite as subtle as you believe, it also shows up in e.g. file locks, AF_UNIX SO_PEERCRED, and with any process-directed signal.
As a matter of fact, the original implementation of POSIX threads for Linux was userspace based and had unfixable bugs and issues that necessitated introducing the concept of threads into the Linux kernel.
Are there any differences between threads and processes in how signals are handled?
I recently learned that aside from processes there are process groups, process sessions (setsid), process group and session leaders, trees have associated VT ownership data, systemd sessions (which seem to be inherited by the entire subtree and can't be purged), and possibly other layered metadata spaces that I haven't heard of yet.
And I feel like there's got to be some way to tag or associate custom metadata with processes, but I haven't found it yet.
I really wish there were an overview of all these things and how they interact with each other somewhere.
> Are there any differences between threads and processes in how signals are handled?
Yes. As signal(7) notes [0], Linux has both “process-directed signals” (which can be handled by any thread in a process), and “thread-directed signals” (which are targeted at a specific thread and only handled by that thread). For user-generated signals, the classification depends on which syscall you use (kill/rt_sigqueueinfo generate process-directed signals, tgkill/rt_tgsigqueueinfo generate thread-directed ones). For system-generated signals, it is up to the kernel code generating the signal to decide. So the same signal number can be thread-directed in some cases and process-directed in others.
> systemd sessions (which seem to be inherited by the entire subtree and can't be purged)
At a kernel level those are implemented with cgroups.
> I really wish there were an overview of all these things
Unfortunately I think Linux has grown a complex mess of different features in this area, all of which are full of complicated limitations and gotchas. Despite attempts to introduce orthogonality (e.g. with several different types of namespaces), the end result is still a long way from any ideal of orthogonality
[0] https://man7.org/linux/man-pages/man7/signal.7.html
Oh thanks! I was recently having `runuser -l` silently not do the session setup because of the systemd thing, so maybe there's a better way (than laundering it through a process launcher daemon in a separate tree) to handle that.
I forgot capabilities with another 5 layers (+) of different flags and applied differently to processes and files... (and then namespaces, etc)
> Are there any differences between threads and processes in how signals are handled?
Yes, absolutely, there are thread-directed and process-directed signals; for the latter a thread is chosen at random (more or less) to handle the signal.
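A tiny sketch of the two delivery paths both replies describe (libc crate; the signals are ignored here so the demo doesn't terminate itself):

```rust
fn main() {
    unsafe {
        // Ignore both signals so the default action (terminate) doesn't fire.
        libc::signal(libc::SIGUSR1, libc::SIG_IGN);
        libc::signal(libc::SIGUSR2, libc::SIG_IGN);

        let pid = libc::getpid();
        let tid = libc::syscall(libc::SYS_gettid) as libc::pid_t;

        // Process-directed: the kernel picks any thread that hasn't blocked it.
        libc::kill(pid, libc::SIGUSR1);

        // Thread-directed: targeted at exactly one TID within the thread group.
        libc::syscall(
            libc::SYS_tgkill,
            pid as libc::c_long,
            tid as libc::c_long,
            libc::SIGUSR2 as libc::c_long,
        );
    }
}
```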
> I really wish there were an overview of all these things and how they interact with eachother somewhere.
Also see `man 2 clone` and `man 7 cgroups`.