A similar issue in Go, that I've encountered in real code: https://github.com/golang/go/issues/27505#issuecomment-71370...
In a nutshell, if you want to use the death signal, which is very handy and useful, you also need to lock an OS thread so that it can't be destroyed. Fortunately I'm only spawning one process so I don't need to jump through hoops, I can just dedicate a thread to it, but it would be inconvenient to want to spawn lots of processes that way.
Speaking more generally, a lot of things that I learned in the 200xs apply to "processes", and things I just osmosed over the years as applying to "processes", were changed to apply to "threads" over the decades and a lot of people have not noticed that, even now. Even though I know this, my mental model of what is associated to a thread and what is associated to a process is quite weak, since I've not yet needed to acquire a deep understanding. In general I would suggest to people that if you are dealing with this sort of system programming that you at least keep this general idea in your head so that the thought pops up that if you're having trouble, it may be related to your internal beliefs that things related to "processes" are actually related to "threads" and in fact just because you did something like set a UID or something somewhere in your code doesn't necessarily mean that that UID will be in effect somewhere else.
I may be mistaken, but I believe the bug still exists, but in a more esoteric manner; and a future change might cause the bug to exist again. The author might want to warn against usage of `tokio::task::block_in_place`, if the underlying issue can't be fixed.
The reason the current approach works is that it runs on tokio's worker threads, which last the lifetime of the tokio runtime. However, if `tokio::task::block_in_place` is called, the current worker thread is demoted to the blocking thread pool, and a new worker thread is spawned in its place.
There can be a situation when the stars align that:
1. Thread A spawns Process X.
2. N minutes/hours/days pass, and Thread A hits a section of code that calls `tokio::task::block_in_place`
3. Thread A goes into the blocking pool.
4. After some idle time, Thread A dies, prematurely killing Process X, causing the same bug again.
You can imagine that this would be much harder to reproduce and debug, because the thread lifetime would be completely divorced from when you spawned the process. It's actually pretty lucky that the author reached for `spawn_blocking` instead of `block_in_place`, since when benchmarking it's a bit more tempting to use `block_in_place`. Had they used `block_in_place`, this bug might have been much harder to catch.
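To make the distinction concrete, here's a minimal sketch (assuming tokio's multi-threaded runtime; the closure bodies are just placeholders) of the two calls being compared:

```rust
use std::time::Duration;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    // spawn_blocking: the closure runs on a blocking-pool thread, which tokio
    // may retire after its keep-alive timeout (10 seconds by default).
    let msg = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_millis(100));
        "ran on a blocking-pool thread"
    })
    .await
    .unwrap();
    println!("{msg}");

    // block_in_place: the *current worker thread* is handed over to run the
    // blocking closure, and a fresh worker thread takes its place. Anything
    // tied to this OS thread (e.g. PDEATHSIG-armed children) now shares the
    // fate of what has effectively become a blocking-pool thread.
    tokio::task::block_in_place(|| {
        std::thread::sleep(Duration::from_millis(100));
    });
}
```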
My knowledge isn't very good here, but I assumed since they're using the single thread executor, everything was being spawned on the main thread. The only time new (temporary) threads were created was when calling `spawn_blocking`. And the main thread can't be moved because it's part of the `main()` call stack? Maybe...
That's a very good point! But yeah, we use the single threaded runtime, so this shouldn't be a concern.
> It is called PR_SET_DEATHSIG, and we configure it when spawning tasks using the prctl syscall like this
PDEATHSIG was to my knowledge (85% confidence) created for the original Linux userspace pthreads implementation (LinuxThreads¹, before NPTL), back when threads were implemented via kernel processes (the kernel had no concept of threads yet). This is AFAIK also why it behaves oddly with regard to the later-added kernel-level threads. I have a flag for "don't use this, it's highly fragile" in my head but don't remember where that's from.
If the receiving side can be controlled, there's always the option of opening a pipe; if the other end dies that's always detectable. Doesn't work with arbitrary processes though (random other code won't care if some fd ≥3 is suddenly closed…)
¹ https://en.wikipedia.org/wiki/LinuxThreads
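For the cooperative case, here's a minimal sketch of that pipe trick (using the child's stdin as the lifeline; the shell command is just a stand-in for a cooperating child):

```rust
use std::process::{Command, Stdio};
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // The child inherits the read end of a pipe as its stdin; we keep the
    // write end for our entire lifetime. If this process dies for any reason
    // (including SIGKILL), the kernel closes the write end and the child sees
    // EOF, which it can treat as "my parent is gone".
    let child = Command::new("sh")
        .arg("-c")
        .arg("cat > /dev/null; echo 'parent went away, cleaning up'")
        .stdin(Stdio::piped())
        .spawn()?;
    let _lifeline = child.stdin; // hold the write end open until we exit

    std::thread::sleep(Duration::from_secs(1)); // stand-in for real work
    Ok(()) // dropping _lifeline on exit is what signals the child
}
```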
There's been no fundamental change in the kernel-level representation of pthreads: they are still clone()d processes, just with some sharing flags set differently that e.g. affect how PIDs work.
More precisely, distinguishing a process and a thread is a pointless overspecification. Unfortunately POSIX mandates it and glibc accepts it.
If you want to register per-thread signal handlers you're forced to step outside the bounds of glibc and pthreads which I think is quite unfortunate.
Digressing a little, but Glibc’s pthreads implementation is painful, because they don’t provide any public API to map a pthread_t to the kernel TID, except for the horrendously awful thread_db. Of course, for the current thread, you can just call gettid() - but if you want to map pthread_t to TID for another thread, the thread_db abomination is the only supported way. Bionic supplies a nice simple pthread_gettid_np() for this, macOS has that too (albeit sadly with an incompatible prototype).
Now, pthread_t is actually a pointer to an undocumented structure, and the TID is stored at a certain offset in it… so it is easy to pull the TID from there. Until some day the glibc developers change the layout of the structure and suddenly that code breaks.
There’s an entry in glibc’s bug tracker for this - https://sourceware.org/bugzilla/show_bug.cgi?id=27880 - but it doesn’t look like it will be implemented any time soon
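One workaround (a sketch using the libc crate; the naming scheme is made up) is to sidestep the pthread_t-to-TID mapping entirely by having each thread record its own kernel TID at startup:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Map from our own thread name to the kernel TID, filled in by each thread.
    let tids: Arc<Mutex<HashMap<String, libc::pid_t>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let tids = Arc::clone(&tids);
            thread::spawn(move || {
                // gettid() only works for the calling thread, so call it here.
                let tid = unsafe { libc::syscall(libc::SYS_gettid) } as libc::pid_t;
                tids.lock().unwrap().insert(format!("worker-{i}"), tid);
                // ... real work ...
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:?}", tids.lock().unwrap());
}
```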
Huh I never really thought about that before. Seems like a glaring oversight but then again do any POSIX APIs even involve threads? Which itself illustrates the absurdity because what modern OS doesn't support multiple scheduling entities per virtual address space? Or should I have said per thread group? What was a process supposed to be again? (And what was the point of the thing?)
Digressing the conversation further, I notice in the docs that CLONE_SIGHAND requires CLONE_VM and CLONE_THREAD requires CLONE_SIGHAND. Any idea if there's a technical reason for that or is it just POSIX constraints needlessly infecting the kernel?
It's particularly confusing that the TID is the real identifier but the documentation generally refers to scheduling entities as processes. So you use a TID to refer to a process and a PID to refer to a thread group ... right. Very straightforward.
> do any POSIX APIs even involve threads?
pthreads (POSIX threads) is itself a POSIX API
I guess one reason why it doesn’t have any TID concept, is although Linux nowadays uses 1:1 threading (one kernel thread per user-space thread), historically many Unix thread libraries were designed to use 1:N threading (a single kernel thread runs multiple user space threads) or M:N threading (a pool of kernel threads runs a pool of user space threads where the two pools differ in size). Plus, while Linux went with the model that processes and threads are basically two slightly different variants of the same thing, in other POSIX implementations they are completely distinct object types. Since pthreads are designed to support such a wide variety of implementation strategies, they can’t assume threads have any kernel-maintained unique ID, because in some of those implementation strategies there might not be one.
> I notice in the docs that CLONE_SIGHAND requires CLONE_VM
I think this is necessary? If it weren't, the child might load new code (dlopen or JIT) and then install a signal handler pointing to it. With CLONE_SIGHAND, it shares signal handlers with the parent. But without CLONE_VM, the memory mapping containing the new code wouldn't exist in the parent, meaning an instant segfault as soon as the signal is delivered.
> and CLONE_THREAD requires CLONE_SIGHAND. Any idea if there's a technical reason for that or is it just POSIX constraints needlessly infecting the kernel?
Well, this one is more POSIX (and historical Unix before it). Signals are primarily a process-level construct in Unix/POSIX, not thread-level, since back when signals were invented, threads hadn't been invented yet on Unix (PL/I running under OS/360 MVT already had multithreading, which it called 'multitasking', in 1968, whereas Unix development didn't start until 1969). Although we've now got per-thread signal masks and thread-directed signals, the actual handlers are still per-process.
Also, I think there is another reason for disallowing certain combinations of clone() flags: obscure combinations can expose bugs, possibly even security vulnerabilities; if there is no great demand for a specific combination, the kernel devs may conclude it is safest to disallow it.
Linux isn't the only kernel in the world. Posix needs to be kernel agnostic. Also need common abstractions to have unique names.
Can you actually substantiate your 85% confident claim? Because it doesn't ring the slightest bell here, and I don't see any mention of "deathsig" in glibc's LinuxThreads fork of Xavier's found @ https://ftp.gnu.org/gnu/libc/glibc-linuxthreads-2.5.tar.bz2
I used LinuxThreads back in the 90s, and its main problem ISTR was hijacking SIGUSR[12]. My interests back then involved demo programming using SVGAlib, and mixing LinuxThreads with SVGAlib was a mess due to both wanting to use SIGUSR1. Endless corrupt consoles...
> Can you actually substantiate your 85% confident claim?
I unfortunately can't, it was apparently added in 2.1.57, which would be somewhere around 1998~1999. I started working with Linux around 2001~2002, and this association of PDEATHSIG with LinuxThreads has at some point embedded itself into my brain… I can't reconstruct when or why. And I can't seem to find the specific patch that added PDEATHSIG, and can't find a versioned history of LinuxThreads either…
Probably best to treat my comment as "grandpa tells weird stories that may or may not be true" :'(
[… I'm not even that old T_T]
Best reference I can find is in MAINTAINERS:

    N: Richard E. Gooch
    E: rgooch@atnf.csiro.au
    D: parent process death signal to children
    D: prctl() syscall
    S: CSIRO Australia Telescope National Facility
    S: P.O. Box 76, Epping
    S: N.S.W., 2121
    S: Australia
[ed.] wait! https://man7.org/conf/piter2019/once_upon_an_API-Linux-Piter...: "(Of course, there was no explanation of why the feature was needed)"
If you want to be able to spawn processes that fast then `fork()` is NOT your friend. You want either `vfork()` (or its `clone()` equivalent) or `posix_spawn()`.
`fork()` is inherently very slow due to the need to either copy the VM of the parent, arrange to copy pages on write, or copy the resident set of the parent (and then copy further pages as page-in events happen) -- all three of these options are very expensive.
Also, what I might recommend here is to create a `posix_spawn()`-like API that gives you asynchronous notification of the exec starting or failing, so that you don't block even for that. I'd use a `pipe()` whose write end is set to close on exec, and which will therefore close when the exec starts; if the exec fails, I'd write the `errno` value into the pipe. That way, EOF on the pipe implies the exec started, while a successful read from the pipe implies the exec failed, and you can read the error number out of it.
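A rough sketch of that close-on-exec pipe protocol (using the libc crate and a plain fork for brevity; `/bin/true` is an arbitrary target and error handling is minimal):

```rust
use std::ffi::CString;

fn main() {
    let prog = CString::new("/bin/true").unwrap();
    let argv = [prog.as_ptr(), std::ptr::null()];

    unsafe {
        let mut fds = [0i32; 2];
        assert_eq!(libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC), 0);
        let (read_fd, write_fd) = (fds[0], fds[1]);

        let pid = libc::fork();
        if pid == 0 {
            // Child: the write end closes by itself if exec succeeds (O_CLOEXEC).
            libc::close(read_fd);
            libc::execv(prog.as_ptr(), argv.as_ptr());
            // Only reached if exec failed: report errno to the parent.
            let err: i32 = *libc::__errno_location();
            libc::write(
                write_fd,
                &err as *const i32 as *const libc::c_void,
                std::mem::size_of::<i32>(),
            );
            libc::_exit(127);
        }

        // Parent: EOF => exec started; an i32 payload => errno of the failure.
        libc::close(write_fd);
        let mut err: i32 = 0;
        let n = libc::read(
            read_fd,
            &mut err as *mut i32 as *mut libc::c_void,
            std::mem::size_of::<i32>(),
        );
        libc::close(read_fd);
        if n == 0 {
            println!("exec started");
        } else if n > 0 {
            println!("exec failed, errno = {err}");
        }
        libc::waitpid(pid, std::ptr::null_mut(), 0);
    }
}
```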
The stdlib already mostly does all of that :)
Check out https://kobzol.github.io/rust/2024/01/28/process-spawning-pe....
Copying the page table isn't free but it isn't particularly expensive either. At least unless the parent is a real behemoth.
Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)
The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?
CoW is extremely expensive for threaded processes on multi-processor systems since you need TLB shootdowns.
> The async pipe idea sounds interesting but I'm not clear how it would work. It seems like you'd have to use vfork to implement it but vfork is blocking until you call exec so doesn't that defeat the purpose?
Yes, so have the `vfork()` happen in worker threads.
> Unless you enjoy footguns posix_spawn is probably a better idea than vfork. (Unless you actually need vfork of course.)
I have proof that `vfork()` can be used safely: the several `posix_spawn()` implementations that use it.
Seeing the first mention of 10 seconds, I thought (jokingly) - Why not grep the source for the 10 second value.
Jokingly, because I thought it would be an emergent property, not a literal value.
Turns out it was a literal value after all and grepping would have helped!
There was no mention of a ten second interval anywhere in the code, only in the tests they wrote while debugging.
As mentioned in the article, there is a 10 second value in Tokio - the default thread timeout.
Normally I'd stay away from the POSIX job control APIs - but since HyperQueue is a job control system, it might be appropriate if the worker were a session leader. If it dies, then all its subprocesses would receive SIGHUP - which is fatal by default.
Generally you'd use this functionality to implement something like sshd or an interactive shell. HQ seems roughly analogous.
https://notes.shichao.io/apue/ch9/#sessions
I do use setsid when spawning the children (I omitted it from the post, but I set it in the same pre_exec call where I configure DEATHSIG) but they don't receive any signal, IIRC. Or if they do, it does not seem to be propagated to their children.
Yeah I suspect there may be a solution involving setsid
> This makes sense, of course, because there’s not exactly an asynchronous version of fork.
IORING_OP_CLONE and IORING_OP_EXEC have been proposed: https://lwn.net/Articles/1002371/
I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
They would need one special "main thread" that deals with reaping and that isn't subject to tokio's runtime cleaning it up, but presumably they already have that, or else the fix they did apply wouldn't have worked.
Alternatively, if they want they could integrate with systemd, even just by wrapping the children all in 'systemd-run', which would reliably allow cleaning up of children (via cgroups).
> PR_SET_CHILD_SUBREAPER
I wrote a tool that does just this: https://github.com/timmmm/anakin
If you run `anakin <some command>` it will kill any orphan processes that <some command> makes.
However, it still isn't the true "orphans of this process must automatically die" option that everyone writing job control software wants - if `anakin` itself somehow crashes then the orphans can live on.
Still it was the best I could come up with that didn't need root.
The name of the tool is on point.
> I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants
As far as I’m aware, PR_SET_CHILD_SUBREAPER doesn’t do anything if you die. Assuming you yourself don’t crash, it can be used to help clean up orphaned descendant processes, by ensuring they reparent to you instead of init; but in the event you do crash, it doesn’t do anything to help.
PID namespaces do exactly what you want - if their init process dies it automatically kills all its descendants. However, they require privilege - unless you use an unprivileged user namespace - but those are frequently disabled, and even when enabled, using them potentially introduces a whole host of other issues
> Alternatively, if they want they could integrate with systemd
The problem is a lot of code runs in environments without systemd - e.g. code running in containers (Docker, K8s, etc.); most containers don't contain systemd. So any systemd-centric solution is only going to work for some people.
Really, it would be great if Linux added some new process grouping construct which included the “kill all members of this group if its leader dies” semantic of PID namespaces without any of its other semantics. It is those other semantics (especially the new PID number semantics) which are the primary source of the security concerns, so a construct which offered only the “kill-if-leader-dies” semantic should be safe to allow for unprivileged access. (The one complexity is setuid/setgid/file capabilities - allowing an unprivileged process to effectively kill a privileged process at an arbitrary point in its execution is a security risk. Plausible solutions include refusing to execute any setuid/setgid/caps executable, or else allowing them to run but removing the process from this grouping when it executes one.)
> PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants
It indirectly does: unless you unset it, the child dying will trigger another round of PDEATHSIG on the grandchildren, and so on. (The setting is retained across forks, as shown in the original article.)
It is sadly not propagated to grandchildren.
I tried the subreaper approach, but it doesn't help. The children are reparented to the worker, but when the worker dies, they are then just reparented to init, as usual.
You also need to specifically have the subreaper process call the "wait" syscall, and wait for all children, otherwise of course they'll end up reparented to init.
If you want to write a process manager, one of the process manager's responsibilities is waiting on its children.
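A rough sketch of that pattern (libc crate; note that nothing here kills descendants when the manager dies, it only lets the manager observe their exits):

```rust
fn main() {
    unsafe {
        // Mark this process as a subreaper: orphaned descendants get
        // reparented to us instead of to init.
        assert_eq!(libc::prctl(libc::PR_SET_CHILD_SUBREAPER, 1 as libc::c_ulong), 0);
    }

    // ... spawn workers / child processes here ...

    // Reap everything that gets reparented to us, as a process manager should.
    loop {
        let mut status: libc::c_int = 0;
        let pid = unsafe { libc::waitpid(-1, &mut status, 0) };
        if pid == -1 {
            break; // no children left to wait for (ECHILD)
        }
        println!("reaped descendant {pid}, status {status}");
    }
}
```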
Just a nitpick: They don’t get reparented to init regardless of whether you call wait or not, so long as the parent process exists. They’ll be in a zombie state waiting to be reaped via a parent call to wait. Only if the parent dies/exits without reaping will they be reparented to init.
> The setting is retained across forks, as shown in the original article
That’s not what the man page says:
> The parent-death signal setting is cleared for the child of a fork(2).
https://man7.org/linux/man-pages/man2/pr_set_pdeathsig.2cons...
Unless the man page is wrong?
I wonder if this is difference between libc fork (which calls clone syscall) and kernel fork syscall.
No, it isn’t. Neither glibc fork nor kernel fork syscall provide any special handling for PDEATHSIG beyond what clone syscall does.
Yeah, I misremembered/misread and didn't check. Bleh.
(The article sets it after forking.)
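For reference, a minimal sketch of that order of operations (pre_exec runs in the child, after fork and before exec, so the "cleared on fork" behaviour doesn't get in the way; SIGTERM here is an arbitrary choice):

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

fn spawn_with_pdeathsig(program: &str) -> std::io::Result<std::process::Child> {
    let mut cmd = Command::new(program);
    unsafe {
        cmd.pre_exec(|| {
            // Deliver SIGTERM to this child when the thread that spawned it dies.
            if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGTERM as libc::c_ulong) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}
```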
> when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status
Seems like you don’t need a dedicated “always alive” thread if the signal is delivered to the process: tokio automatically does masking for threads, so you register for signals using its asynchronous mechanisms and don’t have issues around signal safety, which it abstracts away for you (i.e. as long as you’re handling the SIGCHLD signal somewhere, or even just ignoring it, as I don’t think they actually care?).
That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grandchildren to be killed when the reaper process dies, which is the effect they’re looking for here (not the reverse, where you reap forked children as they die). So you may need to spawn a dedicated reaper process rather than a thread to manage the lifetime of children, which is much more complicated.
> That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grandchildren to be killed when the reaper process dies
CHILD_SUBREAPER kills neither children nor grandchildren. Its effect is in the other direction, intended for sub-service-managers that want to keep track of all their children. If the subreaper dies, children are reparented to the next subreaper up (or init).
Yeah, I was assuming they have something calling `wait` somewhere since they say "HyperQueue is essentially a process manager", and to me "process manager" implies pretty strongly "spawns and waits for processes".
> In particular, it is not always possible for HQ to ensure that when a process that spawns tasks (called worker) quits unexpectedly (e.g. when it receives SIGKILL), its spawned tasks will be cleaned up.
I have no clue about the HyperQueue architecture, but is this really a problem? Normally, if the main process of a service exits, then systemd will clean up the children. Why does HyperQueue need to implement its own cleanup?
Although we will need to add one additional unsafe block once we migrate to the 2024 edition because we use std::env::set_var in main :laughing:
This is rightfully unsafe and should not be used without checking. It's just undefined behavior when using threads, as it's not thread-safe. https://ttimo.typepad.com/blog/2024/11/the-steam-client-upda...
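For reference, under the 2024 edition the call site has to spell that hazard out explicitly (a sketch; the variable name is arbitrary):

```rust
fn main() {
    // SAFETY: we set the variable before spawning any threads, so nothing can
    // be concurrently reading the environment (e.g. via getenv in C code).
    unsafe {
        std::env::set_var("RUST_LOG", "debug");
    }
    // ... start the async runtime / spawn threads only after this point ...
}
```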
> I don’t know how to tell it to only send the signal when the parent process (not thread) dies
What about an `atexit` handler and maintaining a table of child processes to kill? If you need something more robust in the face of adverse termination you could instead spawn an independent process, call `wait` on the primary process, and then handle any remaining cleanup.
Presumably the author would like something that's handled by the kernel and not user space, so even a SIGKILL-ed parent process would trigger the reaping of the children.
That would be a pid namespace but was passed over for compatibility reasons I guess.
Addressing a SIGKILL'd parent specifically, what about daemonizing the cleanup process but instead of fork use clone with the CLONE_VM flag?
> Edit: Someone on Reddit sent me a link to a method that can override the thread keep-alive duration. Its description makes it clear why the tasks were failing after exactly 10 seconds
> Yeah, testing if a task can run for 20 seconds isn’t great, but hey, at least it’s something
Well a reasonable thing to me is then to use the override within the test to shorten it (e.g. to 1s & use a 2s timeout).
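Something along these lines, i.e. a test-only runtime builder (a sketch; the 1-second value is arbitrary):

```rust
use std::time::Duration;

// Shrink the blocking-pool keep-alive so the "thread retired while a spawned
// process was still tied to it" scenario reproduces in a couple of seconds
// instead of the default 10.
fn test_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_keep_alive(Duration::from_secs(1))
        .enable_all()
        .build()
        .expect("failed to build test runtime")
}
```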
Could be done, yeah, but 20s isn't that much, and I'd like to avoid adding more test-only magic environment variables to configure this (our end-to-end tests are in Python and they use HQ as a binary).
> In particular, it is not always possible for HQ to ensure that when a process that spawns tasks (called worker) quits unexpectedly (e.g. when it receives SIGKILL), its spawned tasks will be cleaned up. Sadly, Linux does not seem to provide any way of implementing perfect structured process management in user space. In other words, when a parent process dies, it is possible for its (grand)children to continue executing.
Uh, it does? It's called pid-namesp—
> There is a solution for this called PID namespaces,
I think maybe you've got the wrong idea about what "in user space" means? — processes running as root are still "in user space". The opposite of "user space" is "in the kernel".
> but it requires elevated privileges
I think that's only technically true. I believe you can unshare the PID namespace if you first unshare the user namespace — which causes the thing doing the unsharing of the user namespace to become "root" within that new namespace, and from there is permitted to unshare the pid namespace. I think: https://unix.stackexchange.com/a/672462/6013
I have no idea why that hoop has to be jumped through / I don't know what is being protected against by preventing unprivileged processes from making pid namespaces.
Whether or not that fits well with HQ's design … you'd have to be the judge of that.
There's also prctl(PR_SET_CHILD_SUBREAPER, ...)
Yeah by user space I just meant without root, sorry. HQ runs on supercomputers where the environment is heavily locked up, even Docker doesn't work. I think that PID namespaces aren't really possible, but I haven't tried it yet.
Subreaper doesn't help, because if the worker dies, the children aren't killed; even though they are children of the worker, they will just be reparented to init.
> I think that PID namespaces aren't really possible
Depends on the cluster. If they're using nix or guix then they presumably enabled user namespaces but a few years ago guix had an article about (generally shitty) workarounds for people running in environments where those were disabled.
Edit: Maybe you should have two code paths. A fast namespaced one and the slower old one as a fallback.
> Sadly, Linux does not seem to provide any way of implementing perfect structured process management in user space. In other words, when a parent process dies, it is possible for its (grand)children to continue executing. There is a solution for this called PID namespaces, but it requires elevated privileges, and also seems a bit too heavyweight for HyperQueue.
Yeah Linux process management is a bit of a shit show. I didn't know about this sigdeath thing though. That sounds maybe useful. Is it transitive though?
Async sucks, but threads suck more, but processes suck most
n:m preemptively scheduled green threads (goroutines, erlang processes) suck much less!
Leaving PDEATHSIG enabled would make it harder for me to sleep at night, but I understand why the alternatives probably aren't appealing. Seems like a future bug waiting to happen. At least the author knows what to expect now.
Good writeup of yet another bug different from all the other bugs.
The Linux kernel isn't really bothered by the difference between threads and processes. Threads are just processes that happen to share an address space, file descriptor table, and thread group ID (what most tools call a PID). I think there are some subtle things related to the thread group ID, but they're subtle. The rest is implemented in glibc.
The distinction isn't quite as subtle as you believe, it also shows up in e.g. file locks, AF_UNIX SO_PEERCRED, and with any process-directed signal.
As a matter of fact, the original implementation of POSIX threads for Linux was userspace based and had unfixable bugs and issues that necessitated introducing the concept of threads into the Linux kernel.
Are there any differences between threads and processes in how signals are handled?
I recently learned that aside from processes there are process groups, process sessions (setsid), process group and session leaders, trees have associated VT ownership data, systemd sessions (which seem to be inherited by the entire subtree and can't be purged), and possibly other layered metadata spaces that I haven't heard of yet.
And I feel like there's got to be some way to tag or associate custom metadata with processes, but I haven't found it yet.
I really wish there were an overview of all these things and how they interact with each other somewhere.
> Are there any differences between threads and processes in how signals are handled?
Yes. As signal(7) notes [0], Linux has both “process-directed signals” (which can be handled by any thread in a process), and “thread-directed signals” (which are targeted at a specific thread and only handled by that thread). For user-generated signals, the classification depends on which syscall you use (kill/rt_sigqueueinfo generate process-directed signals, tgkill/rt_tgsigqueueinfo generate thread-directed ones). For system-generated signals, it is up to the kernel code generating the signal to decide. So the same signal number can be thread-directed in some cases and process-directed in others.
> systemd sessions (which seem to be inherited by the entire subtree and can't be purged)
At a kernel level those are implemented with cgroups.
> I really wish there were an overview of all these things
Unfortunately I think Linux has grown a complex mess of different features in this area, all of which are full of complicated limitations and gotchas. Despite attempts to introduce orthogonality (e.g. with several different types of namespaces), the end result is still a long way from any ideal of orthogonality
[0] https://man7.org/linux/man-pages/man7/signal.7.html
Oh thanks! I was recently having `runuser -l` silently not do the session setup because of the systemd thing, so maybe there's a better way (than laundering it through a process launcher daemon in a separate tree) to handle that.
I forgot capabilities with another 5 layers (+) of different flags and applied differently to processes and files... (and then namespaces, etc)
> Are there any differences between threads and processes in how signals are handled?
Yes, absolutely, there are thread-directed and process-directed signals; for the latter a thread is chosen at random (more or less) to handle the signal.
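A tiny sketch of the two delivery paths both replies describe (libc crate; the signals are ignored here so the demo doesn't terminate itself):

```rust
fn main() {
    unsafe {
        // Ignore both signals so the default action (terminate) doesn't fire.
        libc::signal(libc::SIGUSR1, libc::SIG_IGN);
        libc::signal(libc::SIGUSR2, libc::SIG_IGN);

        let pid = libc::getpid();
        let tid = libc::syscall(libc::SYS_gettid) as libc::pid_t;

        // Process-directed: the kernel picks any thread that hasn't blocked it.
        libc::kill(pid, libc::SIGUSR1);

        // Thread-directed: targeted at exactly one TID within the thread group.
        libc::syscall(
            libc::SYS_tgkill,
            pid as libc::c_long,
            tid as libc::c_long,
            libc::SIGUSR2 as libc::c_long,
        );
    }
}
```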
> I really wish there were an overview of all these things and how they interact with eachother somewhere.
Also see `man 2 clone` and `man 7 cgroups`.