Nix daemon can kill unrelated processes in containers including independent builders #667
Describe the bug
When a lix build finishes, the lix daemon tries to kill everything under the lix build user's UID.
But, when concurrently using a docker image where there are processes with that same UID, those processes will also get killed.
This is especially likely when using the official lix docker image, since it uses the same UID ranges for lix build users as the daemon.
This leads to an extremely confusing `failed due to signal 9 (Killed)` error.
Steps To Reproduce
Expected behavior
The builds inside the docker image shouldn't be killed
Suggestion
We could mitigate this by doing one of the following:
A principled fix seems trickier, but it would be good to do something to make this at least not so easy to trigger.
If you have a preferred fix/mitigation and it's not too difficult, then I'd be happy to give this a go.
nix --version output
nix (Lix, like Nix) 2.92.0
Additional context
See the CppNix issue here: https://github.com/NixOS/nix/issues/9142
This was originally discovered (after a great deal of strace-ing) because the GHC CI tends to run jobs in docker images on a NixOS host.
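As a rough way to check a host for this kind of collision without reaching for strace, the sketch below lists the processes owned by a given UID by scanning `/proc`. It assumes Linux with `/proc` mounted, and it assumes the containers share the host's UID space (no user-namespace remapping). `pidsOwnedBy` is a hypothetical helper, not something from Lix, and the ownership of `/proc/<pid>` is only an approximation of the permission check that `kill` performs.

```cpp
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper (not part of Lix): collect the PIDs whose /proc/<pid>
// directory is owned by `uid`. Run on the host, this approximates the set of
// processes that a kill(-1, SIGKILL) issued under that UID could reach.
std::vector<pid_t> pidsOwnedBy(uid_t uid)
{
    namespace fs = std::filesystem;
    std::vector<pid_t> pids;
    std::error_code ec;
    for (fs::directory_iterator it("/proc", ec), end; !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        // Only purely numeric directory names are processes.
        const bool numeric = !name.empty()
            && std::all_of(name.begin(), name.end(),
                           [](unsigned char c) { return std::isdigit(c) != 0; });
        if (!numeric) continue;

        struct stat st;
        // A process may exit between listing and stat; skip it in that case.
        if (stat(it->path().c_str(), &st) != 0) continue;
        if (st.st_uid == uid) pids.push_back(static_cast<pid_t>(std::stol(name)));
    }
    return pids;
}
```

Calling `pidsOwnedBy` on the host with one of the build users' UIDs while a container is running should show whether any container processes share that UID.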
IIRC this is caused by `kill(-1, SIGKILL)` (which is run as the build user's UID) in `killUser`:
if (kill(-1, SIGKILL) == 0) break;
If not that, it was caused by another call to `kill(-1, SIGKILL)` somewhere, but I definitely remember seeing that in my strace logs.

To add to the unprincipled workarounds:
- `auto-allocate-uids` (only on the host) will eliminate these collisions. See #387 for why it's experimental and what other caveats that comes with.
- `ids.uids.nixbld` in the NixOS config. You'll probably need to hack around NixOS's uid state management by removing the users from `/etc/passwd` and `/var/lib/nixos/uid-map` for this to take effect.

I've done a bit more digging into this. In particular, we call `killUser` here as part of sandbox cleanup.

I'm wondering why we are even trying to kill all processes belonging to the build user if PID namespaces are being used, since the documentation for PID namespaces says that when the main ("init") process is killed, the kernel will kill all the others, so this should be a no-op.
See: https://man7.org/linux/man-pages/man7/pid_namespaces.7.html#:~:text=If%20the%20%22init%22%20process%20of%20a%20PID%20namespace%20terminates%2C%20the%20kernel%0A%20%20%20%20%20%20%20terminates%20all%20of%20the%20processes%20in%20the%20namespace%20via%20a%20SIGKILL%0A%20%20%20%20%20%20%20signal.
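The man page's claim can be checked with a small experiment. The sketch below is not Lix code, and `initExitKillsNamespace` is a hypothetical name. It creates a PID namespace inside a fresh user namespace so that no root is needed, lets the namespace's "init" exit while a straggler process is still sleeping, and uses a pipe to observe whether the straggler goes away. It assumes Linux with unprivileged user namespaces enabled, and reports "unknown" when the namespaces cannot be created.

```cpp
#include <poll.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <optional>

// Hypothetical demo. Returns std::nullopt if the namespaces could not be set
// up, true if the straggler died once "init" exited, false if it was still
// alive after the timeout.
std::optional<bool> initExitKillsNamespace()
{
    int fds[2];
    if (pipe(fds) == -1) return std::nullopt;

    pid_t outer = fork();
    if (outer == -1) { close(fds[0]); close(fds[1]); return std::nullopt; }

    if (outer == 0) {
        close(fds[0]);
        // New user namespace plus new PID namespace; the next child forked
        // becomes PID 1 ("init") of the new PID namespace.
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) == -1) _exit(2);

        pid_t init = fork();
        if (init == -1) _exit(2);
        if (init == 0) {
            pid_t straggler = fork();
            if (straggler == 0) {
                // Holds the pipe's write end open for as long as it lives.
                sleep(30); // safety net so the demo cannot leak a process for long
                _exit(0);
            }
            // "init" exits right away, leaving the straggler behind.
            _exit(straggler == -1 ? 2 : 0);
        }

        close(fds[1]);
        int st = 0;
        if (waitpid(init, &st, 0) == -1) _exit(2);
        _exit(WIFEXITED(st) && WEXITSTATUS(st) == 0 ? 0 : 2);
    }

    close(fds[1]);
    int st = 0;
    bool ok = waitpid(outer, &st, 0) == outer && WIFEXITED(st) && WEXITSTATUS(st) == 0;
    std::optional<bool> result;
    if (ok) {
        // A hang-up on the read end means every holder of the write end is
        // gone, i.e. the straggler was terminated.
        struct pollfd pfd = {fds[0], POLLIN, 0};
        result = poll(&pfd, 1, 5000) > 0;
    }
    close(fds[0]);
    return result;
}
```

If the result is `true`, the kernel cleaned up the straggler on its own, which is the behaviour that would make an extra kill by UID redundant inside a PID namespace.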
I went through the git blame and some variant of this code has been around for 20 years. So, I think this is a holdover from before `nix` had support for using PID namespaces.

Recall that non-Linux platforms exist :) but certainly it can be nuked on Linux
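For readers who have not seen the pattern under discussion, here is a simplified sketch of what a `killUser`-style helper does when no PID namespace is available to do the cleanup. It is not the Lix source: `killUserSketch` is a hypothetical name and the error handling is reduced to a boolean. It also shows why the collision happens, since `kill(-1, ...)` is scoped only by UID and knows nothing about containers. Do not call it with a UID that owns processes you care about.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <csignal>

// Simplified sketch of a killUser-style helper; NOT the Lix implementation.
// Returns true if the helper child ran to completion, false otherwise.
bool killUserSketch(uid_t uid)
{
    // Refuse root outright: kill(-1, ...) as root would signal nearly everything.
    if (uid == 0) return false;

    pid_t child = fork();
    if (child == -1) return false;

    if (child == 0) {
        // Become the target user (needs privileges). Bail out before any
        // kill() if that fails, so we never signal under the wrong UID.
        if (setuid(uid) == -1) _exit(1);
        for (;;) {
            // pid -1 means "every process I am permitted to signal", which is
            // every process running under `uid`, whichever container it is in
            // (as long as the container shares the host's UID space).
            if (kill(-1, SIGKILL) == 0) break;
            if (errno == ESRCH || errno == EPERM) break; // nothing left to kill
            if (errno != EINTR) _exit(1);
        }
        _exit(0);
    }

    int status = 0;
    while (waitpid(child, &status, 0) == -1) {
        if (errno != EINTR) return false;
    }
    // On Linux the caller of kill(-1, ...) is not signalled itself, so a clean
    // exit is expected; other platforms may behave differently.
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The only scoping here is the `setuid` call, which is why giving the host's build users UIDs that no container uses (as in the workarounds above) avoids the problem.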
This issue was mentioned on Gerrit on the following CLs:
I'm going to keep this open since the merged change only fixes this when using Linux with sandboxes. This will still trigger if either of those is not true.
@nrabulinski is this fixed or did you mistakenly close this?
@raito this is not fixed I have no clue how my account closed this sowwy
@nrabulinski no worries, this ticket kept getting closed, so I think it's haunted :D
is @qyriad also haunted?
i swear to god, this is a meme
We are affected by https://codeberg.org/forgejo/forgejo/issues/4010 on this issue…
Ah that makes sense! I just assumed this was cursed because it is 666 off-by-one