Nix daemon can kill unrelated processes in containers including independent builders #667

Open
opened 2025-02-12 11:02:10 +00:00 by teofilc · 13 comments

Describe the bug

When a lix build finishes, the lix daemon tries to kill everything under the lix build user's UID.
But, when concurrently using a docker image where there are processes with that same UID, those processes will also get killed.

This is especially likely when using the official lix docker image, since it uses the same UID ranges for lix build users as the daemon.

This leads to an extremely confusing failed due to signal 9 (Killed) error.

Steps To Reproduce

  1. Start running a long lix build in a lix docker image
  2. Run a quick lix build on the host
  3. Notice that the containerized build was killed with a "failed due to signal 9 (Killed)" error.

Expected behavior

The builds inside the docker image shouldn't be killed

Suggestion

We could mitigate this by doing one of the following:

  • shift over the build user group UIDs in order to avoid them overlapping with the default NixOS config. This wouldn't solve the issue but it would mitigate against it.
  • disable the build user group in the official docker image. I'm not sure if this is a good idea or not. The docker image has the sandbox disabled anyway

A principled fix seems trickier, but it would be good to do something to make this at least not so easy to trigger.

If you have a preferred fix or mitigation and it's not too difficult, I'd be happy to give this a go.

nix --version output

nix (Lix, like Nix) 2.92.0

Additional context

See NixCpp issue here: https://github.com/NixOS/nix/issues/9142

This was originally discovered (after a great deal of strace-ing) because the GHC CI tends to run jobs in docker images on a NixOS host.

Author

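IIRC this is caused by kill(-1, SIGKILL) (which is run as the build user's UID) in killUser (https://git.lix.systems/lix-project/lix/src/commit/a987d92bd0162cadb36b76f8dfd5ac2cbddc97fb/lix/libutil/processes.cc#L143):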

if (kill(-1, SIGKILL) == 0) break;

If not that, it was caused by another call to kill(-1, SIGKILL) somewhere, but I definitely remember seeing that in my strace logs
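
To make the failure mode concrete, here is a toy model (not Lix code) of which processes a kill(-1, SIGKILL) call reaches when it is issued under the build user's UID. The Proc struct, the function names, the PIDs and the UID 30001 are all made up for illustration, and the permission check is simplified to plain UID equality. The point is only that the kernel matches on the numeric UID and does not care whether a process lives in a container.

```cpp
#include <string>
#include <sys/types.h>
#include <vector>

// Toy model of a process as seen from the host's PID namespace.
struct Proc {
    pid_t pid;
    uid_t uid;
    std::string origin; // "host" or "container"; never consulted by the kernel
};

// Models the set of processes reached by kill(-1, sig) from an unprivileged
// caller: every visible process with the caller's UID, except PID 1 and the
// caller itself. (Simplification: real permission checks also involve the
// saved set-user-ID.)
std::vector<Proc> reachedByKillAll(const std::vector<Proc> & procs,
                                   pid_t callerPid, uid_t callerUid)
{
    std::vector<Proc> hit;
    for (const auto & p : procs) {
        if (p.pid == 1 || p.pid == callerPid) continue;
        if (p.uid == callerUid) hit.push_back(p);
    }
    return hit;
}

// Hypothetical snapshot: host and container share the numeric UID 30001.
std::vector<Proc> exampleSnapshot()
{
    return {
        {1, 0, "host"},
        {4100, 30001, "host"},      // helper that issues kill(-1, SIGKILL)
        {4101, 30001, "host"},      // leftover process from the host build
        {5200, 30001, "container"}, // unrelated build inside the container
        {5300, 1000, "host"},       // ordinary user process, different UID
    };
}
```

With this snapshot, a call like reachedByKillAll(exampleSnapshot(), 4100, 30001) selects both the leftover host process and the container's build, which matches the "signal 9 (Killed)" symptom described above.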

Member

To add to the unprincipled workarounds:

  • Using auto-allocate-uids (only on the host) will eliminate these collisions. See #387 for why it's experimental and what other caveats come with it.
  • In a similar vein, you should be able to change the base uid for nixbld users by setting ids.uids.nixbld in the NixOS config. You'll probably need to hack around NixOS's uid state management by removing the users from /etc/passwd and /var/lib/nixos/uid-map for this to take effect.
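
Both workarounds above come down to making the host's build-user UID range and the container's range disjoint. A minimal sketch of that check, assuming each side uses one contiguous block of UIDs; the UidRange type, the overlaps function and the example numbers are illustrative only and not taken from any real config:

```cpp
#include <sys/types.h>

// Half-open UID range [base, base + count). Hypothetical helper type.
struct UidRange {
    uid_t base;
    unsigned count;
};

// Two ranges collide if each one starts before the other one ends.
bool overlaps(const UidRange & a, const UidRange & b)
{
    return a.base < b.base + b.count && b.base < a.base + a.count;
}
```

For example, overlaps(UidRange{30001, 32}, UidRange{30001, 32}) is true, while shifting one side to UidRange{40001, 32} makes it false. As noted in the issue description, this only avoids collisions between the two sets of build users; any other container process that happens to run under a host build UID would still be hit.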
Author

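I've done a bit more digging into this. In particular, we call killUser as part of sandbox cleanup, here: https://git.lix.systems/lix-project/lix/src/commit/1077bc626e8dfc153524da40eddad46ef893d66e/lix/libstore/build/local-derivation-goal.cc#L150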

I'm wondering why we even try to kill all processes belonging to the build user when PID namespaces are in use. The documentation for PID namespaces says that when the "init" process of a namespace terminates, the kernel kills all the other processes in it, so this should be a no-op.

See: https://man7.org/linux/man-pages/man7/pid_namespaces.7.html#:~:text=If%20the%20%22init%22%20process%20of%20a%20PID%20namespace%20terminates%2C%20the%20kernel%0A%20%20%20%20%20%20%20terminates%20all%20of%20the%20processes%20in%20the%20namespace%20via%20a%20SIGKILL%0A%20%20%20%20%20%20%20signal.

I went through the git blame and some variant of this code has been around for 20 years. So, I think this is a holdover from before nix had support for using PID namespaces.

Owner

Recall that non-Linux platforms exist :) but certainly it can be nuked on Linux

Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/2576 (https://gerrit.lix.systems/c/lix/+/2576) ("Avoid lix daemon killing unrelated processes when using sandboxes under Linux")
Author

I'm going to keep this open since the merged change only fixes this when using Linux with sandboxes. This will still trigger if either of those is not true.
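
The remaining exposure described above can be written down as a small predicate. This is only a sketch of the condition as stated in this comment, not the logic of the merged change; BuildEnv and stillUsesUidWideKill are hypothetical names, and "sandboxed" is assumed to mean the build ran in its own PID namespace.

```cpp
// Hypothetical description of how a build was run; not a Lix type.
struct BuildEnv {
    bool isLinux;
    bool sandboxed; // assumed to imply a dedicated PID namespace for the build
};

// True when cleanup would still fall back to killing everything under the
// build user's UID, so unrelated processes sharing that UID remain at risk.
bool stillUsesUidWideKill(const BuildEnv & env)
{
    return !(env.isLinux && env.sandboxed);
}
```

Under these assumptions, only the Linux-with-sandbox case is covered; Linux without a sandbox and every non-Linux platform still take the UID-wide path.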

teofilc reopened this issue 2025-02-18 16:51:35 +00:00
jade closed this issue 2025-02-22 01:58:32 +00:00
teofilc reopened this issue 2025-03-10 15:57:17 +00:00
Owner

@nrabulinski is this fixed or did you mistakenly close this?

Member

@raito this is not fixed I have no clue how my account closed this sowwy

Owner

@nrabulinski no worries, this ticket kept getting closed, so I think it's haunted :D

Owner

is @qyriad also haunted?

Owner

i swear to god, this is a meme

raito reopened this issue 2025-06-09 11:57:08 +00:00
Owner

We are affected by https://codeberg.org/forgejo/forgejo/issues/4010 on this issue…

Author

Ah that makes sense! I just assumed this was cursed because it is 666 off-by-one

Reference: lix-project/lix#667