cgroups xp feature blocks daemon restarts when any child process is still running #1030

Open
opened 2025-11-07 06:55:20 +00:00 by r-vdp · 4 comments

Describe the bug

When rebuilding my system, I run into this issue when the lix daemon gets restarted:

nov 05 16:51:01 framework systemd[1]: Stopping Nix Daemon...
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Deactivated successfully.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Unit process 445241 (ssh) remains running after unit stopped.
nov 05 16:51:01 framework systemd[1]: Stopped Nix Daemon.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Consumed 8min 33.419s CPU time, 2.7G memory peak, 2.4G read from disk, 1.9G written to disk.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Found left-over process 445241 (ssh) in control group while starting unit. Ignoring.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Failed to spawn executor: Device or resource busy
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Failed to spawn 'start' task: Device or resource busy
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Failed with result 'resources'.
nov 05 16:51:01 framework systemd[1]: nix-daemon.service: Unit process 445241 (ssh) remains running after unit stopped.
nov 05 16:51:01 framework systemd[1]: Failed to start Nix Daemon.
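For anyone trying to confirm they are hitting the same failure, the left-over process can be pulled out of the journal text mechanically. This is only a sketch: the function name `find_leftovers` is made up for illustration, and the pattern is derived from the single log line quoted above, so other systemd versions may word the message differently.

```python
import re

# Pattern based on the "Found left-over process" line quoted in this report.
LEFTOVER_RE = re.compile(
    r"(\S+): Found left-over process (\d+) \(([^)]*)\) in control group"
)


def find_leftovers(journal_lines):
    """Return (unit, pid, comm) for every left-over process message."""
    hits = []
    for line in journal_lines:
        match = LEFTOVER_RE.search(line)
        if match:
            unit, pid, comm = match.groups()
            hits.append((unit, int(pid), comm))
    return hits
```

Feeding it the lines of the log excerpt above should report the ssh process with PID 445241 for nix-daemon.service.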

When this happens, there is a single SSH process left behind in the supervisor cgroup:

➜ command cat /sys/fs/cgroup/system.slice/nix-daemon.service/supervisor/cgroup.procs
445241

➜ command cat /proc/445241/comm
ssh

This process is most likely a connection to a remote builder that is kept alive by my SSH multiplexing configuration. Killing that process manually and starting nix-daemon.socket again fixes the issue and allows the daemon to start.
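The two `cat` commands above can be generalized to walk the whole cgroup subtree of the unit. The following is a minimal sketch, not anything Lix or systemd ships: `leftover_processes` and its `proc_root` parameter are hypothetical names, and it assumes a cgroup v2 layout where every cgroup directory exposes a `cgroup.procs` file.

```python
from pathlib import Path


def leftover_processes(unit_cgroup, proc_root="/proc"):
    """Return (pid, comm, cgroup_dir) for every process in the cgroup tree."""
    found = []
    # rglob also matches cgroup.procs in the top-level directory itself.
    for procs_file in sorted(Path(unit_cgroup).rglob("cgroup.procs")):
        for entry in procs_file.read_text().split():
            pid = int(entry)
            try:
                comm = (Path(proc_root) / str(pid) / "comm").read_text().strip()
            except OSError:
                comm = "?"  # the process exited between the two reads
            found.append((pid, comm, str(procs_file.parent)))
    return found
```

Calling it with `/sys/fs/cgroup/system.slice/nix-daemon.service` (the path from the commands above) lists every process that is still attached to the unit, including ones in sub-cgroups such as `supervisor`.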

Steps To Reproduce

  1. Configure SSH with ControlPersist 60m for connections made by the root user
  2. Run a nixos-rebuild that rebuilds lix and uses remote builders to build (I'm using ssh-ng)
  3. See error upon switching configurations

I have configured SSH with this snippet in /etc/ssh/ssh_config:

Match localuser root
  ControlMaster auto
  ControlPath %d/.%C
  ControlPersist 60m

Expected behavior

Lix should properly tear down its child processes so that it can restart without such issues.

nix --version output

nix (Lix, like Nix) 2.94.0-pre20251106-dev_f00d720
System type: x86_64-linux
Additional system types: i686-linux, x86_64-v1-linux, x86_64-v2-linux, x86_64-v3-linux, x86_64-v4-linux
Features: gc, signed-caches
System configuration file: /etc/nix/nix.conf
User configuration files: /home/ramses/.config/nix/nix.conf:/etc/xdg/nix/nix.conf:/home/ramses/.nix-profile/etc/xdg/nix/nix.conf:/nix/profile/etc/xdg/nix/nix.conf:/home/ramses/.local/state/nix/profile/etc/xdg/nix/nix.conf:/etc/profiles/per-user/ramses/etc/xdg/nix/nix.conf:/nix/var/nix/profiles/default/etc/xdg/nix/nix.conf:/run/current-system/sw/etc/xdg/nix/nix.conf
Store directory: /nix/store
State directory: /nix/var/nix
Data directory: /nix/store/2kmnn2zs98qrfwfkfg6kln77i82d9kwv-lix-2.94.0-pre20251106-dev_f00d720/share
Owner

that's (unfortunately) working exactly as it should: lix never explicitly requests multiplexing and as such can never explicitly stop multiplexing without disturbing the rest of the system. ssh decides on its own to daemonize a process lix doesn't know about, and the cgroups capture that process (as they should). even if lix did explicitly request multiplexing it couldn't tear down the multiplexers because restarting a daemon does not kill running builds, and those may still need the multiplexer process in question. lix could at best attempt to move the multiplexer process it has inadvertently spawned out of its own cgroup, but where to? the daemon may be running in single-user mode where the cgroup hierarchy is different, it may even be running in a configuration where moving the mux process is entirely impossible.

at this point the only safe way to use ssh muxing with lix is to set the persist timeout as high as possible and to ensure that the mux process always runs outside of the lix cgroup, either by running an outside service that keeps the mux alive or by somehow socket-activating the mux connection :(

Author

But if builds are kept running when the daemon exits, wouldn't those also prevent the daemon from being restarted? How would that be any different from the issue that I'm seeing?

This issue only started happening recently; before that, this configuration didn't cause any problems.

Owner

good grief, we didn't fully realize what's happening here until we dug into the systemd source code. what systemd is complaining about is that the sub-cgroup has controllers enabled when it's restarting the daemon. you're right that running builds would behave the same way as the ssh mux does, but those aren't the problem. do you have systemd-level resource control enabled for the daemon that isn't turned on by default?
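To see which cgroups under the unit have controllers enabled for their children, one can read the `cgroup.subtree_control` files. This is a sketch under assumptions: the helper name `enabled_subtree_controllers` is invented here, and it relies only on the cgroup v2 convention that `cgroup.subtree_control` holds a space-separated list of controller names. Whether this is exactly the check systemd trips over is not confirmed by anything in this thread beyond the comment above.

```python
from pathlib import Path


def enabled_subtree_controllers(unit_cgroup):
    """Map each cgroup directory to the controllers it enables for its children."""
    result = {}
    for control_file in sorted(Path(unit_cgroup).rglob("cgroup.subtree_control")):
        controllers = control_file.read_text().split()
        if controllers:  # skip cgroups that enable nothing
            result[str(control_file.parent)] = controllers
    return result
```

An empty result for `/sys/fs/cgroup/system.slice/nix-daemon.service` would suggest that no controllers are delegated inside the unit; a non-empty one shows where they are turned on.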

Owner

update: it's ultimately the cgroup xp feature's fault. with that enabled we change the cgroup layout in such a way that systemd ultimately can't restart the daemon as long as any daemon child process is still alive. you'll have to turn that off to fix your problem
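A quick way to check whether the feature is switched on in a given nix.conf is sketched below. Assumptions are labeled: the feature name `cgroups` is taken from the issue title, `cgroups_feature_enabled` is a made-up helper, and the parser only understands plain `key = value` lines with `#` comments. It does not follow include directives or look at command-line overrides, so treat a negative answer with care.

```python
def cgroups_feature_enabled(nix_conf_text):
    """Best-effort check for the cgroups experimental feature in nix.conf text."""
    features = set()
    for raw in nix_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "experimental-features":
            features = set(value.split())  # replaces earlier values
        elif key == "extra-experimental-features":
            features |= set(value.split())  # appends to earlier values
    return "cgroups" in features
```

Passing it the contents of /etc/nix/nix.conf (the system configuration file listed in the `nix --version` output above) gives a first indication of whether the workaround applies.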

pennae changed title from "Lix fails to restart when using SSH multiplexing with persistent connections" to "cgroups xp feature blocks daemon restarts when any child process is still running" 2025-11-07 15:45:00 +00:00
Reference
lix-project/lix#1030