cgroups xp feature blocks daemon restarts when any child process is still running #1030
Describe the bug
When rebuilding my system, I run into this issue when the lix daemon gets restarted:
When this happens, there is a single SSH process left behind in the supervisor cgroup:
This process is most likely a connection to a remote builder that is being kept alive because of my SSH multiplexing configuration. Killing that process manually and starting nix-daemon.socket again fixes the issue and allows the daemon to be started again.
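For anyone trying to confirm the same symptom, here is a minimal sketch of how the leftover processes in a cgroup could be listed. It assumes a cgroup v2 hierarchy; the function name `leftover_processes` and its parameters are made up for illustration, and the path of the supervisor cgroup is not given in this issue, so it has to be supplied by the caller.

```python
from pathlib import Path


def leftover_processes(cgroup_dir, proc_root="/proc"):
    """Return (pid, command name) pairs for the processes listed in a cgroup v2 directory.

    Hypothetical helper: cgroup_dir is whatever directory under the cgroup
    filesystem you want to inspect. Only processes that are direct members
    of that cgroup appear in cgroup.procs; child cgroups are not descended into.
    """
    procs_file = Path(cgroup_dir) / "cgroup.procs"
    result = []
    for line in procs_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        pid = int(line)
        comm_file = Path(proc_root) / str(pid) / "comm"
        try:
            name = comm_file.read_text().strip()
        except OSError:
            # The process may have exited between reading cgroup.procs and comm.
            name = "?"
        result.append((pid, name))
    return result
```

If this returns a lone `ssh` entry after the daemon has been stopped, that would match the situation described above.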
Steps To Reproduce
ControlPersist 60m for connections made by the root user
I have configured SSH with this snippet in /etc/ssh/ssh_config:
Expected behavior
Lix should properly tear down its child processes so that it can restart without such issues.
nix --version output
that's (unfortunately) working exactly as it should: lix never explicitly requests multiplexing and as such can never explicitly stop multiplexing without disturbing the rest of the system. ssh decides on its own to daemonize a process lix doesn't know about, and the cgroups capture that process (as they should). even if lix did explicitly request multiplexing it couldn't tear down the multiplexers because restarting a daemon does not kill running builds, and those may still need the multiplexer process in question. lix could at best attempt to move the multiplexer process it has inadvertently spawned out of its own cgroup, but where to? the daemon may be running in single-user mode where the cgroup hierarchy is different, it may even be running in a configuration where moving the mux process is entirely impossible.
at this point the only safe way to use ssh muxing with lix is to set the persist timeout as high as possible and to ensure that the mux process always runs outside of the lix cgroup, either by running an outside service that keeps the mux alive or by somehow socket-activating the mux connection :(
But if builds are kept running when the daemon exits, wouldn't those also prevent the daemon from being restarted? How would that work differently than the issue that I'm seeing?
This issue only started happening recently; before that, this configuration didn't cause issues.
good grief, we didn't fully realize what's happening here without digging into systemd source code. what systemd is complaining about is that the sub-cgroup has controllers enabled when it's restarting the daemon. you're right that running builds would behave the same way as the ssh mux does, but those aren't the problem. do you have systemd-level resource control enabled for the daemon that isn't turned on by default?
update: it's ultimately the cgroup xp feature's fault. with that enabled we change the cgroup layout in such a way that systemd ultimately can't restart the daemon as long as any daemon child process is still alive. you'll have to turn that off to fix your problem
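To check whether a given config actually has the feature turned on, something like the following rough sketch might help. It assumes the feature shows up as the word `cgroups` under the `experimental-features` or `extra-experimental-features` settings in nix.conf-style text; the function names are made up for illustration, and the sketch only looks at the text it is handed, so `include` directives and other config sources are ignored.

```python
def enabled_xp_features(conf_text):
    """Collect experimental feature names from nix.conf-style text.

    Hypothetical helper: handles only `key = value` lines and `#` comments.
    """
    features = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "experimental-features":
            # A plain assignment replaces whatever was collected so far.
            features = set(value.split())
        elif key == "extra-experimental-features":
            # The extra- form adds to the current value.
            features |= set(value.split())
    return features


def cgroups_enabled(conf_text):
    """True if the (assumed) feature name `cgroups` is enabled in conf_text."""
    return "cgroups" in enabled_xp_features(conf_text)
```

For example, `cgroups_enabled(open("/etc/nix/nix.conf").read())` would check the system-wide file, assuming that is where the setting lives on your machine.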
Title changed from "Lix fails to restart when using SSH multiplexing with persistent connections" to "cgroups xp feature blocks daemon restarts when any child process is still running"