Lix client holds onto eval-cache lock while building #608
Reference: lix-project/lix#608
Describe the bug
Process with id 3112563 is holding a write lock.
A stack trace from this process tells us that it is busy building something.
Steps To Reproduce
Start multiple builds of firefox from staging on the same machine
The second and following processes get stuck evaluating
Expected behavior
First process should drop the write lock as soon as it's done evaluating
nix --version output
nix (Lix, like Nix) 2.92.0-dev-pre20241211-92ed9fe
Additional context
Investigating this by using rr to trace builds.
It looks like we actually regressed this in 2.91 over 2.90.
Previously, in 2.90, the eval cache was destroyed in toDerivedPaths from nix::Installable::build2. Now it is destroyed with the Command instance, which is definitely broken.
2.90:
2.91: destroyed with the InstallablesCommand:
main:
Proximate cause
We observe that the eval caches are being held alive by the installables
themselves as of 2.91. In 2.92, this is in CachingEvaluator.
The problem is that CachingEvaluator is intentionally holding these references alive longer than they otherwise would live.
That being said, neither CachingEvaluator nor the evaluator itself should live long enough to still exist when a build starts, so we still have a way to fix this.
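To illustrate the mechanism described above, here is a minimal sketch of how a long-lived owner that stores shared references keeps a lock-holding object alive after its user is done with it. All names (FakeEvalCache, FakeCachingEvaluator, locksHeldAfterUse) are hypothetical stand-ins and not Lix's actual classes; the "lock" is just a counter.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical counter standing in for "number of write locks currently held".
int g_cacheLocksHeld = 0;

// Hypothetical stand-in for an eval cache that holds a write lock
// for as long as the object exists.
struct FakeEvalCache {
    FakeEvalCache() { ++g_cacheLocksHeld; }
    ~FakeEvalCache() { --g_cacheLocksHeld; }
    FakeEvalCache(const FakeEvalCache &) = delete;
    FakeEvalCache & operator=(const FakeEvalCache &) = delete;
};

// Hypothetical long-lived owner, playing the role of a caching evaluator.
// It keeps its own shared_ptr to every cache it has handed out.
struct FakeCachingEvaluator {
    std::map<std::string, std::shared_ptr<FakeEvalCache>> caches;

    std::shared_ptr<FakeEvalCache> open(const std::string & key) {
        auto & slot = caches[key];
        if (!slot) slot = std::make_shared<FakeEvalCache>();
        return slot;
    }
};

// Returns how many locks are still held after the caller's own reference
// to the cache has gone out of scope.
int locksHeldAfterUse(bool dropOwnerToo) {
    auto owner = std::make_unique<FakeCachingEvaluator>();
    {
        auto cache = owner->open("firefox");
        // ... evaluation would use `cache` here ...
    } // the caller's reference is gone, but the owner's copy is not
    if (dropOwnerToo) owner.reset();
    return g_cacheLocksHeld;
}
```

In this sketch the lock is only released once the owner itself is destroyed, which is why the owner's lifetime is what matters.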
The reason that this happened is this optimization:
Which is indeed contained in 2.91 but not in 2.90. Cool!
Fixing it
We need to kill the evaluator object once the evaluation phase is done. That's my next step.
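As a rough sketch of what "kill the evaluator once evaluation is done" means in terms of object lifetime, the following compares a command that scopes its evaluator to the evaluation phase with one that keeps it alive through the build. ScopedEvaluator, buildPaths and both run functions are hypothetical names for illustration only, and the lock is simulated with a counter.

```cpp
#include <string>
#include <vector>

// Hypothetical counter standing in for "number of eval-cache locks held".
int g_evalLocksHeld = 0;

// Hypothetical evaluator that holds a lock for its whole lifetime.
struct ScopedEvaluator {
    ScopedEvaluator() { ++g_evalLocksHeld; }
    ~ScopedEvaluator() { --g_evalLocksHeld; }
    ScopedEvaluator(const ScopedEvaluator &) = delete;
    ScopedEvaluator & operator=(const ScopedEvaluator &) = delete;

    // Pretend evaluation: map an installable to a derivation path.
    std::vector<std::string> evaluate(const std::string & installable) const {
        return {installable + ".drv"};
    }
};

// Pretend build step. Reports how many locks were held when it started.
int buildPaths(const std::vector<std::string> & drvPaths) {
    (void) drvPaths;
    return g_evalLocksHeld;
}

// Evaluator is confined to the evaluation phase and destroyed before building.
int runBuildScoped(const std::string & installable) {
    std::vector<std::string> drvPaths;
    {
        ScopedEvaluator evaluator;
        drvPaths = evaluator.evaluate(installable);
    } // evaluator destroyed here, lock released
    return buildPaths(drvPaths);
}

// Evaluator outlives the build, as in the behaviour reported in this issue.
int runBuildLeaky(const std::string & installable) {
    ScopedEvaluator evaluator;
    auto drvPaths = evaluator.evaluate(installable);
    return buildPaths(drvPaths);
}
```

Under these assumptions, only the scoped variant starts the build with no lock held.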
Implementation plan: obliterate getEvaluator() from the entire CLI, so that it has an explicit lifetime. Unfortunately this is a couple of days of work by itself, and it might be better to just remove the optimization. But also, even if we do remove this, I am not sure to what extent the evaluator is getting torn down in a meaningful way when it really should be.
cc #313, which is the bug that the optimization causing this problem fixes.
edit: hm, maybe I can actually move the state out of CachingEvaluator and pass it explicitly, since it seems quite sparsely used.
Another possible way to fix this is to make the EvalCache discard its State when not actively in use. This might result in excess sqlite connect/disconnect cycles, but it would solve this bug without having to massively refactor the entire CLI.
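A minimal sketch of that second idea, assuming a cache object that only holds its database state for the duration of each access. FakeDbState, LazyEvalCache and withState are hypothetical names, not the real EvalCache API, and the connection is simulated with counters so the extra connect/disconnect cycles become visible.

```cpp
#include <memory>

int g_openConnections = 0; // connections currently open (simulated)
int g_totalConnects = 0;   // how many times a connection was opened

// Hypothetical stand-in for the cache's database state (connection plus lock).
struct FakeDbState {
    FakeDbState() { ++g_openConnections; ++g_totalConnects; }
    ~FakeDbState() { --g_openConnections; }
};

// Hypothetical cache that acquires its state per access and drops it afterwards.
// Note: this sketch does not support nested withState calls.
class LazyEvalCache {
    std::unique_ptr<FakeDbState> state;

public:
    template<typename F>
    auto withState(F && fn) {
        state = std::make_unique<FakeDbState>();
        // Release the state on every exit path, including exceptions.
        struct Closer {
            std::unique_ptr<FakeDbState> & s;
            ~Closer() { s.reset(); }
        } closer{state};
        return fn(*state);
    }

    bool idle() const { return !state; }
};
```

The cache object can then live as long as the command does without holding anything while idle, at the cost of one simulated connect per access.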
All of this said, I no longer believe this should be a release blocker after discussion with @raito:
For these reasons, I don't think it's worth further delaying the release on this bug. It should still be fixed, but doing so is quite hard since flakes/new-cli are a hot mess of overbroad-scoped state.
The annoying thing is: the second process does not respond properly to Ctrl-C