Deadlock on daemon interruption during path substitution #577
Synopsis
A Lix daemon, version `c859d03013`, got stuck after its corresponding client got Ctrl-C'd (I believe that was proximate to the sadness, at least).

This stuck state leads to further clients that try to build the same paths getting stuck on path locks, for which we have poor diagnostics (#535), which is somewhat confusing.
Diagnosis
Involved stack traces:
PathSubstitutionGoal threads:
Thread "puppy":
Thread "kitty":
Main thread:
Notably, there is no thread with `workerThreadEntry` in the call stack. This is almost certainly related.

More detailed execution state description:
The main thread is waiting on `PathSubstitutionGoal::thr.get()`, where `thr` is one of the PathSubstitutionGoal threads mentioned above; that is, it is waiting to join that thread. Those threads are either "puppy" or "kitty": puppy threads are waiting on `transfer->downloadEvent`, and kitty threads are waiting on `transfer->metadataPromise.get_future().get()`.

In both cases, these are states that are progressed by the download thread:

- For puppy threads, progress would mean the download thread invoking the data callback with more data.
- For kitty threads, progress would mean receiving headers (or finishing the request along with the headers?).

The main thread is stuck because it is waiting on either a puppy or a kitty thread.
So the root cause appears to be that the download thread can exit without fulfilling all of its expectations, and without notifying the waiters that it cannot complete them.
How did this happen?
Well, that is the question. I have no idea what the correct order of operations is to cleanly tear down the download thread while rejecting all of its outstanding expectations.
Effectively, each wait on a synchronization primitive needs to be multiplexed with a possible error channel, whose error probably gets rethrown at the wait site.
I also don't know why this used to work. At one point, the answer was that the daemon simply crashed when the client hung up, due to buggy signal handling.
Now that the signal handling is a bit less buggy, it seems the problem is that the async runtime is being bypassed by a blocking wait on a future on the main thread, so the runtime cannot switch away from the task and is stuck in blocking code.
The puppy/kitty worker threads getting abandoned appears to be intentional, based on the TODO above, and is not itself a big problem. The issue is that the main thread gets stuck waiting on one of them in a way that does not let the async runtime do other work, and thus the whole thing cannot be torn down properly.
(assigning pennae per discussions since they are fixing it at the moment)