CI jobs timing out due to plausible daemon bug? #549
Various CI runs time out after 3 hours due to what appears to be a deadlock or broken async code. I have a core dump of a bad build, but we don't have debuginfo on kj, so it's not practical to figure out the async state that led to it getting busted.
Generally, the jobs with this problem have been NixOS tests. The CI cluster has a bunch of interesting properties that might be relevant: it is building with something like:
The client side has been observed to be waiting for stderr, I believe, somewhere around here: https://git.lix.systems/lix-project/lix/src/ee0c195eba7d16b796fd9883e3fe88c0d64ff0bf/src/libstore/remote-store.cc#L902
So it's almost certainly stuck on the daemon side.
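To make the inference above concrete, here is a minimal sketch (in Python, not Lix's actual C++ protocol code; the socket pair and timeout are illustrative assumptions) of why a client blocked reading daemon output implies the hang is on the daemon side: the read can only return when the daemon writes, so a daemon that never writes leaves the client waiting indefinitely.

```python
import socket

# Stand-ins for the daemon socket: the client end blocks reading the
# daemon's next stderr/log message, as in remote-store.cc.
client, daemon = socket.socketpair()

# The real client has no timeout and waits forever; a short timeout here
# stands in for the 3-hour CI job timeout.
client.settimeout(0.2)
stuck = False
try:
    client.recv(4096)  # the wedged daemon never sends anything
except socket.timeout:
    stuck = True

print("client stuck waiting on daemon" if stuck else "daemon responded")
client.close()
daemon.close()
```

The client itself is doing nothing wrong; it is simply parked in a read that can never complete, which is why the core dump of interest is the daemon's.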
Anyway, here is a stack trace of the daemon from the core dump I pulled:
This issue was mentioned on Gerrit on the following CLs:
The CI builder has an `ssh://localhost` remote builder configured. If the scheduler ever chooses this builder to build a derivation, it deadlocks on the derivation lock: the lock is at that point held by the original daemon instance, and the faux SSH store tries to acquire it as well.

In fact, haven't we seen things like this break before already? Only perhaps less frequently for some reason, which could totally be related to the scheduling algorithm changing.
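The self-deadlock described above can be sketched with a plain file lock (a hedged illustration, not Lix's actual lock implementation; the lock-file path and `flock(2)` usage are assumptions): one daemon holds the derivation lock, and the `ssh://localhost` "remote" builder, being a second daemon instance on the same machine, tries to take the same lock and blocks forever while the first daemon waits on it.

```python
import fcntl
import os
import tempfile

# Hypothetical per-derivation build lock file.
path = os.path.join(tempfile.mkdtemp(), "drv.lock")

# "Original daemon": acquires the derivation lock and holds it while
# waiting for the build to finish on the chosen builder.
fd_daemon = os.open(path, os.O_CREAT | os.O_RDWR)
fcntl.flock(fd_daemon, fcntl.LOCK_EX)

# "ssh://localhost builder": a distinct open file description trying to
# take the same lock. LOCK_NB is used only to demonstrate that the
# acquisition would block; the real builder blocks indefinitely, and the
# daemon in turn waits on the builder -- a deadlock.
fd_builder = os.open(path, os.O_CREAT | os.O_RDWR)
blocked = False
try:
    fcntl.flock(fd_builder, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    blocked = True

print("second acquisition would block: deadlock" if blocked else "acquired")
```

Since `flock` locks are not reentrant across open file descriptions, the second daemon can never make progress until the first releases the lock, which it never does because it is waiting on the second.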
cc @raito