rpc doesn't handle disconnection gracefully (or something) #1016
Reference
lix-project/lix#1016
a recent ci failure looks a lot like either disconnection not being handled properly or lifetimes being wrong.
was "reproduced" on the same chain in testAsan too: https://buildkite.com/lix-project/lix/builds/5259#0199e27a-7963-4e51-87aa-c1bc182f3adb/27-2931
found another one
https://buildkite.com/lix-project/lix/builds/5255#0199e238-e14f-4f46-ba04-efb6a865d09f/27-2866
we've managed to reproduce this on the cl the ci run was testing by running the failing test at a 40x oversubscription factor.
this happened because timeout handling killed the build hook while rpc calls were still in progress.
error (ignored): error: resetBlockingState: Bad file descriptor
happens because killing the hook closes the log pipe while the reader is still active.
Exception: std::__exception_ptr::exception_ptr: capnp/rpc.c++:3561: disconnected: RpcSystem was destroyed.
happens because the rpc calls that are still running have their connections ripped out from underneath them. this is fundamentally a hook lifetime issue that is no longer reproducible in cl/4443, but we should still refactor the code to make these lifetime problems impossible.
since this hasn't been observed for a while now we'll consider it resolved. in our tests cl/4443 resolved all known causes of the error and it seems that no new causes have been found.
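for illustration, here is a minimal sketch of one way to make this class of lifetime problem harder to hit: every in-flight call holds a guard on the hook connection, and shutdown refuses new calls and waits for the running ones before tearing anything down. this is not the code from cl/4443 and not lix's actual hook implementation; `HookConnection`, `CallGuard`, `shutdown` and `demoShutdownWaitsForCalls` are hypothetical names, and it uses only the standard library rather than capnp.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for the build hook connection (not Lix's real class).
class HookConnection {
public:
    // RAII token for one in-flight call. While any guard exists,
    // shutdown() will not complete.
    class CallGuard {
    public:
        explicit CallGuard(HookConnection & conn) : conn_(conn) {
            std::lock_guard<std::mutex> lock(conn_.mutex_);
            if (conn_.closing_) throw std::runtime_error("hook is shutting down");
            ++conn_.inFlight_;
        }
        ~CallGuard() {
            std::lock_guard<std::mutex> lock(conn_.mutex_);
            --conn_.inFlight_;
            conn_.idle_.notify_all();
        }
        CallGuard(const CallGuard &) = delete;
        CallGuard & operator=(const CallGuard &) = delete;
    private:
        HookConnection & conn_;
    };

    // Destroying the connection with calls still running is the bug class
    // described above, so make it loud in debug builds.
    ~HookConnection() { assert(inFlight_ == 0); }

    // Refuse new calls, wait for running ones to finish, then tear down.
    void shutdown() {
        std::unique_lock<std::mutex> lock(mutex_);
        closing_ = true;
        idle_.wait(lock, [this] { return inFlight_ == 0; });
        closed_ = true; // stands in for killing the hook and closing its pipes
    }

    bool closed() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return closed_;
    }

    int inFlight() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return inFlight_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable idle_;
    int inFlight_ = 0;
    bool closing_ = false;
    bool closed_ = false;
};

// Starts one slow call, shuts down while it is running, and checks that
// shutdown only returned after the call finished and that later calls fail.
bool demoShutdownWaitsForCalls() {
    HookConnection conn;
    std::atomic<bool> callFinished{false};

    std::thread worker([&] {
        HookConnection::CallGuard guard(conn);
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        callFinished = true;
    });

    // Wait until the call is actually in flight before shutting down.
    while (conn.inFlight() == 0) std::this_thread::yield();
    conn.shutdown();
    bool ok = callFinished.load() && conn.closed();
    worker.join();

    try {
        HookConnection::CallGuard late(conn);
        ok = false; // a call after shutdown should have been rejected
    } catch (const std::runtime_error &) {
    }
    return ok;
}
```

a timeout handler built this way would still need to cancel the outstanding calls instead of waiting on them indefinitely; the sketch only shows the ordering (drain calls first, close pipes and kill the hook last).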