rpc doesn't handle disconnection gracefully (or something) #1016

Open
opened 2025-10-18 20:57:27 +00:00 by pennae · 3 comments
Owner

a [recent ci failure](https://buildkite.com/lix-project/lix/builds/5259#0199e27a-7937-4593-844b-dee89bdc6e50/27-2836) looks a lot like either disconnection not being handled properly or lifetimes being wrong.
was "reproduced" on the same chain in testAsan too: https://buildkite.com/lix-project/lix/builds/5259#0199e27a-7963-4e51-87aa-c1bc182f3adb/27-2931
found another one https://buildkite.com/lix-project/lix/builds/5255#0199e238-e14f-4f46-ba04-efb6a865d09f/27-2866
Author
Owner

we've managed to reproduce this on the cl the ci run was testing by running the failing test at a 40x oversubscription factor.

this happened because timeout handling killed the build hook while rpc calls were still in progress:

- `error (ignored): error: resetBlockingState: Bad file descriptor` happens because killing the hook closes the log pipe while the reader is still active.
- `Exception: std::__exception_ptr::exception_ptr: capnp/rpc.c++:3561: disconnected: RpcSystem was destroyed.` happens because the rpc calls that are still running have their connections ripped out from underneath them.

this is fundamentally a hook lifetime issue that is no longer reproducible in cl/4443, but we should still refactor the code to make these lifetime problems impossible.
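
as a rough illustration of what "making these lifetime problems impossible" could look like, here is a minimal standard-library sketch. it is not lix code and not the capnp api: `HookConnection`, `CallGuard`, `beginCall` and `shutdown` are hypothetical names made up for this example. the idea is that every in-flight call holds a guard, and teardown refuses new calls and waits for the guards to drain before anything gets closed. a real fix in a promise-based codebase would more likely cancel the outstanding promises than block on them.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for the build hook's connection state.
// In-flight calls register themselves, and shutdown() cannot finish
// while any of them is still running.
class HookConnection {
public:
    // Move-only token held for the duration of one call.
    // Destroying it marks the call as finished.
    class CallGuard {
    public:
        explicit CallGuard(HookConnection & c) : conn(&c) {}
        CallGuard(CallGuard && other) noexcept : conn(other.conn) { other.conn = nullptr; }
        CallGuard(const CallGuard &) = delete;
        CallGuard & operator=(const CallGuard &) = delete;
        CallGuard & operator=(CallGuard &&) = delete;
        ~CallGuard() { if (conn) conn->finishCall(); }
    private:
        HookConnection * conn;
    };

    // Register a new call. Refused once shutdown has begun, so no call
    // can start on a connection that is about to go away.
    CallGuard beginCall() {
        std::lock_guard<std::mutex> lock(mutex);
        if (closing) throw std::runtime_error("hook is shutting down");
        ++inFlight;
        return CallGuard(*this);
    }

    // Stop accepting calls, then block until all in-flight calls are done.
    // Only after this returns would it be safe to close pipes or kill the hook.
    void shutdown() {
        std::unique_lock<std::mutex> lock(mutex);
        closing = true;
        drained.wait(lock, [this] { return inFlight == 0; });
        closed = true;
    }

    bool isClosed() const {
        std::lock_guard<std::mutex> lock(mutex);
        return closed;
    }

    int callsInFlight() const {
        std::lock_guard<std::mutex> lock(mutex);
        return inFlight;
    }

private:
    void finishCall() {
        std::lock_guard<std::mutex> lock(mutex);
        if (--inFlight == 0) drained.notify_all();
    }

    mutable std::mutex mutex;
    std::condition_variable drained;
    int inFlight = 0;
    bool closing = false;
    bool closed = false;
};
```

with this shape, a timeout handler would call `shutdown()` instead of killing the hook directly. a caller that tries `beginCall()` afterwards gets an exception at a well-defined point rather than a disconnect in the middle of a call.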
Reference
lix-project/lix#1016