rpc doesn't handle disconnection gracefully (or something) #1016
	
a recent ci failure looks a lot like either disconnection not being handled properly or lifetimes being wrong.
was "reproduced" on the same chain in testAsan too: https://buildkite.com/lix-project/lix/builds/5259#0199e27a-7963-4e51-87aa-c1bc182f3adb/27-2931
found another one: https://buildkite.com/lix-project/lix/builds/5255#0199e238-e14f-4f46-ba04-efb6a865d09f/27-2866
we've managed to reproduce this on the cl the ci run was testing by running the failing test at a 40x oversubscription factor.
this happened because timeout handling killed the build hook while rpc calls were still in progress.
`error (ignored): error: resetBlockingState: Bad file descriptor` happens because killing the hook closes the log pipe while the reader is still active.
`Exception: std::__exception_ptr::exception_ptr: capnp/rpc.c++:3561: disconnected: RpcSystem was destroyed.` happens because the rpc calls that are still running have their connections ripped out from underneath them.
this is fundamentally a hook lifetime issue that is no longer reproducible in cl/4443, but we should still refactor the code to make these lifetime problems impossible.
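
for illustration, here is a minimal sketch of one way to rule out this class of bug by construction: every in-flight call holds an RAII guard, and the timeout path refuses new calls and waits for the guards to drain before it closes anything. this is not lix code and does not use capnp; `Hook`, `CallGuard`, `shutdown`, `isOpen` and `runDemo` are hypothetical names, and the "pipe" is just a bool standing in for the log pipe and rpc connection.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical stand-in for the build hook and the resources it owns.
class Hook {
public:
    // RAII token for one in-flight call. While any token exists,
    // shutdown() will not close the pipe.
    class CallGuard {
    public:
        explicit CallGuard(Hook & hook) : hook(hook) {
            std::lock_guard<std::mutex> lock(hook.mutex);
            if (hook.shuttingDown) throw std::runtime_error("hook is shutting down");
            ++hook.inFlight;
        }
        ~CallGuard() {
            std::lock_guard<std::mutex> lock(hook.mutex);
            if (--hook.inFlight == 0) hook.drained.notify_all();
        }
        CallGuard(const CallGuard &) = delete;
        CallGuard & operator=(const CallGuard &) = delete;
    private:
        Hook & hook;
    };

    // Timeout path: refuse new calls, wait for running ones, then close.
    void shutdown() {
        std::unique_lock<std::mutex> lock(mutex);
        shuttingDown = true;
        drained.wait(lock, [this] { return inFlight == 0; });
        pipeOpen = false;
    }

    bool isOpen() {
        std::lock_guard<std::mutex> lock(mutex);
        return pipeOpen;
    }

private:
    std::mutex mutex;
    std::condition_variable drained;
    int inFlight = 0;
    bool shuttingDown = false;
    bool pipeOpen = true;
};

struct DemoResult {
    int sawClosedPipe;        // calls that observed the pipe closed mid-call
    int completedBeforeClose; // calls finished by the time shutdown() returned
    bool lateCallRejected;    // a call started after shutdown was refused
};

DemoResult runDemo() {
    Hook hook;
    std::atomic<int> started{0}, completed{0}, sawClosedPipe{0};
    std::vector<std::thread> calls;
    for (int i = 0; i < 4; ++i) {
        calls.emplace_back([&] {
            Hook::CallGuard guard(hook); // destroyed last, after ++completed
            ++started;
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            if (!hook.isOpen()) ++sawClosedPipe;
            ++completed;
        });
    }
    // Make sure every call is in flight before the "timeout" fires.
    while (started < 4) std::this_thread::yield();
    hook.shutdown();
    int completedBeforeClose = completed;

    bool lateCallRejected = false;
    try {
        Hook::CallGuard late(hook);
    } catch (const std::runtime_error &) {
        lateCallRejected = true;
    }
    for (auto & t : calls) t.join();
    return {sawClosedPipe.load(), completedBeforeClose, lateCallRejected};
}
```

calling `runDemo()` starts four fake calls, triggers shutdown while they are still running, and reports whether any of them saw the pipe closed underneath it. a real fix would also need cancellation so that a hung call cannot block the timeout forever; the sketch only shows the ownership ordering.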