Intermittent "got unknown message type 1 from Nix daemon" from nix build
#732
Describe the bug
A few times recently I've observed the "got unknown message type 1 from Nix daemon" error when invoking `nix build` or `nix develop`. Presumably anything that builds a derivation would trigger it.

I haven't been very systematic about collecting information so far. Sometimes retrying the same command succeeds the second time, sometimes it doesn't. I resolved one persistent instance of this error by switching to lix 2.90 temporarily, but it's possible that restarting nix-daemon was the actual trigger there.
Steps To Reproduce
Unfortunately don't have any reliable reproduction steps yet. Will update this if I find them.
`nix --version` output

Well that's a critical bug. Damn.
If you manage to hit this reproducibly, can you try running the client inside of `rr record`? Then it will at least be captured in a way that it can eventually be extracted somehow.

Definitely! I did get an strace yesterday, but most of the actual data got cut off. These are the read calls right before the error:
It looks like that last one starts with `STDERR_ERROR`, but the protocol error almost certainly happened somewhere later in the buffer.

that sounds like another FramedSource/FramedSink mismatch; based on the recent nar fuckery we're going to guess that a nar isn't read to completion and what we're seeing is processStderr of remote stores seeing nar metadata where it expects commands
That would possibly match another behavior I was seeing recently where it freezes when downloading from a substituter.
@pennae wrote in #732 (comment):
Hmmm. So reading NARs isn't cancel safe. I wonder how we can manage cancel safety in async. Or perhaps figure out how to instrument cancellations?
it's just a guess, we do have to investigate further. async nar reading cannot be made cancel-safe in general because that must rely on destructors, and destructors cannot be async. Source-based reading can be made cancel-safe (though it largely wasn't and would've aborted when canceled). instrumenting cancels is likewise impossible without touching every exit point of every function being instrumented (since destructors cannot distinguish cancellation from normal exits) or patching kj :(
Hmmm. There are different types of cancellation too; like, if we made it crash when something is cancelled without the connection going down, that might prevent this type of bug by construction? But also I suspect we're going to be stuck patching kj for stuff like tracing anyway, so it's not an impossible option.
that doesn't sound like a good idea, handling ^C properly in an async runtime requires some amount of cancellation. the best we could do is mark data sources as tainted all the way down and crash if they've been used without maintaining all invariants, which we can luckily do since we don't use kj async streams
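For concreteness, here is a rough sketch of what that tainting could look like; all names here are hypothetical and this is not existing Lix code, just one way to turn a broken read invariant into an immediate crash instead of a desynced stream.

```cpp
// Hypothetical sketch of the "tainted source" idea described above (none of
// these names exist in Lix): a wrapper remembers when a consumer announced a
// read and then abandoned it, and from that point on refuses to serve data,
// so a desynced connection crashes immediately instead of misparsing.
#include <cstddef>
#include <cstdlib>

struct Source
{
    virtual ~Source() = default;
    // Read exactly `len` bytes into `data`.
    virtual void read(char * data, size_t len) = 0;
};

class TaintableSource : public Source
{
    Source & inner;
    size_t pending = 0;   // bytes announced but not yet consumed
    bool tainted = false; // set once an announced read was abandoned

public:
    explicit TaintableSource(Source & inner) : inner(inner) {}

    // A consumer declares how much it is about to consume (one framed chunk,
    // one NAR member, ...). Announcing again while data is still pending means
    // the previous consumer bailed out early, e.g. on an exception or cancel.
    void expect(size_t len)
    {
        if (pending != 0)
            tainted = true;
        pending = len;
    }

    void read(char * data, size_t len) override
    {
        if (tainted || len > pending)
            std::abort(); // invariant broken: crash rather than desync
        inner.read(data, len);
        pending -= len;
    }
};
```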
@olivia wrote in #732 (comment):
are remote builders involved here? there's a possibility that the error you're correctly pointing to is the daemon reporting a remote store error it received from another daemon, such as a remote builder
also, which daemon version did this error happen with? (we really should start adding a lot more debug info to error reports)
@pennae wrote in #732 (comment):
Oh, yeah that's definitely a possibility. I have a remote builder configured, but don't know if it was used when I got that specific error. The local daemon should be 2.93.0-dev-pre20250311-85a140a (same as the client), but the remote one may have been older.
@olivia wrote in #732 (comment):
the remote builder version and the full command line and output of the failing lix invocations would be very helpful info. at this point the only thing we can think of that may have caused this error is a bug in the nar code that has been fixed about a week ago
here's how this could fail:

1. a nar copy over a daemon connection stops partway through, leaving unread nar bytes such as `\1\0\0\0\0\0\0\0)\0\0\0\0\0\0\0` in the stream. this copy throws an exception. crucially the connection is not invalidated, which is in itself outlandishly sketchy¹
2. a later operation reuses the connection and reads those bytes where it expects a message type, getting 1, the length (LE) of the next nar tag that was never read. it then crashes

Hmmm. Could this happen if you serialize a path with a case hack in it from a store with case hack disabled, causing the remote to fail the nar parsing? What other causes are there for copyNAR to fail?
Case hack is darwin-only right? The possible remote builder in my case is x86_64-linux.
@jade wrote in #732 (comment):
it could happen any time the nar parser exits prematurely outside of a framed sink/source pair, in such a way that the connection is reused. that pretty much requires running in a thread pool since otherwise the exception will bubble upwards, which means most likely either a flavor of copyPaths or queryMissing. queryMissing doesn't involve the nar parser though
this is all extremely cursed and made no easier by the fact that the wire protocol has no coherent design :(
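Roughly, the protection a framed pair gives (a simplified sketch with made-up names, not the actual FramedSource implementation): the payload travels as length-prefixed chunks ending in a zero-length terminator, so the receiving side can skip to the end of the frame and leave the connection positioned at the next real message even when the consumer stops early. A raw NAR transfer has no such terminator, so an early exit strands half a NAR in the buffer.

```cpp
// Simplified sketch (made-up names) of draining a framed payload after the
// consumer gives up, keeping the connection usable: chunks are prefixed with
// 64-bit little-endian length words and a zero-length chunk ends the frame.
#include <cstdint>
#include <vector>

struct Wire
{
    std::vector<uint8_t> buf;
    size_t pos = 0;

    uint64_t readUint64()
    {
        uint64_t n = 0;
        for (int i = 0; i < 8; ++i)
            n |= uint64_t(buf.at(pos++)) << (8 * i);
        return n;
    }

    void skip(uint64_t len) { pos += len; } // discard payload bytes
};

// Consume the rest of a frame: read chunk headers and skip chunk bodies until
// the zero-length terminator, so whatever comes next on the connection is a
// real protocol message again.
void drainFrame(Wire & wire)
{
    for (;;) {
        uint64_t chunkLen = wire.readUint64();
        if (chunkLen == 0)
            return;
        wire.skip(chunkLen);
    }
}
```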
Got this again, and was able to grab debug logs and an rr recording. It is failing on a remote build, where the local nix client/daemon and remote nix client/daemon are all `2.93.0-dev-pre20250311-85a140a`. The command that triggered it was `nix develop nixpkgs#python3Packages.feedparser`, against nixpkgs `85f1ba3e51676fa8cc604a3d863d729026a6b8eb`. Running the command a second time succeeds, but I can reliably trigger a failure with `nix store delete /nix/store/fppv85zxdlscqwflx6cfgbm93168dbl6-python3.12-feedparser-6.0.11-env.drv` and then `nix develop ...`.

The rr recordings are a combined couple hundred MB, and presumably not useful without my lix binary, so I didn't want to upload them here without checking first. I can put them on my own server if that would be better.
edit: oof, my lix client build also doesn't have debug symbols. Gonna try rebuilding it and hope I can still get the error.
After switching to lix with debug symbols, I'm no longer able to reproduce the crash with feedparser. Likely this is either because it restarted nix-daemon, or because it built some other derivations first which may have put things in a different state.
awesome! we can reproduce this reliably now (requiring garbage collection between attempts, but that's fine). ssh and ssh-ng remotes get different errors; ssh-ng gets your error while ssh gets "input does not look like a nix archive". so there is definitely nar parser fuckery going on
This issue was mentioned on Gerrit on the following CLs: