Intermittent "got unknown message type 1 from Nix daemon" from nix build #732

Closed
opened 2025-03-13 06:47:48 +00:00 by olivia · 21 comments
Member

Describe the bug

A few times recently I've observed the "got unknown message type 1 from Nix daemon" error when invoking nix build or nix develop. Presumably anything that builds a derivation would trigger it.

I haven't been very systematic at collecting information so far. Sometimes retrying the same command succeeds the second time, sometimes it doesn't. I resolved one persistent instance of this error by switching to lix 2.90 temporarily, but it's possible that restarting nix-daemon was the actual trigger there.

Steps To Reproduce

Unfortunately don't have any reliable reproduction steps yet. Will update this if I find them.

nix --version output

nix (Lix, like Nix) 2.93.0-dev-pre20250311-85a140a
System type: x86_64-linux
Additional system types: i686-linux, x86_64-v1-linux, x86_64-v2-linux, x86_64-v3-linux, x86_64-v4-linux
Features: gc, signed-caches
System configuration file: /etc/nix/nix.conf
User configuration files: /home/olivia/.config/nix/nix.conf:/etc/xdg/nix/nix.conf:/home/olivia/.nix-profile/etc/xdg/nix/nix.conf:/nix/profile/etc/xdg/nix/nix.conf:/home/olivia/.local/state/nix/profile/etc/xdg/nix/nix.conf:/etc/profiles/per-user/olivia/etc/xdg/nix/nix.conf:/nix/var/nix/profiles/default/etc/xdg/nix/nix.conf:/run/current-system/sw/etc/xdg/nix/nix.conf
Store directory: /nix/store
State directory: /nix/var/nix
Data directory: /nix/store/c59s548a5agbwyq8rjn4aszq4q0r6nzi-lix-2.93.0-dev-pre20250311-85a140a/share
## Describe the bug A few times recently I've observed the "got unknown message type 1 from Nix daemon" error when invoking `nix build` or `nix develop`. Presumably anything that builds a derivation would trigger it. I haven't been very systematic at collecting information so far. Sometimes retrying the same command succeeds the second time, sometimes it doesn't. I resolved one persistent instance of this error by switching to lix 2.90 temporarily, but it's possible that restarting nix-daemon was the actual trigger there. ## Steps To Reproduce Unfortunately don't have any reliable reproduction steps yet. Will update this if I find them. ## `nix --version` output ``` nix (Lix, like Nix) 2.93.0-dev-pre20250311-85a140a System type: x86_64-linux Additional system types: i686-linux, x86_64-v1-linux, x86_64-v2-linux, x86_64-v3-linux, x86_64-v4-linux Features: gc, signed-caches System configuration file: /etc/nix/nix.conf User configuration files: /home/olivia/.config/nix/nix.conf:/etc/xdg/nix/nix.conf:/home/olivia/.nix-profile/etc/xdg/nix/nix.conf:/nix/profile/etc/xdg/nix/nix.conf:/home/olivia/.local/state/nix/profile/etc/xdg/nix/nix.conf:/etc/profiles/per-user/olivia/etc/xdg/nix/nix.conf:/nix/var/nix/profiles/default/etc/xdg/nix/nix.conf:/run/current-system/sw/etc/xdg/nix/nix.conf Store directory: /nix/store State directory: /nix/var/nix Data directory: /nix/store/c59s548a5agbwyq8rjn4aszq4q0r6nzi-lix-2.93.0-dev-pre20250311-85a140a/share ```
Owner

Well that's a critical bug. Damn.

Well that's a critical bug. Damn.
Owner

If you manage to hit this reproducibly, can you try running the client inside of rr record? Then it will at least be captured in a way that it can eventually be extracted somehow.

If you manage to hit this reproducibly, can you try running the client inside of `rr record`? Then it will at least be captured in a way that it can *eventually* be extracted somehow.
Author
Member

Definitely! I did get an strace yesterday, but most of the actual data got cut off. These are the read calls right before the error:

read(6, "TLSR\0\0\0\0\1\0\0\0t\363\5\0i\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0"..., 32768) = 320
read(6, "POTS\0\0\0\0\4\0\0\0t\363\5\0", 32768) = 16
read(6, "POTS\0\0\0\0\2\0\0\0t\363\5\0POTS\0\0\0\0\1\0\0\0t\363\5\0"..., 32768) = 48
read(6, "ptxc\0\0\0\0\5\0\0\0\0\0\0\0Error\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 224

It looks like that last one starts with STDERR_ERROR, but the protocol error almost certainly happened somewhere later in the buffer.

Definitely! I did get an strace yesterday, but most of the actual data got cut off. These are the read calls right before the error: ``` read(6, "TLSR\0\0\0\0\1\0\0\0t\363\5\0i\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0"..., 32768) = 320 read(6, "POTS\0\0\0\0\4\0\0\0t\363\5\0", 32768) = 16 read(6, "POTS\0\0\0\0\2\0\0\0t\363\5\0POTS\0\0\0\0\1\0\0\0t\363\5\0"..., 32768) = 48 read(6, "ptxc\0\0\0\0\5\0\0\0\0\0\0\0Error\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 224 ``` It looks like that last one starts with `STDERR_ERROR`, but the protocol error almost certainly happened somewhere later in the buffer.
Owner

that sounds like another FramedSource/FramedSink mismatch; based on the recent nar fuckery we're going to guess that a nar isn't read to completion and what we're seeing is processStderr of remote stores seeing nar metadata where it expects commands

that sounds like another FramedSource/FramedSink mismatch; based on the recent nar fuckery we're going to guess that a nar isn't read to completion and what we're seeing is processStderr of remote stores seeing nar metadata where it expects commands
Author
Member

That would possibly match another behavior I was seeing recently where it freezes when downloading from a substituter.

That would possibly match another behavior I was seeing recently where it freezes when downloading from a substituter.
Owner

@pennae wrote in #732 (comment):

that sounds like another FramedSource/FramedSink mismatch; based on the recent nar fuckery we're going to guess that a nar isn't read to completion and what we're seeing is processStderr of remote stores seeing nar metadata where it expects commands

Hmmm. So reading NARs isn't cancel safe. I wonder how we can manage cancel safety in async. Or perhaps figure out how to instrument cancellations?

@pennae wrote in https://git.lix.systems/lix-project/lix/issues/732#issuecomment-9445: > that sounds like another FramedSource/FramedSink mismatch; based on the recent nar fuckery we're going to guess that a nar isn't read to completion and what we're seeing is processStderr of remote stores seeing nar metadata where it expects commands Hmmm. So reading NARs isn't cancel safe. I wonder how we can manage cancel safety in async. Or perhaps figure out how to instrument cancellations?
Owner

it's just a guess, we do have to investigate further. async nar reading cannot be made cancel-safe in general because that must rely on destructors, and destructors cannot be async. Source-based reading can be made cancel-safe (though it largely wasn't and would've aborted when canceled). instrumenting cancels is likewise impossible without touching every exit point of every function being instrumented (since destructors cannot distinguish cancellation from normal exits) or patching kj :(

it's just a guess, we do have to investigate further. async nar reading *cannot* be made cancel-safe in general because that must rely on destructors, and destructors cannot be async. Source-based reading can be made cancel-safe (though it largely wasn't and would've aborted when canceled). instrumenting cancels is likewise impossible without touching every exit point of every function being instrumented (since destructors cannot distinguish cancellation from normal exits) or patching kj :(
Owner

Hmmm. There's different types of cancellation also, like, if we made it a crash if it's cancelled without the connection going down it might prevent this type of bug by construction perhaps? But also I suspect we're going to be stuck patching kj about stuff like tracing anyway so it's not an impossible option.

Hmmm. There's different types of cancellation also, like, if we made it a crash if it's cancelled without the connection going down it might prevent this type of bug by construction perhaps? But also I suspect we're going to be stuck patching kj about stuff like tracing anyway so it's not an impossible option.
Owner

that doesn't sound like a good idea, handling ^C properly in an async runtime requires some amount of cancellation. the best we could do is mark data sources as tainted all the way down and crash if they've been used without maintaining all invariants, which we can luckily do since we don't use kj async streams

that doesn't sound like a good idea, handling ^C properly in an async runtime *requires* some amount of cancellation. the best we could do is mark data sources as tainted all the way down and crash if they've been used without maintaining all invariants, which we can luckily do since we don't use kj async streams
Owner

@olivia wrote in #732 (comment):

Definitely! I did get an strace yesterday, but most of the actual data got cut off. These are the read calls right before the error:

read(6, "TLSR\0\0\0\0\1\0\0\0t\363\5\0i\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0"..., 32768) = 320
read(6, "POTS\0\0\0\0\4\0\0\0t\363\5\0", 32768) = 16
read(6, "POTS\0\0\0\0\2\0\0\0t\363\5\0POTS\0\0\0\0\1\0\0\0t\363\5\0"..., 32768) = 48
read(6, "ptxc\0\0\0\0\5\0\0\0\0\0\0\0Error\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 224

It looks like that last one starts with STDERR_ERROR, but the protocol error almost certainly happened somewhere later in the buffer.

are remote builders involved here? there's a possibility that the error you're correctly pointing to is the daemon reporting a remote store error it received from another daemon, such as a remote builder

@olivia wrote in https://git.lix.systems/lix-project/lix/issues/732#issuecomment-9443: > Definitely! I did get an strace yesterday, but most of the actual data got cut off. These are the read calls right before the error: > > ```text > read(6, "TLSR\0\0\0\0\1\0\0\0t\363\5\0i\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0"..., 32768) = 320 > read(6, "POTS\0\0\0\0\4\0\0\0t\363\5\0", 32768) = 16 > read(6, "POTS\0\0\0\0\2\0\0\0t\363\5\0POTS\0\0\0\0\1\0\0\0t\363\5\0"..., 32768) = 48 > read(6, "ptxc\0\0\0\0\5\0\0\0\0\0\0\0Error\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 224 > ``` > > It looks like that last one starts with `STDERR_ERROR`, but the protocol error almost certainly happened somewhere later in the buffer. are remote builders involved here? there's a possibility that the error you're correctly pointing to is the daemon reporting a remote store error *it* received from *another* daemon, such as a remote builder
Owner

also, which daemon version did this error happen with? (we really should start adding a lot more debug info to error reports)

also, which daemon version did this error happen with? (we really should start adding a lot more debug info to error reports)
Author
Member

@pennae wrote in #732 (comment):

...

are remote builders involved here? there's a possibility that the error you're correctly pointing to is the daemon reporting a remote store error it received from another daemon, such as a remote builder

Oh, yeah that's definitely a possibility. I have a remote builder configured, but don't know if it was used when I got that specific error. The local daemon should be 2.93.0-dev-pre20250311-85a140a (same as the client), but the remote one may have been older.

@pennae wrote in https://git.lix.systems/lix-project/lix/issues/732#issuecomment-9485: > ... > > are remote builders involved here? there's a possibility that the error you're correctly pointing to is the daemon reporting a remote store error _it_ received from _another_ daemon, such as a remote builder Oh, yeah that's definitely a possibility. I have a remote builder configured, but don't know if it was used when I got that specific error. The local daemon should be 2.93.0-dev-pre20250311-85a140a (same as the client), but the remote one may have been older.
Owner

@olivia wrote in #732 (comment):

Oh, yeah that's definitely a possibility. I have a remote builder configured, but don't know if it was used when I got that specific error. The local daemon should be 2.93.0-dev-pre20250311-85a140a (same as the client), but the remote one may have been older.

the remote builder version and the full command line and output of the failing lix invocations would be very helpful info. at this point the only thing we can think of that may have caused this error is a bug in the nar code that has been fixed about a week ago

@olivia wrote in https://git.lix.systems/lix-project/lix/issues/732#issuecomment-9494: > Oh, yeah that's definitely a possibility. I have a remote builder configured, but don't know if it was used when I got that specific error. The local daemon should be 2.93.0-dev-pre20250311-85a140a (same as the client), but the remote one may have been older. the remote builder version and the full command line and output of the failing lix invocations would be very helpful info. at this point the only thing we can think of that may have caused this error is a bug in the nar code that has been fixed about a week ago
Owner

here's how this could fail:

  • local daemon starts a remote build over ssh
  • remote build completes
  • local daemon starts copying result using copyPaths from build-remote.cc
  • copyPaths calls RemoteStore::narFromPath a bunch on the ssh connection. this spawns many threads that all want to use the single ssh connection to the remote store
  • one nar copy fails in copyNAR without having read the entire thing, where the next thing it'd read is \1\0\0\0\0\0\0\0)\0\0\0\0\0\0\0. this copy throws an exception. crucially the connection is not invalidated, which is in itself outlandishly sketchy
  • another nar copy unblocks immediately, the local daemon sends commands, tries to receive the results, and immediately gets the 1 (LE) of the next nar tag that was never read. it then crashes
here's how this could fail: - local daemon starts a remote build over ssh - remote build completes - local daemon starts copying result using copyPaths from build-remote.cc - copyPaths calls RemoteStore::narFromPath a bunch on the ssh connection. this spawns many threads that all want to use the single ssh connection to the remote store - one nar copy fails in copyNAR without having read the entire thing, where the next thing it'd read is `\1\0\0\0\0\0\0\0)\0\0\0\0\0\0\0`. this copy throws an exception. crucially the connection *is not invalidated*, which is in itself outlandishly sketchy - another nar copy unblocks immediately, the local daemon sends commands, tries to receive the results, and immediately gets the `1` (LE) of the next nar tag that was never read. it then crashes
Owner

Hmmm. Could this happen if you serialize a path with a case hack in it from a store with case hack disabled, causing the remote to fail the nar parsing? What other causes are there for copyNAR to fail?

Hmmm. Could this happen if you serialize a path with a case hack in it from a store with case hack disabled, causing the remote to fail the nar parsing? What other causes are there for copyNAR to fail?
Author
Member

Case hack is darwin-only right? The possible remote builder in my case is x86_64-linux.

Case hack is darwin-only right? The possible remote builder in my case is x86_64-linux.
Owner

@jade wrote in #732 (comment):

Hmmm. Could this happen if you serialize a path with a case hack in it from a store with case hack disabled, causing the remote to fail the nar parsing? What other causes are there for copyNAR to fail?

it could happen any time the nar parser exits prematurely outside of a framed sink/source pair, in such a way that the connection is reused. that pretty much requires running in a thread pool since otherwise the exception will bubble upwards, which means most likely either a flavor of copyPaths or queryMissing. queryMissing doesn't involve the nar parser though

this is all extremely cursed and made no easier by the fact that the wire protocol has no coherent design :(

@jade wrote in https://git.lix.systems/lix-project/lix/issues/732#issuecomment-9519: > Hmmm. Could this happen if you serialize a path with a case hack in it from a store with case hack disabled, causing the remote to fail the nar parsing? What other causes are there for copyNAR to fail? it could happen any time the nar parser exits prematurely outside of a framed sink/source pair, in such a way that the connection is reused. that pretty much requires running in a thread pool since otherwise the exception will bubble upwards, which means most likely either a flavor of copyPaths or queryMissing. queryMissing doesn't involve the nar parser though this is all extremely cursed and made no easier by the fact that the wire protocol has no coherent design :(
Author
Member

Got this again, and was able to grab debug logs and an rr recording. It is failing on a remote build, where the local nix client/daemon and remote nix client/daemon are all 2.93.0-dev-pre20250311-85a140a. The command that triggered it was nix develop nixpkgs#python3Packages.feedparser, against nixpkgs 85f1ba3e51676fa8cc604a3d863d729026a6b8eb. Running the command a second time succeeds, but I can reliably trigger a failure with nix store delete /nix/store/fppv85zxdlscqwflx6cfgbm93168dbl6-python3.12-feedparser-6.0.11-env.drv and then nix develop ....

The rr recordings are a combined couple hundred MB, and presumably not useful without my lix binary, so I didn't want to upload them here without checking first. I can put them on my own server if that would be better.

edit: oof, my lix client build also doesn't have debug symbols. Gonna try rebuilding it and hope I can still get the error.

Got this again, and was able to grab debug logs and an rr recording. It is failing on a remote build, where the local nix client/daemon and remote nix client/daemon are all `2.93.0-dev-pre20250311-85a140a`. The command that triggered it was `nix develop nixpkgs#python3Packages.feedparser`, against nixpkgs `85f1ba3e51676fa8cc604a3d863d729026a6b8eb`. Running the command a second time succeeds, but I can reliably trigger a failure with `nix store delete /nix/store/fppv85zxdlscqwflx6cfgbm93168dbl6-python3.12-feedparser-6.0.11-env.drv` and then `nix develop ...`. The rr recordings are a combined couple hundred MB, and presumably not useful without my lix binary, so I didn't want to upload them here without checking first. I can put them on my own server if that would be better. edit: oof, my lix client build also doesn't have debug symbols. Gonna try rebuilding it and hope I can still get the error.
Author
Member

After switching to lix with debug symbols, I'm no longer able to reproduce the crash with feedparser. Likely this is either because it restarted nix-daemon, or because it built some other derivations first which may have put things in a different state.

After switching to lix with debug symbols, I'm no longer able to reproduce the crash with feedparser. Likely this is either because it restarted nix-daemon, or because it built some other derivations first which may have put things in a different state.
Owner

awesome! we can reproduce this reliably now (requiring garbage collection between attempts, but that's fine). ssh and ssh-ng remotes get different errors; ssh-ng gets your error while ssh gets "input does not look like a nix archive". so there is definitely nar parser fuckery going on

awesome! we can reproduce this reliably now (requiring garbage collection between attempts, but that's fine). ssh and ssh-ng remotes get different errors; ssh-ng gets your error while ssh gets "input does not look like a nix archive". so there is definitely nar parser fuckery going on
Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/2857 ("libutil: fix copyNAR not reading the whole nar when dropped early")
<!-- GERRIT_LINKBOT: {"cls": [{"backlink": "https://gerrit.lix.systems/c/lix/+/2857", "number": 2857, "kind": "commit message"}], "cl_meta": {"2857": {"change_title": "libutil: fix copyNAR not reading the whole nar when dropped early"}}} --> This issue was mentioned on Gerrit on the following CLs: * commit message in [cl/2857](https://gerrit.lix.systems/c/lix/+/2857) ("libutil: fix copyNAR not reading the whole nar when dropped early")
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lix-project/lix#732
No description provided.