Protocol mismatch when copying to remote host running Nix 2.24 #644

Closed
opened 2025-01-23 20:03:35 +00:00 by flokli · 7 comments

Describe the bug

I'm running Lix on my laptop, and deploy to other hosts using Colmena.

Today, I got the following error when trying to apply:

error: 'nix-store --serve' protocol mismatch from 'root@...' got 'started        ��RT'

The remote side uses Nix 2.24.11, the laptop Lix 2.91.1. It looks like nix-copy-closure starts a nix-store --serve on the remote side, but stumbles hard on the protocol being spoken there.
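One way to read the garbled bytes in the error above, as a minimal sketch: it assumes the serve protocol sends integers as 8 little-endian bytes and that the server's greeting magic is `0x5452eecb`. Neither value is stated in this report, and the names below are hypothetical. Under those assumptions, a stray `started` line arriving ahead of the greeting would render much like the message shown.

```python
import struct

# Assumption: server-side magic of the 'nix-store --serve' protocol.
# This value is not given anywhere in the report.
ASSUMED_SERVE_MAGIC_2 = 0x5452EECB


def wire_u64(value: int) -> bytes:
    """Encode an integer as 8 little-endian bytes (assumed wire format)."""
    return struct.pack("<Q", value)


def render_like_terminal(raw: bytes) -> str:
    """Decode bytes the way a UTF-8 terminal would, replacing invalid bytes."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical byte stream: a stray "started\n" line (8 bytes) followed by
# the greeting the client was actually waiting for.
stream = b"started\n" + wire_u64(ASSUMED_SERVE_MAGIC_2)

# The client would read the first 8 bytes as the magic and find text instead.
first_word = struct.unpack("<Q", stream[:8])[0]
print(first_word == ASSUMED_SERVE_MAGIC_2)  # False

# Two undecodable bytes followed by "RT", similar to the error message.
print(repr(render_like_terminal(stream)))
```

If this reading is right, the remote end spoke the expected protocol and the stream was merely prefixed with unexpected output, which would point at the transport rather than at a version incompatibility.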

Running Colmena in an environment with Nix 2.24.11 in $PATH gets the copy to succeed.

Expected behavior

I'd expect nix-copy-closure etc to work, even when talking to NixOS systems running Nix 2.24.11.

nix --version output

On the laptop:
nix (Lix, like Nix) 2.91.1

On the remote side:
nix (Nix) 2.24.11

cc @raito as requested.

Owner

this is almost certainly a CppNix bug, because I really doubt we touched that code; can you try reproducing it on 2.18 on the client side? cc @roberth

Author

I switched the machine to run Lix by temporarily shelling in Nix 2.24 client-side, and even after switching back and forth I wasn't able to trigger it anymore. Not sure why. In case no one else is able to reproduce this or has an idea what's going on, feel free to close this.

Author

I saw this today again, with both client and server using nix (Lix, like Nix) 2.91.1.

This time, it was a colmena apply to another host (with deployment.buildOnTarget = true; set):

[INFO ] Enumerating nodes...
[INFO ] Selected 1 out of 11 hosts.
m2air | Evaluating m2air
m2air | Evaluated m2air
m2air | Building m2air
m2air | mux_client_request_session: read from master failed: Broken pipe
m2air | error: 'nix-store --serve' protocol mismatch from 'root@m2air', got 'started
m2air |        ��RT'
m2air | Build failed: Child process exited with error code: 1
      | Failed: Child process exited with error code: 1
[ERROR] Failed to build m2air - Last 6 lines of logs:
[ERROR]  created)
[ERROR]    state) Running
[ERROR]   stderr) mux_client_request_session: read from master failed: Broken pipe
[ERROR]   stderr) error: 'nix-store --serve' protocol mismatch from 'root@m2air', got 'started
[ERROR]   stderr)        ��RT'
[ERROR]  failure) Child process exited with error code: 1
[ERROR] Failed to complete requested operation - Last 1 lines of logs:
[ERROR]  failure) Child process exited with error code: 1
[ERROR] -----
[ERROR] Operation failed with error: Child process exited with error code: 1
Hint: Backtrace available - Use `RUST_BACKTRACE=1` environment variable to display a backtrace
Owner

try turning off ssh connection multiplexing, that's known to be busted in weird and wonderful ways
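A minimal sketch of what turning multiplexing off can look like for a single ssh invocation. It uses the OpenSSH options `ControlMaster=no` and `ControlPath=none`; the helper name and the example command are hypothetical, and whether Colmena or Lix lets you inject these options this way is an assumption.

```python
import shlex
from typing import List


def ssh_argv_without_mux(destination: str, remote_command: List[str]) -> List[str]:
    """Build an ssh command line that neither creates nor reuses a control socket."""
    return [
        "ssh",
        "-o", "ControlMaster=no",   # do not become a multiplexing master
        "-o", "ControlPath=none",   # do not attach to an existing master either
        destination,
        *remote_command,
    ]


# Hypothetical usage: print the command instead of running it.
argv = ssh_argv_without_mux("root@m2air", ["nix-store", "--version"])
print(" ".join(shlex.quote(part) for part in argv))
```

`ControlPath=none` is the option that matters when another process already holds a master connection, because `ControlMaster=no` alone still allows an existing socket to be reused.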

Author

Wouldn't it be a good idea for Lix to automatically set ControlMaster=no whenever we specify a connection over ssh?

Owner

we could do that, but the muxing code is sufficiently broken that we should remove lix-directed muxing entirely instead. the error you're seeing here is not caused by lix itself using muxing, but by another process on the same system using muxing. the only reasonable way forward seems to be ripping out our mux handling entirely and becoming mux-agnostic

Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/3005 (https://gerrit.lix.systems/c/lix/+/3005) ("libstore: stop using ssh connection sharing")
Reference: lix-project/lix#644