Remote builds are not stopped on C-c #540

Closed
opened 2024-10-07 16:22:17 +00:00 by ma27 · 6 comments
Member

Describe the bug

When a remote build is stopped, the SSH process and the other build helper processes are not terminated. As a result, there's a "waiting for lock" message when restarting the build.

Steps To Reproduce

  1. Modify pkgs.hello in nixpkgs slightly (e.g. by adding a custom postPatch attribute)
  2. Have remote builders using ssh-ng configured
  3. Run nix-build -A hello -j0 in your nixpkgs checkout
  4. When the build has started, stop it via C-c.
  5. Re-run nix-build -A hello -j0. Now you should get a message like this:
     this derivation will be built:
       /nix/store/0isisf74rz3ari3lf33a22vvifqh9s9k-hello-2.12.1.drv
     waiting for lock on '/nix/store/l4ah713yj4n4q81vgaas82ydfj5blrh2-hello-2.12.1'...

systemctl status nix-daemon shows me that there are indeed leftover processes:

     CGroup: /system.slice/nix-daemon.service
             ├─1440371 nix-daemon --daemon
             ├─1443196 nix-daemon 1443194
             ├─1443753 nix __build-remote 3
             └─1443755 ssh roflmayr -x -oPermitLocalCommand=yes "-oLocalCommand=echo started" "nix-daemon --stdio"
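
To check for such leftovers without reading the tree by hand, the cgroup listing can be filtered for the two helper processes seen above. This is only a rough sketch: the function name `leftover_build_procs` is made up for illustration, and it assumes the listing has the same `PID command` shape as the output in this report.

```shell
# Hypothetical helper (not part of Lix): read a `systemctl status nix-daemon`
# listing on stdin and print the PIDs of remote-build helper processes.
# Assumes each process line looks like "<tree chars><PID> <command line>".
leftover_build_procs() {
  # Drop everything before the PID (indentation and tree-drawing characters),
  # keeping only lines of the form "<PID> <command line>".
  sed -n 's/^[^0-9]*\([0-9][0-9]*\) \(.*\)$/\1 \2/p' |
    # Keep the build hook and the ssh connection to the remote daemon.
    awk '$2 == "nix" && $3 == "__build-remote" { print $1; next }
         $2 == "ssh" && /nix-daemon --stdio/    { print $1 }'
}
```

Used as `systemctl status nix-daemon --no-pager --full | leftover_build_procs`, it should print one PID per line if helpers from an interrupted build are still around, and nothing otherwise.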

Expected behavior

I would expect all processes related to this build to be killed and to not wait on a stale lock when running nix-build -A hello -j0 again.

nix --version output

nix (Lix, like Nix) 2.92.0-devpre20241005_ed9b7f4 (both locally and on the remote builder)

Additional context

I've seen this a few years ago in CppNix, but I could've sworn it got fixed eventually. I may be wrong about this, though.

ma27 added the bug label 2024-10-07 16:22:17 +00:00
Owner

> I've seen this a few years ago in CppNix, but I could've sworn it got fixed eventually. I may be wrong about this, though.

in our experience remote builds usually end by crashing the remote daemon when ssh exits due to tcp timeouts. :/ we want to fix this properly with the new protocols, but that may take a bit yet

Owner

i suspect this is possibly a regression against 2.91 or 2.90 but we should verify that.

Author
Member

Yep, cannot reproduce with 2.91.
Guess I'll do a bisect then.

Author
Member

bf32085d63 is the first commit I can reproduce the behavior with, the previous one is fine.

I haven't dug deep enough into the libstore changes so far, so I can't really say much about the why, I'm afraid.

Owner

> bf32085d63 is the first commit I can reproduce the behavior with, the previous one is fine.

(sob quietly)

Member

This issue was mentioned on Gerrit on the following CLs:

  • comment in cl/2060 ("worker: respect C-c on sudo nix-build")
  • commit message in cl/2060 ("worker: respect C-c on sudo nix-build")
Reference: lix-project/lix#540