nix offline detection is applying to --refresh when that should be fatal, breaking system.autoUpgrade #286

Open
opened 2024-05-09 04:53:25 +00:00 by dcarosone · 1 comment
Member

Describe the bug

The autoUpgrade service doesn't fail when steps within the process have errors. I first suspected nixos-rebuild of swallowing them, but it now seems that nix itself is not exiting with an error visible to the caller.

As well as simply not doing the intended job of upgrading, this can actually cause the configuration to go backwards.

Steps To Reproduce

  1. enable the service on a laptop using wifi, with a persistent timer (the default)
  2. suspend the machine, and resume the following morning after the scheduled timer expires (04:40 default)
  3. the service can start immediately, before network connectivity is available. It has a dependency on network-online.target but this is not meaningful after a resume, unfortunately.
  4. the upgrade has no network, and so does not fetch channel updates or update the specified flake, but this doesn't generate an error that systemd sees.
  5. Even if the --refresh argument is given with a flake, it will use the previously-cached fetch from the last run, which should be considered stale and invalid. The build proceeds anyway.
  6. If the system had been manually updated (from a more recent checkout in /etc/nixos/flake.nix for example), the autoupgrade service will build and switch to the older revision, effectively rolling back unexpectedly.

Expected behavior

Failures during an upgrade, such as lack of network connectivity, should be treated as errors for the rebuild and cause the service to fail (so that it can then optionally be configured to retry after a delay).

In particular, at step 5, the --refresh argument should consider cached copies of the flake source as invalid (as documented) and refuse to use them. The errors in the log, reported as "fatal", should therefore be fatal.

Screenshots

In the log below, wifi was disabled. The autoUpgrade service is configured with a git+ssh:// flake repo.

Without --refresh in the options list, the ssh errors don't appear, presumably because the 'network-dependent features' have been disabled. With --refresh they're tried anyway but the errors are ignored.

Dec 14 13:51:34 rocinante systemd[1]: Starting NixOS Upgrade...
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: warning: you don't have Internet access; disabling some network-dependent features
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: [4B blob data]
Dec 14 13:51:34 rocinante nixos-upgrade-start[98051]: ssh: connect to host soft-serve port 23231: Network is unreachable
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: fatal: Could not read from remote repository.
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: Please make sure you have the correct access rights
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: and the repository exists.
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: [148B blob data]
Dec 14 13:51:35 rocinante nixos-upgrade-start[98045]: building the system configuration...
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: warning: you don't have Internet access; disabling some network-dependent features
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: [4B blob data]
Dec 14 13:51:35 rocinante nixos-upgrade-start[98062]: ssh: connect to host soft-serve port 23231: Network is unreachable
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: fatal: Could not read from remote repository.
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: Please make sure you have the correct access rights
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: and the repository exists.
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: [148B blob data]
Dec 14 13:51:38 rocinante nixos-upgrade-start[98078]: updating GRUB 2 menu...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: NOT restarting the following changed units: nixos-upgrade.service
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: activating the configuration...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] creating new generation in /run/agenix.d/8
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] decrypting secrets...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: decrypting '/nix/store/hz41qqz5x88yk1jlwsj3shbqx74w904n-nm-geek-env.age' to '/run/agenix.d/8/nm-geek-env'...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] symlinking new secrets to /run/agenix (generation 8)...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] removing old secrets (generation 7)...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] chowning...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: setting up /etc...
Dec 14 13:51:40 rocinante nixos-upgrade-start[98078]: reloading user units for dan...
Dec 14 13:51:41 rocinante nixos-upgrade-start[98078]: setting up tmpfiles
Dec 14 13:51:42 rocinante systemd[1]: nixos-upgrade.service: Deactivated successfully.
Dec 14 13:51:42 rocinante systemd[1]: Finished NixOS Upgrade.
Dec 14 13:51:42 rocinante systemd[1]: nixos-upgrade.service: Consumed 2.772s CPU time, no IP traffic.

Speculation

After pondering on this for a while, I'm becoming more convinced that the issue is nix itself:

  • behaving as if --offline had been passed explicitly, based on some auto-detection of connectivity
  • having this take precedence over --refresh, which was passed explicitly
  • ignoring "fatal" errors in updates and proceeding anyway

Additional context

Full config, including workaround using a preStart job that will fail in a way systemd can see, and another to prevent rollback of 'dirty' changes when hacking:

{ inputs, ... }: {
  system.autoUpgrade = {
    enable = ((inputs.self.rev or "dirty") != "dirty");
    flake = "git+ssh://soft-serve:23231/geek/nixos?ref=flake";
    flags = [ "--refresh" ];
    randomizedDelaySec = "45m";
  };
  systemd.services.nixos-upgrade = {
    preStart = "ssh soft-serve -p 23231 info";
    startLimitIntervalSec = 120;
    startLimitBurst = 6;
    serviceConfig = {
      Restart = "on-failure";
      RestartSec = "20";
      CPUSchedulingPolicy = "idle";
      IOSchedulingClass = "idle";
    };
  };
}
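
The preStart probe in the config above can be given a few bounded retries of its own, so that a slow wifi reconnect after resume does not immediately count as a failed start. A minimal sketch; `require_online` is a hypothetical helper name, and the attempt count and delay are arbitrary:

```shell
# Hypothetical pre-flight check: retry a probe command a few times and
# return non-zero if it never succeeds, so that systemd sees the failure.
# Usage: require_online ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
require_online() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "require_online: probe failed after $attempts attempts: $*" >&2
  return 1
}
```

With the probe already used in preStart, a call would look like `require_online 3 5 ssh soft-serve -p 23231 info`. This only complements the Restart/startLimit settings; it does not fix the underlying behaviour in nix.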

It also seems to rebuild and switch when there's full network connectivity but no new revisions are fetched, regardless of whether this is because (without --refresh) the content is still within TTL, or simply no new revisions are found on the git repo. I don't think this is necessary.

It might be helpful to have an option that is the inverse of the --offline behaviour that seems to be auto-detected, something like --require-online, so that nix can bail out directly from this autodetection before even getting to the other steps. But it should still bail on those other errors, and on the failure to update with --refresh, and it should most definitely not roll back by building and switching to a stale revision.
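
The "never roll back" requirement can be phrased as a fast-forward check on git revisions. A minimal sketch, assuming a local clone of the flake repository is available; `is_fast_forward` and its arguments are hypothetical, and only `git rev-parse` and `git merge-base --is-ancestor` are real commands here:

```shell
# Hypothetical guard: succeed only when CANDIDATE is a strict descendant of
# CURRENT in the git repository at REPO, i.e. switching is a fast-forward.
# Usage: is_fast_forward REPO CURRENT CANDIDATE
is_fast_forward() {
  repo=$1
  # Resolve both arguments to full commit hashes; status 2 means "unknown rev".
  current=$(git -C "$repo" rev-parse --verify --quiet "$2^{commit}") || return 2
  candidate=$(git -C "$repo" rev-parse --verify --quiet "$3^{commit}") || return 2
  # Same revision: nothing new to build or switch to.
  [ "$current" != "$candidate" ] || return 1
  # Exit status 0 only if CURRENT is an ancestor of CANDIDATE.
  git -C "$repo" merge-base --is-ancestor "$current" "$candidate"
}
```

A wrapper would switch only when this returns 0. The two revisions have to come from elsewhere, possibly `nixos-version --configuration-revision` (only populated if `system.configurationRevision` is set) and the `revision` field of `nix flake metadata --json`; wiring those in is left out of this sketch.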

originally at https://github.com/NixOS/nixpkgs/issues/274146

dcarosone added the bug label 2024-05-09 04:53:25 +00:00
jade changed title from nix offline detection is masking errors, breaking system.autoUpgrade to nix offline detection is applying to `--refresh` when that should be fatal, breaking system.autoUpgrade 2024-05-09 05:00:42 +00:00
jade added the Area/flakes label 2024-05-09 05:01:23 +00:00
Author
Member

Smaller repro, invoking nixos-rebuild build --flake git+ssh://soft-serve:23231/geek/nixos?ref=flake --refresh 3 times:

  1. just to ensure everything is cached properly at the current revision
  2. to see that nothing happens once cached
  3. after taking the soft-serve repo offline; note that it continues despite the update failure and the use of --refresh

This also confirms the issue still persists in lix as of now.


❯ nixos-rebuild build --flake git+ssh://soft-serve:23231/geek/nixos?ref=flake --refresh
building the system configuration...
trace: warning: The option `services.xserver.displayManager.defaultSession' defined in `/nix/store/z6wv0a36lsxj6w3fg4cxh0f7xmv2ikfa-source/common/general-desktop.nix' has been renamed to `services.displayManager.defaultSession'.
trace: warning: `overrideScope'` (from `lib.makeScope`) has been renamed to `overrideScope`.

❯ nixos-rebuild build --flake git+ssh://soft-serve:23231/geek/nixos?ref=flake --refresh --verbose
$ cat /proc/sys/kernel/hostname
building the system configuration...
Building in flake mode.
$ nix --extra-experimental-features nix-command flakes build git+ssh://soft-serve:23231/geek/nixos?ref=flake#nixosConfigurations."oenone".config.system.build.toplevel --refresh --verbose

❯ nixos-rebuild build --flake git+ssh://soft-serve:23231/geek/nixos?ref=flake --refresh --verbose
$ cat /proc/sys/kernel/hostname
building the system configuration...
Building in flake mode.
$ nix --extra-experimental-features nix-command flakes build git+ssh://soft-serve:23231/geek/nixos?ref=flake#nixosConfigurations."oenone".config.system.build.toplevel --refresh --verbose
ssh: connect to host soft-serve port 23231: No route to host
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
warning: could not update local clone of Git repository 'ssh://soft-serve:23231/geek/nixos'; continuing with the most recent version
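
The third run exits successfully even though nix printed the "could not update local clone" warning. As a stopgap, a wrapper can turn that warning into a failure. A minimal sketch, assuming the warning text stays exactly as in the output above; `run_strict` and `STALE_PATTERN` are hypothetical names, not part of nix or nixos-rebuild:

```shell
# Hypothetical wrapper: run a command and fail if it printed the stale-cache
# warning on stderr, even when the command itself exited 0.
STALE_PATTERN='could not update local clone of Git repository'

run_strict() {
  log=$(mktemp) || return 1
  # Capture stderr to a file so it can be inspected afterwards.
  "$@" 2>"$log"
  status=$?
  # Replay the captured stderr so it still reaches the terminal or journal.
  cat "$log" >&2
  if grep -qF "$STALE_PATTERN" "$log"; then
    rm -f "$log"
    echo "run_strict: refusing to accept a stale flake source" >&2
    return 1
  fi
  rm -f "$log"
  # Otherwise pass the command's own exit status through unchanged.
  return "$status"
}
```

Wrapping the build step, as in `run_strict nixos-rebuild build --flake <flake> --refresh`, should then exit non-zero in the third case. The check only runs after the command has finished, so wrapping `switch` directly would report the failure but not prevent the switch.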
Reference: lix-project/lix#286