hydra-queue-runner: fix bad optional access #33
BuildProduct uses optionals, and the code here reads from optionals, so there is no reason to unwrap them when they can be empty.
This seems to happen when using a combination of recursive-nix with hydra's build inputs to pull in the .drv of an aggregate build.
I have two jobs which run diffs from one aggregate to another (diffing main vs. staging/next/head); one for nix-diff and one for nvd, but this only happens with nix-diff, and not exactly consistently (presumably because sometimes it has references in the database already) but often enough that I think the issue lies there specifically.
The log outputs without this patch:
After applying the patch:
In both cases the store path that hydra is trying to build is already present on the machine, hydra seemingly fails to add it to the database though.
Note that this has only been tested with lix-flavoured hydra, YMMV running with CppNix.
Either way the change seems sound: it assigns an optional to an optional, rather than unwrapping the optional without guarding against failure only to put the value back into an optional.
(the above is the commit message, below are details that I didn't want to include there)
The failing build had several "Aborted" steps on my hydra instance, with no log output at all.
The inputs are listed here because I accidentally wiped my S3 bucket three times this week while tinkering with Ceph, so I'm not sure it'll be available forever (the git repo most likely will be, though):
{ src = "https://git.shell.bsocat.net/infra?rev=dac7215ace50e78d40d0ba72cbf57a5ca62a2f4b"; buildTarget = "hydraJobs"; }
{ src = "https://git.shell.bsocat.net/infra?rev=d37fd63e618bce140c5ef408a924d4733456e8bc"; buildTarget = "hydraJobs"; }
https://git.shell.bsocat.net/infra?rev=d37fd63e618bce140c5ef408a924d4733456e8bc
https://git.shell.bsocat.net/nixpkgs?rev=f0946fa5f1fb876a9dc2e1850d9d3a4e3f914092
<src/hydra/diff.nix>
Given that this uses build inputs – of aggregate builds no less – please don't ask me how to reproduce the builds outside of hydra, or how hydra manages to get unset hashes.
If memory serves the filesize was actually set, just the hash was missing, however both are optionals anyway.
Changed title from "hydra-queue-runner: fix bad optional access" to "WIP: hydra-queue-runner: fix bad optional access"

Ugh, never mind the WIP there. I just checked the files, and the ones I happened to check were all empty; I thought that they were at least lacking the header for the different derivations being diffed, even if the content is the same.
Yeah… no. That diff was running against the exact same derivation twice, so that was legit.
I gotta go catch some sleep before looking at this PR or the code again >.<
Changed title from "WIP: hydra-queue-runner: fix bad optional access" to "hydra-queue-runner: fix bad optional access"

I'm afraid I don't follow. What is meant here specifically?
If this only depends on the DB state and not on timing, do you think it would be possible to write a small regression test for this? The Hydra testsuite is the thing that prevented me from doing stupid things quite a few times already, so having more edge-cases in there is something I'd be very interested in.
Fair enough, it is a monstrosity of a build ;-)
Basically the output of my repo yields several images (kexec, iso, etc.), some docs (nixpkgs, nixos, but also custom for ceph clusters), and the actual toplevel outputs for my machines.
Since I want to get a diff of the machines between main branch and other branches ahead of main (staging, but also one that pulls in latest nixpkgs, and one that pulls in latest versions of all dependency repos, such as lix), I need to somehow run nix-diff in hydra.
Since nix-diff runs nix in the background for querying the store this requires recursive-nix.
However recursive-nix is still "pure" in that one has to provide it with all inputs that it should be able to access.
This works fine with the "eval" input type of hydra for nvd out of the box; however, nix-diff has to diff the .drv files themselves (the derivations, not the outputs), and "eval" only yields the output.
So to depend on the derivations of a different build I have to use the "build" input specifically ("sysbuild" does not work) since it sets the drvPath attribute which in turn pulls in the entire derivation tree as a dependency when depended on.
This type of input however needs to be specified per output, meaning I don't just depend on "build of main branch", I have to depend on "output of build for my laptop of main branch".
I cannot sensibly list all the outputs as inputs since those may change between main and the other branch and I do want to see a missing or added node in a diff.
Which means I have to use an aggregate build with all nodes as constituents, because an aggregate build has magic which makes the output meaningless, but still pulls in all the constituents as dependencies.
Since the aggregate job for my nodes has all the node derivations as dependencies for its own derivation (but not output) I can pull in the aggregate job of each branch as the "build" input in a hydra job, which then uses recursive-nix to nix-diff the two aggregates.
This is the monstrosity that I have built, and it works fine for nvd. However, the nix-diff builds often just abort and are orphaned (don't ask me how that happens), because one fetch from the binary cache for an output that is already present on the hydra instance fails, presumably because hydra does not have the output in its database but has built it somewhere along the way already. Basically there are five builds: nvd and nix-diff as text output with ANSI colours each, then each gets a pass of ansi2html, for a total of four builds, and then a fifth one which simply symlinks the two HTML ones so I get a single build with two outputs that I can bookmark in my browser.
And when the nix-diff txt version fails in some sort of fetch phase for the NARs, it is treated as a transient error, with regular retries; the html version, despite having already finished building, never passes because it depends on the prior build, and always remains "building".
So I get passed builds for nvd-txt and nvd-html, nix-diff-txt continues to have occasional retried build steps that abort, and nix-diff-html just sits there waiting for nix-diff-txt despite the output already existing, same for the summarized diff.
Honestly I have no idea how to even vaguely reproduce this in terms of database state.
The thing is, the build gets picked up as usual by the queue-runner, but since all five of the builds contain a "hydra-build-products" file, hydra will try to parse parts of the content, and for some reason that I don't understand the data it has at that point does not have a hash (the exact line that the PR changes).
I don't know how this can happen, and I don't know why it needs the hash; I just threw gdb at it and stepped until I figured out where the "bad optional access" happened, looked at the struct it was assigned to, noticed that it was optional too, removed the `.value()`, tested it, and now my builds work.

However, I do think there is state that can be built to reproduce this on a smaller scale; I just don't know how to boil it down (or at least don't have the spoons to build a full minimal example of two constituent builds depending on one another, etc.).
Please note that hydra has trouble deleting a project if its jobsets have any "build" or "sysbuild" type dependencies, due to foreign key constraints, which is why I wouldn't want to test this on a production hydra instance either (I mean, it's not much trouble to set up a local test instance, but the work required adds up and… well… spoons).
So if anyone wants to take a shot at it, go ahead, I'm available for questions, but for the time being I can only answer the question of whether "I think it would be possible" and that's "very likely".
Thanks a lot for the explanation and the context.
The reason why I'm a little hesitant is that I have a theory why fileSize/sha256hash are optional: it's probably a poor way to denote that a build product is a directory (I'm pretty sure that I've seen that already). But keep in mind, this is my interpretation of the code; feel free to disagree. The `getBuildOutput` function also seems to confirm that: at least it looks like it's also assuming that the optional has a value for `fileType == regular`. Your patch would essentially break that assumption.

To me, this sounds like something you hit regularly (or at least often)? If that's the case, I'd probably throw another Hydra into a VM and see if I can trigger this to understand what's going on.
In case you don't know already, Hydra actually has a quite useful local dev env: for me, it's essentially `nix develop`, `just build`, and then `foreman start`, and I have a running Hydra on my laptop.

@ma27 wrote in #33 (comment):
The code in question is specifically concerned with the hydra build products (i.e. that funny lil file saying `doc html /path/to/thing` which shows up in the hydra UI).

Since those are build artifacts (which you can click on to view/download), I don't think that pointing to a directory there is meaningful in any way (semantically), although it is technically possible (syntactically), as the code then just skips the surrounding conditional, and therefore the `fileType == regular` case, entirely.
And in fact those files on my end are .txt and .html files respectively, just magically missing a hash.
The passing nvd and failing nix-diff outputs are actual files, the combined build is using symlinks but the preceding builds fail already before it reaches that last one.
(as far as I understand those explicit build products, they point at "the file which should be shown in the browser" even if the surrounding directory is required to view, as is the case with CSS/JS in addition to HTML, and they are only relevant to hydra's UI, not the actual build)
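For reference, builds declare these products by writing lines of the rough form `type subtype path` into `$out/nix-support/hydra-build-products`; the failing builds here would contain something along these lines (paths are hypothetical placeholders):

```
file txt /nix/store/<hash>-nix-diff/diff.txt
doc html /nix/store/<hash>-nix-diff/diff.html
```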
I mean… the datatype itself is an optional, so I would expect all code using that optional to be aware that it is, in fact, optional, and to guard against absence or handle the optional (yeah, I know, bold assumption).
So with that in mind: if the data structure being filled in the lines this PR touches declares an optional, I don't see how passing on the optional would break assumptions, since all other processing code must already work under the assumption that the value is optional. If this change triggers any unwanted behaviour (which, as of today, it does not on my patched instance), that would be a bug in the code using the optional rather than in the code setting it, IMHO (I mean… type safety and all that).
Removing the patch will trigger the issue semi-consistently (read: most of the very specific builds hit the error, just not always apparently).
As noted below, no minimal reproducer for now, but if you don't mind running a large-ish eval and builds for a few things including some isos and tarballs, config is below.
If you fork the repo (since it depends on main branches) to a non-github host (since it needs the dumb HTTP transport for retrieving refs, which github turned off) and edit some references/links in the hydra nix files (hydra subdirectory), you could technically boil it down to a single machine without isos or tarballs; given that I could push a repo to this forgejo instance to fulfil that criterion, I might do that.
screenshot of project config
The rest of this reply is highly opinionated.
So given that those are actual files as noted above, and the source and destination data structures allows optionals I don't see this as something that should be unexpected for any related code, and if it is, then the bug is in the code that assumes the optional to have a value without checking, not in the code setting the permissible optional value.
And at the very least if it were intended behaviour to err out when no value is present, it shouldn't be a red bold "bad optional access" with no indication of where or why (read: if it were intentional it should be surrounded by guards that handle the error gracefully, whether fatal or not, by printing something meaningful instead).
This is the C++ equivalent of having two Rust structs with `Option<Value>` fields and assigning `s1.x = Some(s2.x.unwrap())`, which is an obvious anti-pattern, and in lieu of any indication that it should be impossible, I would opt for fixing the reproducible¹ error of a build failing when it could succeed otherwise.

IOW: the patch does not seem to break anything; it does, however, fix something, and if something were to break, then the fix should be implemented elsewhere in the code.
¹: I have stuff on my list that has higher priority at the moment, so I can't provide a minimal reproducer, and I'm fine with this issue going stale (I do have this patch in my overlay), but replying to issues has higher prio than working on things for me, so you'll have to rely on my word of mouth claim that it is reproducible for the time being, sorry about that
Just checked, this can be done to expose a directory containing e.g. files for a manual such as mdbook.
The `serveFile` function in the Build controller explicitly checks for that and serves an `index.html` in that case.

So the thing is, we don't have a disagreement on how it should be done at all! I fully agree that dereferencing the optional and putting it back into another optional is pretty bad.

The problem is essentially that I'm in "this is legacy code I don't fully understand and thus I'm kinda defensive" mode, and right now I don't have the time to verify that what we're doing is sensible (sensible meaning in this case that there's no code relying on the fact that `fileType == tRegular` implies that the optional is set). In other words, I'd like to avoid shifting the problem to just another place. Does that make a little more sense?

That being said, a different question: did this start to happen by any chance somewhere around the middle of March? I'm a little afraid that the extensive changes to the NAR reader (after the Lix store became fully async) are related. This is like the other thing: perhaps there's another bug hiding below the surface.
So, I think I should (soonish :3) give it a try to reproduce (I assume the configs you posted can be used easily? How many drvs do I need to build on my own btw?). Sooner or later, I'd merge this (or write some sensible error reporting that's not "bad optional access"), e.g. if I don't find anything. But for me there are too many open questions right now, sorry.
@ma27 wrote in #33 (comment):
Good to know (I kinda assumed one'd point at the index.html then), thanks for clearing that up ^^
Which is 100% okay, I'm not judging.
Usually it's me playing devil's advocate, so it's an interesting change of pace.
I only built the whole thing like two weeks ago so I can't tell.
The build ran fine once or twice (presumably because it had cached artifacts from my testing back then, at least that's my working theory) and when I thought the whole construct worked and did its job the build started bailing out in the queue-runner.
Okay, so… I've taken a stab at getting this reproduced, and I have a minimal reproducer, except it doesn't reproduce.
That of course means there's some additional factor that this test does not cover, so I guess my next move would be to run the full reproducer separately, potentially by just pulling the original repo into this test, but that's for another time.
full reproducer
I tried with just the setup I described before, which I've assembled (badly) using `pkgs.linkFarm` in a directory in /etc, since hydra is unable to unpack a tarball (it can download it but will never unpack it), so I couldn't just provide the repo as a tarball via nginx, and getting a git repo or such in there didn't seem worth the effort; besides, I wanted it to be a self-contained expression.
I then failed to reproduce the issue with that setup, and noticed that the output diverged in terms of fetching NARs, which made me set up s3proxy to get a bucket in the machine.
It will take a long time to copy the entire derivation tree in there, since it depends on the `drvPath` of the aggregates, which includes a lot of stuff like `/nix/store/gp6gh2jn0x7y7shdvvwxlza4r5bmh211-stdenv-linux.drv` (which has a lot of dependencies), despite not even using stdenv/runCommand (however, that still pulls in curl, bash, etc.).

I've set the memory limit to 8GiB because I ran into OOM issues before and I didn't want it to fail after like an hour of running.
Beware of the temporary disk image residing in /tmp which may be tmpfs.
The test is built such that the machine boots up, waits for hydra to be available, sets up the bucket, then creates the declarative project, waits for all of the jobs to finish (with an incredibly high timeout for the one that'll copy the .drv files), then update the repo to trigger a second build, and then wait until that one finishes.
If at any point "bad optional" is triggered the test will fail early.
This means that the test will fail early if any of the builds fail, or if the bug is triggered, and it will succeed if all builds pass without the bug triggering.
Due to The Horrors™ contained therein, this Nix Expression is best viewed with Netscape Navigator 4.0 or below on a Pentium 3±1 emulated in Javascript on an Apple IIGS at a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your device from Airplane Mode and set it to Boat Mode. For security reasons, please leave caps lock on while evaluating.
I mean… the repository can just be thrown in there, and it will run (if you have the system features enabled), but it will generate a lot of stuff, depending on how many branches are in the repo at the time.
The only thing that probably won't work are the aarch64 builds, which block the aggregate node builds, but that's basically just `rm -rf config/host/{nixos-aarch64,dart}.*`.

Generally it will build every package in the overlay, which *double checks* at time of writing includes a patched version of Ceph (which I guarantee you do not want to build just for the heck of it), then there's Lix, nvidia drivers…
On second thought, you probably don't want to build that.
I'm not sure whether the build failure will still happen if data is pulled from a binary cache (I think it would, since it's about the hydra build outputs, for which it would fetch the NAR and then bail on trying to process it? I think?), but you could add my cache (`hydra.cloud.bsocat.net-1:/VSAhsJt0cwx+GHJ43TojtFYkHKiYRnbbnrbv/kRBcY=` @ https://nc.benary.org), which should contain all the things at all times (including the aarch64 builds, so if that works it'd just eat a few gigs of diskspace and hopefully still fail to build the diffs).

After spending this much time trying to reproduce this using a smaller setup, I'm tempted to deter you from trying yourself and instead let the issue sit until I have the opportunity to set up something myself.
I could also offer you a box on my infrastructure which I'd throw your keys on if compute resources are a concern (the only "limitation" there would be memory, i.e. `max_concurrent_evals = 1`, `evaluator_max_memory_size = 4096`, `evaluator_workers = 1` if you wanna be on the safe side).

Thanks a lot.
I think the code you shared is a good starting point. I don't expect compute to be an issue, I'll get this to run somewhere soon and will report back when I know more.
As part of lix-project/lix#767 I have now removed the recursive-nix components from my builds (which took a while). In that context I have removed the patch contained in this PR to check whether I can still reproduce the issue.
If I cannot reproduce it (which is what it looks like as of now), then the reason why some of the files did not have their hashes was probably the recursive-nix code which is now gone.
It doesn't mean that this PR isn't still valid in one way or another, but at least it means that there should be no further way to hit the faulty code-path for now.
However given that my issue was always slightly delayed in one way or another I'll keep it running like this for a while longer and report back at a later date.
Welp, that didn't take long.
At least I can confirm that the issue still exists and is vaguely reproducible even without recursive-nix (specifically infra @ 7e422bee39 with the diff between two builds of evaluations using the exact same rev).
I'll reapply the patch on my end for now *sigh*
Just gave your reproducer a try and it seems as if the queue runner doesn't start up properly?