hydra-queue-runner: fix bad optional access #33
BuildProduct uses optionals, and the code here reads from optionals, so there is no reason to unwrap them when they can be empty.
This seems to happen when using a combination of recursive-nix with hydra's build inputs to pull in the .drv of an aggregate build.
I have two jobs which run diffs from one aggregate to another (diffing main vs. staging/next/head); one for nix-diff and one for nvd, but this only happens with nix-diff, and not exactly consistently (presumably because sometimes it has references in the database already) but often enough that I think the issue lies there specifically.
The log outputs without this patch:
After applying the patch:
In both cases the store path that hydra is trying to build is already present on the machine, hydra seemingly fails to add it to the database though.
Note that this has only been tested with lix-flavoured hydra, YMMV running with CppNix.
Either way the change seems sound: it assigns an optional to an optional, rather than unwrapping the optional without guarding against failure only to put the value back into an optional.
(the above is the commit message, below are details that I didn't want to include there)
The failing build had several "Aborted" steps on my hydra instance, with no log output at all.
The inputs are listed here because I accidentally wiped my S3 bucket three times this week while tinkering with Ceph, so I'm not sure it'll be available forever (the git repo most likely will be, though):
{ src = "https://git.shell.bsocat.net/infra?rev=dac7215ace50e78d40d0ba72cbf57a5ca62a2f4b"; buildTarget = "hydraJobs"; }
{ src = "https://git.shell.bsocat.net/infra?rev=d37fd63e618bce140c5ef408a924d4733456e8bc"; buildTarget = "hydraJobs"; }
https://git.shell.bsocat.net/infra?rev=d37fd63e618bce140c5ef408a924d4733456e8bc
https://git.shell.bsocat.net/nixpkgs?rev=f0946fa5f1fb876a9dc2e1850d9d3a4e3f914092
<src/hydra/diff.nix>
Given that this uses build inputs – of aggregate builds no less – please don't ask me how to reproduce the builds outside of hydra, or how hydra manages to get unset hashes.
If memory serves the filesize was actually set, just the hash was missing, however both are optionals anyway.
Changed title from "hydra-queue-runner: fix bad optional access" to "WIP: hydra-queue-runner: fix bad optional access"

Ugh, never mind the WIP there. I just checked the files, and the ones I happened to check were all empty; I thought that they were at least lacking the header for the different derivations being diffed, even if the content is the same.
Yeah… no. That diff was running against the exact same derivation twice, so that was legit.
I gotta go catch some sleep before looking at this PR or the code again >.<
Changed title from "WIP: hydra-queue-runner: fix bad optional access" to "hydra-queue-runner: fix bad optional access"

I'm afraid I don't follow. What is meant here specifically?
If this only depends on the DB state and not on timing, do you think it would be possible to write a small regression test for this? The Hydra testsuite is the thing that prevented me from doing stupid things quite a few times already, so having more edge-cases in there is something I'd be very interested in.
Fair enough, it is a monstrosity of a build ;-)
Basically the output of my repo yields several images (kexec, iso, etc.), some docs (nixpkgs, nixos, but also custom for ceph clusters), and the actual toplevel outputs for my machines.
Since I want to get a diff of the machines between main branch and other branches ahead of main (staging, but also one that pulls in latest nixpkgs, and one that pulls in latest versions of all dependency repos, such as lix), I need to somehow run nix-diff in hydra.
Since nix-diff runs nix in the background for querying the store this requires recursive-nix.
However recursive-nix is still "pure" in that one has to provide it with all inputs that it should be able to access.
This works fine with the "eval" input type of hydra for nvd out of the box; however, nix-diff has to diff the .drv files themselves (the derivations, not the outputs), and "eval" only yields the output.
So to depend on the derivations of a different build I have to use the "build" input specifically ("sysbuild" does not work) since it sets the drvPath attribute which in turn pulls in the entire derivation tree as a dependency when depended on.
This type of input however needs to be specified per output, meaning I don't just depend on "build of main branch", I have to depend on "output of build for my laptop of main branch".
I cannot sensibly list all the outputs as inputs since those may change between main and the other branch and I do want to see a missing or added node in a diff.
Which means I have to use an aggregate build with all nodes as constituents, because an aggregate build has magic which makes the output meaningless, but still pulls in all the constituents as dependencies.
Since the aggregate job for my nodes has all the node derivations as dependencies for its own derivation (but not output) I can pull in the aggregate job of each branch as the "build" input in a hydra job, which then uses recursive-nix to nix-diff the two aggregates.
This is the monstrosity that I have built, and it works fine for nvd. However, the nix-diff builds often just abort and are orphaned (don't ask me how that happens), because one fetch from the binary cache for an output that is already present on the hydra instance fails, presumably because hydra does not have the output in its database but has built it somewhere along the way already. Basically there are five builds: nvd and nix-diff as text output with ANSI colours each, then each gets a pass of ansi2html, for a total of four builds, and then a fifth one which simply symlinks the two HTML ones so I get a single build with two outputs that I can bookmark in my browser.
And when the nix-diff txt version fails in some sort of fetch phase for the NARs, it is treated as a transient error, with regular retries; the html version, despite having already finished building, never passes because it depends on the prior build, and always remains "building".
So I get passed builds for nvd-txt and nvd-html, nix-diff-txt continues to have occasional retried build steps that abort, and nix-diff-html just sits there waiting for nix-diff-txt despite the output already existing, same for the summarized diff.
Honestly I have no idea how to even vaguely reproduce this in terms of database state.
The thing is, the build gets picked up as usual by the queue-runner, but since all five of the builds contain a "hydra-build-products" file, hydra will try to parse parts of the content, and for some reason that I don't understand the data it has at that point does not have a hash (the exact line that the PR changes).
I don't know how this can happen, and I don't know why it needs the hash; I just threw gdb at it and stepped until I figured out where the "bad optional access" happened, looked at the struct it was assigned to, noticed that it was optional too, removed the `.value()`, tested it, and now my builds work.

However, I do think there is state that can be built to reproduce this on a smaller scale; I just don't know how to boil it down (or at least don't have the spoons to build a full minimal example of two constituent builds depending on one another, etc.).
Please note that hydra has trouble deleting a project if its jobsets have any "build" or "sysbuild" type dependencies, due to foreign key constraints, which is why I wouldn't want to test this on a production hydra instance either (I mean, it's not much trouble to set up a local test instance, but the work required adds up and… well… spoons).
So if anyone wants to take a shot at it, go ahead, I'm available for questions, but for the time being I can only answer the question of whether "I think it would be possible" and that's "very likely".
Thanks a lot for the explanation and the context.
The reason why I'm a little hesitant is that I have a theory why fileSize/sha256hash are optional: it's probably a poor way to denote that a build product is a directory (I'm pretty sure that I've seen that already). But keep in mind, this is my interpretation of the code; feel free to disagree. The `getBuildOutput` function also seems to confirm that: at least it looks like it's also assuming that the optional has a value for `fileType == regular`. Your patch would essentially break that assumption.

To me, this sounds like something you hit regularly (or at least often)? If that's the case, I'd probably throw another Hydra into a VM and see if I can trigger this to understand what's going on.
In case you don't know already, Hydra actually has a quite useful local dev env: for me, it's essentially `nix develop`, `just build`, and then `foreman start`, and I have a running Hydra on my laptop.

@ma27 wrote in #33 (comment):
The code in question is specifically concerned with the hydra build products (i.e. that funny lil file saying `doc html /path/to/thing` which shows up in the hydra UI).

Since those are build artifacts (which you can click on to view/download), I don't think that pointing to a directory there is meaningful in any way (semantically), although it is technically possible (syntactically), as the code then just skips the surrounding conditional, and therefore the `fileType == regular` case, entirely.
And in fact those files on my end are .txt and .html files respectively, just magically missing a hash.
The passing nvd and failing nix-diff outputs are actual files, the combined build is using symlinks but the preceding builds fail already before it reaches that last one.
(as far as I understand those explicit build products, they point at "the file which should be shown in the browser" even if the surrounding directory is required to view, as is the case with CSS/JS in addition to HTML, and they are only relevant to hydra's UI, not the actual build)
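For reference, builds declare these products by writing lines of the rough form `type subtype path` into `$out/nix-support/hydra-build-products`; the failing builds here would contain something along these lines (paths are hypothetical placeholders):

```
file txt /nix/store/<hash>-nix-diff/diff.txt
doc html /nix/store/<hash>-nix-diff/diff.html
```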
I mean… the datatype itself is an optional, so I would expect all code using that optional to be aware that it is, in fact, optional, and to guard against absence or handle the optional (yeah, I know, bold assumption).
So with that in mind: if the data structure being filled in the lines this PR touches declares an optional, I don't see how passing on the optional would break assumptions, since all other processing code must already work under the assumption that the value is optional. If this change triggers any unwanted behaviour (which, as of today, it does not on my patched instance), that would be a bug in the code using the optional rather than in the code setting it, IMHO (I mean… type safety and all that).
Removing the patch will trigger the issue semi-consistently (read: most of the very specific builds hit the error, just not always apparently).
As noted below, no minimal reproducer for now, but if you don't mind running a large-ish eval and builds for a few things including some isos and tarballs, config is below.
If you fork the repo (since it depends on main branches) to a non-github host (since it needs the dumb HTTP transport for retrieving refs, which github turned off) and edit some references/links in the hydra nix files (hydra subdirectory), you could technically boil it down to a single machine without isos or tarballs; given that I could push a repo to this forgejo instance to fulfil that criterion, I might do that.
screenshot of project config
The rest of this reply is highly opinionated.
So given that those are actual files as noted above, and the source and destination data structures allows optionals I don't see this as something that should be unexpected for any related code, and if it is, then the bug is in the code that assumes the optional to have a value without checking, not in the code setting the permissible optional value.
And at the very least if it were intended behaviour to err out when no value is present, it shouldn't be a red bold "bad optional access" with no indication of where or why (read: if it were intentional it should be surrounded by guards that handle the error gracefully, whether fatal or not, by printing something meaningful instead).
This is the C++ equivalent of having two Rust structs with `Option<Value>` fields and assigning `s1.x = Some(s2.x.unwrap())`, which is an obvious anti-pattern, and in lieu of any indication that it should be impossible, I would opt for fixing the reproducible¹ error of a build failing when it could succeed otherwise.

IOW: the patch does not seem to break anything; it does, however, fix something, and if something were to break, then the fix should be implemented elsewhere in the code.
¹: I have stuff on my list that has higher priority at the moment, so I can't provide a minimal reproducer, and I'm fine with this issue going stale (I do have this patch in my overlay), but replying to issues has higher prio than working on things for me, so you'll have to rely on my word of mouth claim that it is reproducible for the time being, sorry about that
Just checked, this can be done to expose a directory containing e.g. files for a manual such as mdbook.
The `serveFile` function in the Build controller explicitly checks for that and serves an `index.html` in that case.

So the thing is, we don't have a disagreement on how it should be done at all! I fully agree that dereferencing the optional and putting it back into another optional is pretty bad.

The problem is essentially that I'm in "this is legacy code I don't fully understand and thus I'm kinda defensive" mode, and right now I don't have the time to verify that what we're doing is sensible (sensible meaning in this case that there's no code relying on the fact that `fileType == tRegular` implies that the optional is set). In other words, I'd like to avoid shifting the problem to just another place. Does that make a little more sense?

That being said, a different question: did this start to happen by any chance somewhere around the middle of March? I'm a little afraid that the extensive changes to the NAR reader (after the Lix store became fully async) are related. This is like the other thing: perhaps there's another bug hiding below the surface.
So, I think I should (soonish :3) give it a try to reproduce (I assume the configs you posted can be used easily? How many drvs do I need to build on my own btw?). Sooner or later, I'd merge this (or write some sensible error reporting that's not "bad optional access"), e.g. if I don't find anything. But for me there are too many open questions right now, sorry.
@ma27 wrote in #33 (comment):
Good to know (I kinda assumed one'd point at the index.html then), thanks for clearing that up ^^
Which is 100% okay, I'm not judging.
Usually it's me playing devil's advocate, so it's an interesting change of pace.
I only built the whole thing like two weeks ago so I can't tell.
The build ran fine once or twice (presumably because it had cached artifacts from my testing back then, at least that's my working theory) and when I thought the whole construct worked and did its job the build started bailing out in the queue-runner.
Okay, so… I've taken a stab at getting this reproduced, and I have a minimal reproducer, except it doesn't reproduce.
That of course means there's some additional factor that this test does not cover, so I guess my next move would be to run the full reproducer separately, potentially by just pulling the original repo into this test, but that's for another time.
full reproducer
I tried with just the setup I described before, which I've assembled (badly) using `pkgs.linkFarm` in a directory in /etc, since hydra is unable to unpack a tarball (it can download it but will never unpack it), so I couldn't just provide the repo as a tarball via nginx, and getting a git repo or such in there didn't seem worth the effort; besides, I wanted it to be a self-contained expression.
I then failed to reproduce the issue with that setup, and noticed that the output diverged in terms of fetching NARs, which made me set up s3proxy to get a bucket in the machine.
It will take a long time to copy the entire derivation tree in there, since it depends on the `drvPath` of the aggregates, which includes a lot of stuff like `/nix/store/gp6gh2jn0x7y7shdvvwxlza4r5bmh211-stdenv-linux.drv` (which has a lot of dependencies), despite not even using stdenv/runCommand (however, that still pulls in curl, bash, etc.).

I've set the memory limit to 8GiB because I ran into OOM issues before and I didn't want it to fail after like an hour of running.
Beware of the temporary disk image residing in /tmp which may be tmpfs.
The test is built such that the machine boots up, waits for hydra to be available, sets up the bucket, then creates the declarative project, waits for all of the jobs to finish (with an incredibly high timeout for the one that'll copy the .drv files), then update the repo to trigger a second build, and then wait until that one finishes.
If at any point "bad optional" is triggered the test will fail early.
This means that the test will fail early if any of the builds fail, or if the bug is triggered, and it will succeed if all builds pass without the bug triggering.
Due to The Horrors™ contained therein, this Nix Expression is best viewed with Netscape Navigator 4.0 or below on a Pentium 3±1 emulated in Javascript on an Apple IIGS at a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your device from Airplane Mode and set it to Boat Mode. For security reasons, please leave caps lock on while evaluating.
I mean… the repository can just be thrown in there, and it will run (if you have the system features enabled), but it will generate a lot of stuff, depending on how many branches are in the repo at the time.
The only thing that probably won't work are the aarch64 builds, which block the aggregate node builds, but that's basically just `rm -rf config/host/{nixos-aarch64,dart}.*`.

Generally it will build every package in the overlay, which *double checks* at time of writing includes a patched version of Ceph (which I guarantee you do not want to build just for the heck of it), then there's Lix, nvidia drivers…
On second thought, you probably don't want to build that.
I'm not sure whether the build failure will still happen if data is pulled from a binary cache (I think it would, since it's about the hydra build outputs, for which it would fetch the NAR and then bail on trying to process it? I think?), but you could add my cache (`hydra.cloud.bsocat.net-1:/VSAhsJt0cwx+GHJ43TojtFYkHKiYRnbbnrbv/kRBcY=` @ https://nc.benary.org), which should contain all the things at all times (including the aarch64 builds, so if that works it'd just eat a few gigs of diskspace and hopefully still fail to build the diffs).

After spending this much time trying to reproduce this using a smaller setup, I'm tempted to deter you from trying yourself and instead let the issue sit until I have the opportunity to set up something myself.
I could also offer you a box on my infrastructure which I'd throw your keys on if compute resources are a concern (the only "limitation" there would be memory, i.e. `max_concurrent_evals = 1`, `evaluator_max_memory_size = 4096`, `evaluator_workers = 1` if you wanna be on the safe side).

Thanks a lot.
I think the code you shared is a good starting point. I don't expect compute to be an issue, I'll get this to run somewhere soon and will report back when I know more.
As part of lix-project/lix#767 I have now removed the recursive-nix components from my builds (which took a while). In that context I have removed the patch contained in this PR to check whether I can still reproduce the issue.
If I cannot reproduce it (which is what it looks like as of now), then the reason why some of the files did not have their hashes was probably the recursive-nix code which is now gone.
It doesn't mean that this PR isn't still valid in one way or another, but at least it means that there should be no further way to hit the faulty code-path for now.
However given that my issue was always slightly delayed in one way or another I'll keep it running like this for a while longer and report back at a later date.
Welp, that didn't take long.
At least I can confirm that the issue still exists and is vaguely reproducible even without recursive-nix (specifically infra @ 7e422bee39 with the diff between two builds of evaluations using the exact same rev).
I'll reapply the patch on my end for now *sigh*
Just gave your reproducer a try and it seems as if the queue runner doesn't start up properly?