run3 just seems to handle what we want to do better, and requires less
deep-reaching changes to this plugin to get it to play nice than
IPC::Run::run would.
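For illustration, a minimal sketch of the run3 calling convention we
rely on; the command and arguments here are hypothetical:

    use IPC::Run3;

    my ($stdout, $stderr);
    # run3 takes the command plus scalar refs for stdin/stdout/stderr;
    # \undef gives the child an empty stdin.
    run3(["some-command", "--flag"], \undef, \$stdout, \$stderr);
    my $exit = $? >> 8;  # exit status lands in $?, as with system()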
This uses the somewhat restrictive umask of 0027 so that people outside
the owning user or group cannot read the files. This also helps inhibit
a TOCTOU race where someone else obtains a handle to our file after we
close it but before we chmod it.
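A minimal sketch of the pattern, with a hypothetical filename:

    # 0666 & ~0027 = 0640, so the file is at most group-readable.
    my $old_umask = umask 0027;
    open(my $fh, '>', '/tmp/example-output') or die "open: $!";
    print $fh "sensitive contents\n";
    close($fh) or die "close: $!";
    umask $old_umask;  # restore the previous umask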
In a Hydra instance I saw:
    possibly transient failure building ‘/nix/store/X.drv’ on ‘localhost’:
    dependency '/nix/store/Y' of '/nix/store/Y.drv' does not exist,
    and substitution is disabled
This is confusing because the Hydra in question does have substitution enabled.
This instance uses:

    keep-outputs = true
    keep-derivations = true

and an S3 binary cache which is not configured as a substituter in the
nix.conf.
It appears this instance encountered a situation where store path Y was
built and present in the binary cache, and Y.drv was GC rooted on the
instance, but Y itself was not on the host.
When Hydra went to build this path locally, it would first check whether
the path was already present in the binary cache:
(nix)

    bool valid = isValidPathUncached(storePath);

    if (diskCache && !valid)
        // FIXME: handle valid = true case.
        diskCache->upsertNarInfo(getUri(), hashPart, 0);

    return valid;
Since it was cached, the store path was considered valid.
The queue monitor would then not queue this input for substitution,
because the path is valid:
(hydra)

    if (!destStore->isValidPath(*i.second.path(*localStore, step->drv->name, i.first))) {
        valid = false;
        missing.insert_or_assign(i.first, i.second);
    }
Hydra already appears to correctly handle the case of missing paths that
need to be substituted from the binary cache, but since most Hydra
instances use `keep-outputs` *and* all paths in the binary cache
originate from that machine, it is uncommon for a path to be cached
but not GC rooted locally.
I'll run Hydra with this patch for a while and see if we run into the
problem again.
A big thanks to John Ericson who helped debug this particular issue.
I'm not sure this is a good implementation as-is. It does work,
but the password gets echoed to the screen. I tried to use IO::Prompt,
but IO::Prompt really seems to want to read the password from ARGV.
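One possible alternative, not what this patch does, would be
Term::ReadKey, assuming that module is acceptable as a dependency:

    use Term::ReadKey;

    print "Password: ";
    ReadMode('noecho');          # turn off terminal echo
    my $password = ReadLine(0);  # blocking read of one line
    ReadMode('restore');         # turn echo back on
    print "\n";
    chomp $password;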
Deleting jobsets first would fail because buildmetrics has a foreign key
to the jobset. However, the jobset / project relationship is not
marked as CASCADE. Deleting all the builds automatically cascades to
delete their buildmetrics rows, so deleting the relevant builds first
and then deleting the jobset solves it.
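A rough sketch of that ordering in plain DBI; the table and column
names here are assumptions based on the description above, not
necessarily Hydra's actual schema:

    use DBI;

    my $jobset_name = 'example-jobset';  # hypothetical identifier
    my $dbh = DBI->connect('dbi:Pg:dbname=hydra', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    # Deleting the builds cascades to their buildmetrics rows...
    $dbh->do('DELETE FROM builds WHERE jobset = ?', undef, $jobset_name);
    # ...after which the jobset row has no remaining FK dependents.
    $dbh->do('DELETE FROM jobsets WHERE name = ?', undef, $jobset_name);
    $dbh->commit;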