Co-authored-by: Graham Christensen <graham@grahamc.com>
... but just fixing up merge conflicts from the introduction of flakes
and the removal of the Jobs table.
This is a breaking change. Previously, packages named `packageset.foo`
would be exposed in the fake derivation channel as `packageset-foo`.
Presumably this was done to avoid needing to track attribute sets, and
to avoid the complexity. I think this now correctly handles the
complexity and properly mirrors the input expressions layout.
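To illustrate the difference between the two layouts, here is a minimal Python sketch. It is not Hydra's implementation: the function names and the sample job names are hypothetical, and it assumes attribute names never contain a literal dot and that no job name is also a prefix of another.

```python
def flatten_with_dashes(jobs):
    """Old layout: 'packageset.foo' is exposed as 'packageset-foo'."""
    return {name.replace(".", "-"): drv for name, drv in jobs.items()}


def nest_attrs(jobs):
    """New layout: 'packageset.foo' becomes {'packageset': {'foo': ...}}."""
    root = {}
    for dotted, drv in jobs.items():
        *parents, leaf = dotted.split(".")
        node = root
        for name in parents:
            # Create intermediate attribute sets on demand.
            node = node.setdefault(name, {})
        node[leaf] = drv
    return root


# Hypothetical job names and derivations, for illustration only.
jobs = {"hello": "hello.drv", "packageset.foo": "foo.drv", "packageset.bar": "bar.drv"}
print(flatten_with_dashes(jobs))
print(nest_attrs(jobs))
```

The nested form mirrors the layout of the input expression, at the cost of having to track attribute sets.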
Previously, the build ID would never flow through channels which
exited.
This patch tracks the buildOne state as part of State and exits instead of
waiting forever for new work.
The code around buildOnly is a bit rough, making this a bit weird to
implement, but since it is only used for testing, the value of improving
it on its own is a bit questionable.
A reproduce script includes a logline that may resemble:
> using these flags: --arg nixpkgs { outPath = /tmp/build-137689173/nixpkgs/source; rev = "fdc872fa200a32456f12cc849d33b1fdbd6a933c"; shortRev = "fdc872f"; revCount = 273100; } -I nixpkgs=/tmp/build-137689173/nixpkgs/source --arg officialRelease false --option extra-binary-caches https://hydra.nixos.org/ --option system x86_64-linux /tmp/build-137689173/nixpkgs/source/pkgs/top-level/release.nix -A
These are passed along to nix-build and that's fine and dandy, but you can't just copy-paste this as is, as the `{}` introduces a syntax error and the value accompanying `-A` is `''`.
A very naive approach is to just `printf "%q"` the individual args, which makes them safe to copy-paste. Unfortunately, this looks awful due to the liberal use of backslashes:
```
$ printf "%q" '{ outPath = /tmp/build-137689173/nixpkgs/source; rev = "fdc872fa200a32456f12cc849d33b1fdbd6a933c"; shortRev = "fdc872f"; revCount = 273100; }'
\{\ outPath\ =\ /tmp/build-137689173/nixpkgs/source\;\ rev\ =\ \"fdc872fa200a32456f12cc849d33b1fdbd6a933c\"\;\ shortRev\ =\ \"fdc872f\"\;\ revCount\ =\ 273100\;\ \}
```
Alternatively, if we just use `set -x` before we execute nix-build, we'll get the whole invocation in a friendly, copy-pastable format that nicely displays `{}`-enclosed content and preserves the empty arg following `-A`:
```
running nix-build...
using this invocation:
+ nix-build --arg nixpkgs '{ outPath = /tmp/build-138165173/nixpkgs/source; rev = "e0e4484f2c028d2269f5ebad0660a51bbe46caa4"; shortRev = "e0e4484"; revCount = 274008; }' -I nixpkgs=/tmp/build-138165173/nixpkgs/source --arg officialRelease false --option extra-binary-caches https://hydra.nixos.org/ --option system x86_64-linux /tmp/build-138165173/nixpkgs/source/pkgs/top-level/release.nix -A ''
```
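For comparison, the same "copy-pastable invocation" idea can be sketched outside the shell. This is only an illustration in Python, not what the reproduce script does; the argument values are shortened, hypothetical stand-ins for the real ones.

```python
import shlex

# Hypothetical, shortened argument vector; note the empty argument after -A.
args = [
    "nix-build",
    "--arg", "nixpkgs", '{ outPath = /tmp/build/nixpkgs/source; rev = "abc"; }',
    "--arg", "officialRelease", "false",
    "release.nix",
    "-A", "",
]

# shlex.join quotes only the arguments that need it, using single quotes,
# so the {}-enclosed value stays readable and the empty argument survives.
line = shlex.join(args)
print(line)

# Splitting the line again yields the original argument vector.
assert shlex.split(line) == args
```

Like `set -x`, this keeps the `{}`-enclosed content readable instead of escaping every space.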
The queue runner used to special-case `localhost` as a remote builder:
rather than using the normal remote-build path (the
`cmdBuildDerivation` command), it used the `cmdBuildPaths` command
(generally less efficient, except when running against localhost),
because the latter didn't require a privileged Nix user. That made
testing easier, in particular by allowing Hydra to run in a container.
However:
1. This means that the build loop can follow two distinct code paths depending
on the setup, the irony being that the most commonly used one in production
(the “non-localhost” case) isn't the one used in the testsuite (because all
the tests run against a local store);
2. It turns out that the “localhost” version is buggy in relatively obvious
ways: in particular, a failure in a fixed-output derivation or a hash
mismatch isn't reported properly;
3. If the “run in a container” use-case is indeed that important, it can be
(partially) restored using a chroot store (which wouldn't behave exactly
the same way of course, but would be more than good enough for testing).
The current check happening in jobsets is incorrect.
The wanted constraint is as follows:
- If type is 0 (legacy), then the flake field should be null, and
both nixExprInput and nixExprPath should be non-null
- If type is 1 (flake), then the flake field should be non-null, and
both nixExprInput and nixExprPath should be null
The current version will not catch (i.e. it will accept) situations
where you have, for instance:
type = 1, nixExprPath null, nixExprInput non-null, flake non-null
This commit fixes that.
I split that into two constraints, to make it more readable and
easier to extend if a new type appears in the future.
The complete query could instead be:
( type = 0
AND nixExprInput IS NOT NULL AND nixExprPath IS NOT NULL AND flake IS NULL )
OR ( type = 1
AND nixExprInput IS NULL AND nixExprPath IS NULL AND flake IS NOT NULL )
(but an "OR" cannot be split, hence the other formulation)
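The failing case can be checked mechanically. The sketch below is a Python model, not the actual migration: it assumes the check in place was the pair of equalities `(type = 0) = (nixExprInput is not null and nixExprPath is not null)` and `(type = 1) = (flake is not null)`, and it assumes the two new constraints are written as implications (one per type). The function names are hypothetical.

```python
def old_check(type_, nix_expr_input, nix_expr_path, flake):
    """Assumed previous check: two equalities between booleans."""
    return ((type_ == 0) == (nix_expr_input is not None and nix_expr_path is not None)
            and (type_ == 1) == (flake is not None))


def legacy_ok(type_, nix_expr_input, nix_expr_path, flake):
    """type = 0 implies: input and path set, flake null."""
    return type_ != 0 or (nix_expr_input is not None
                          and nix_expr_path is not None
                          and flake is None)


def flake_ok(type_, nix_expr_input, nix_expr_path, flake):
    """type = 1 implies: input and path null, flake set."""
    return type_ != 1 or (nix_expr_input is None
                          and nix_expr_path is None
                          and flake is not None)


def new_check(*row):
    return legacy_ok(*row) and flake_ok(*row)


# type = 1, nixExprPath null, nixExprInput non-null, flake non-null
bad_row = (1, "src", None, "github:example/repo")
print(old_check(*bad_row), new_check(*bad_row))
```

Note that two implications only match the single OR query when `type` is 0 or 1; any other type value passes both implications, which is what leaves room for a future type.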
DBIx likes to eagerly select all columns, and there is no good way to tell it
not to. Therefore, this splits this one large column into its own
table.
I'd also like to make "jobsets" use this table too, but that is on hold
to stop the bleeding caused by the extreme amount of traffic this is
causing.
The database has these constraints:
check ((type = 0) = (nixExprInput is not null and nixExprPath is not null)),
check ((type = 1) = (flake is not null)),
which prevented switching to flakes in a declarative jobspec, since the
nixexpr{path,input} fields were not nulled in such an update.
Co-Authored-By: Graham Christensen <graham@grahamc.com>
This search query is pretty heavy. Defaulting to 500 has caused
Hydra's web UI to appear to be down. Since 500 can take it down, users
probably shouldn't be allowed to ask for that many.
Duplicating this data on every record of the builds table cost
approximately 4G of duplication.
Note that the database migration included took about 4h45m on an
untuned server which uses very slow rotational disks in a RAID5 setup,
with not a lot of RAM. I imagine in production it might take an hour
or two, but not 4. If this should become a chunked migration, I can do
that.
Note: Because of the question about chunked migrations, I have NOT
YET tested this migration thoroughly enough for merge.
Looking at AWS' Performance Insights for a Hydra instance, I found
the hydra-queue-runner's query:
select id, buildStatus, releaseName, closureSize, size
from Builds b
join BuildOutputs o on b.id = o.build
where
finished = ?
and (buildStatus = ? or buildStatus = ?)
and path = $1
was the slowest query by at least 10x. Running an explain on this
showed why:
hydra=> explain select id, buildStatus, releaseName, closureSize, size
from Builds b join BuildOutputs o on b.id = o.build where
finished = 1 and (buildStatus = 0 or buildStatus = 6) and
path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';
QUERY PLAN
------------------------------------------------------------------------
Gather (cost=1000.43..33718.98 rows=2 width=56)
Workers Planned: 2
-> Nested Loop (cost=0.43..32718.78 rows=1 width=56)
-> Parallel Seq Scan on buildoutputs o (cost=0.00..32710.32
rows=1
width=4)
Filter: (path = '/nix/store/s93kh...snip...'::text)
-> Index Scan using indexbuildsonjobsetidfinishedid on builds b
(cost=0.43..8.45 rows=1 width=56)
Index Cond: ((id = o.build) AND (finished = 1))
Filter: ((buildstatus = 0) OR (buildstatus = 6))
(8 rows)
A parallel sequential scan is definitely better than a sequential scan, but the
cost ranging from 0 to 32710 is not great. Looking at the table, I saw the `path`
column is completely unindexed:
hydra=> \d buildoutputs
Table "public.buildoutputs"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
build | integer | | not null |
name | text | | not null |
path | text | | not null |
Indexes:
"buildoutputs_pkey" PRIMARY KEY, btree (build, name)
Foreign-key constraints:
"buildoutputs_build_fkey" FOREIGN KEY (build) REFERENCES builds(id)
ON DELETE CASCADE
Since we always do exact matches on the path and don't care about ordering,
and since the path column is very high cardinality, a `hash` index is a
good candidate. Note that I did test a btree index and it performed
similarly well, but slightly worse.
After creating the index (this took about 10 seconds) on a test database:
create index IndexBuildOutputsPath on BuildOutputs using hash(path);
We get a *significantly* reduced cost:
hydra=> explain select id, buildStatus, releaseName, closureSize, size
hydra-> from Builds b join BuildOutputs o on b.id = o.build where
hydra-> finished = 1 and (buildStatus = 0 or buildStatus = 6) and
hydra-> path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.43..41.41 rows=2 width=56)
-> Index Scan using buildoutputs_path_hash on buildoutputs o (cost=0.00..16.05 rows=3 width=4)
Index Cond: (path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1'::text)
-> Index Scan using indexbuildsonjobsetidfinishedid on builds b (cost=0.43..8.45 rows=1 width=56)
Index Cond: ((id = o.build) AND (finished = 1))
Filter: ((buildstatus = 0) OR (buildstatus = 6))
(6 rows)
For direct comparison, the overall query plan was changed:
From: Gather (cost=1000.43..33718.98 rows=2 width=56)
To: Nested Loop (cost=0.43..41.41 rows=2 width=56)
and the query plan for buildoutputs changed from a maximum cost of
32,710 down to 16.
In practical terms, the query's planning and execution time was reduced:
Before (ms) | Try 1 | Try 2 | Try 3
------------+---------+---------+--------
Planning | 0.898 | 0.416 | 0.383
Execution | 138.644 | 172.331 | 375.585
After (ms) | Try 1 | Try 2 | Try 3
------------+---------+---------+--------
Planning | 0.298 | 0.290 | 0.296
Execution | 219.625 | 0.035 | 0.034
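The scan-to-index-lookup change can be reproduced in miniature. The following Python sketch uses an in-memory SQLite database purely as a stand-in: SQLite has no hash indexes, so the index here is a B-tree, and the table contents are made up. It only shows that the plan for an exact match on `path` switches from a scan to an index search once the column is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table BuildOutputs (build integer not null, name text not null,"
    " path text not null, primary key (build, name))"
)
# Made-up rows; the store paths are placeholders, not real outputs.
conn.executemany(
    "insert into BuildOutputs values (?, 'out', ?)",
    [(i, f"/nix/store/example-{i}") for i in range(1000)],
)


def plan(connection):
    """Return SQLite's query plan for an exact-match lookup on path."""
    rows = connection.execute(
        "explain query plan select build from BuildOutputs where path = ?",
        ("/nix/store/example-42",),
    ).fetchall()
    # The human-readable plan text is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


before = plan(conn)
conn.execute("create index IndexBuildOutputsPath on BuildOutputs (path)")
after = plan(conn)
print("before:", before)
print("after: ", after)
```

The absolute costs and timings from PostgreSQL above are not comparable with anything SQLite reports; only the shape of the plan change carries over.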
Requires the following configuration options
enable_github_login = 1
github_client_id
github_client_secret
Or github_client_secret_file which points to a file with the secret
Fixes this error:
ERROR: failed to process declarative jobset test:inputs,
DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st
execute failed: ERROR: null value in column "emailoverride" violates
not-null constraint
This would start happening if the network connection between the Hydra
server and the remote build server breaks after successfully importing
at least one output of a derivation, but before having finished
importing all outputs.
Fixes #816.
These make the hydra-queue-runner logs very noisy even when not using the GitlabStatus plugin.
Also, they shouldn't be necessary except when developing the plugin itself and should have been removed before release.
It might happen that a job from the aggregate returned an error!
This is what the vague "[json.exception.type_error.302] type must be string, but is null"
was all about in this instance; there was no `drvPath` to stringify!
So we now actively watch for errors and copy them to the aggregate job.
The vague "[json.exception.type_error.302] type must be string, but is null"
is **absolutely** unhelpful in the way Hydra currently handles it on
evaluation.
This is handling *unexpected* errors only; the following commit will
handle the specific instance of the previously mentioned error.
Recently a few internal APIs have changed[1]. The `outputPaths` function
has been removed and a lot of data structures are modeled with
`std::optional` which broke compilation.
This patch updates the code in `hydra-queue-runner` accordingly to make
sure that Hydra compiles again.
[1] https://github.com/NixOS/nix/pull/3883
With the current implementation, if ANY hash was found inside the decl
spec, the spec would be treated as static. This is problematic since
`inputs` is a hash and hence any configuration would be handled as a
static one.
This fixes the code to match the documentation and only switch to static
processing when ALL values are hashes.
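The difference between the two rules comes down to "any" versus "all". A minimal Python sketch, with dicts standing in for Perl hashes (the function names and the sample specs are hypothetical):

```python
def is_static_old(spec):
    """Old behaviour: a single hash value makes the whole spec static."""
    return any(isinstance(value, dict) for value in spec.values())


def is_static_new(spec):
    """Fixed behaviour: static only when every value is a hash."""
    return all(isinstance(value, dict) for value in spec.values())


# A regular declarative spec: `inputs` is a hash, the other values are not.
dynamic_spec = {
    "enabled": 1,
    "nixexprinput": "src",
    "nixexprpath": "jobsets.nix",
    "inputs": {"src": {"type": "git", "value": "https://example.org/repo.git"}},
}

# A static spec: every top-level value is itself a jobset definition.
static_spec = {
    "master": {"enabled": 1, "inputs": {}},
    "staging": {"enabled": 1, "inputs": {}},
}

print(is_static_old(dynamic_spec), is_static_new(dynamic_spec))
print(is_static_old(static_spec), is_static_new(static_spec))
```

With the old rule the first spec is misclassified as static solely because of its `inputs` key.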
As of https://github.com/NixOS/hydra/pull/737 (removal of sqlite
dependency), the only supported database is PostgreSQL.
This change removes all references to the hydra-postgresql.sql file. This
file is generated by running cpp on hydra.sql, but doesn't differ from
hydra.sql at all.
The PathInput plugin keeps a cache of path evaluations. This cache is simple: a
path is not checked more than once every N seconds, where N=30. The caching is
there to avoid expensive calls to `nix-store --add`.
This change makes the validity period configurable. The main use case is
`api-test.pl`, which was implemented incorrectly for a while, as the invocation of
`hydra-eval-jobset` would return the previous evaluation, claiming there are no
changes. The test has been fixed to check better for a new evaluation.
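The caching behaviour described above can be sketched as a small time-based cache. This is a Python illustration, not the plugin's Perl code; the class name, the `validity` parameter name and the fake `add_path` callback are assumptions made for the example.

```python
import time


class PathInputCache:
    """Re-check a path at most once per `validity` seconds."""

    def __init__(self, add_path, validity=30, clock=time.monotonic):
        self.add_path = add_path    # stand-in for the expensive `nix-store --add`
        self.validity = validity    # seconds a cached result stays valid
        self.clock = clock
        self.entries = {}           # path -> (timestamp, store path)

    def fetch(self, path):
        now = self.clock()
        hit = self.entries.get(path)
        if hit is not None and now - hit[0] < self.validity:
            return hit[1]           # still valid: skip the expensive call
        store_path = self.add_path(path)
        self.entries[path] = (now, store_path)
        return store_path


# Demonstration with a fake clock and a fake "add" operation.
fake_now = [0.0]
adds = []


def fake_add(path):
    adds.append(path)
    return f"/nix/store/fake-{len(adds)}"


cache = PathInputCache(fake_add, validity=30, clock=lambda: fake_now[0])
cache.fetch("/src")
cache.fetch("/src")      # served from the cache
fake_now[0] = 31.0
cache.fetch("/src")      # entry expired, path is added again
print(len(adds))
```

In this sketch a validity of 0 disables the cache entirely, which is the kind of setting a test that needs a fresh evaluation on every run would want.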
The `build_finished` Postgres event will never be fired for the dependent builds.
For example, on our Hydra, the following query always returns increasing
numbers, even though all notifications have been delivered:
```
hydra=> select count(1) from builds where notificationpendingsince is not null;
count
-------
4583
(1 row)
```
Thus, we have to iterate over all dependent builds and mark their
`notificationpendingsince` as `null`, otherwise they will pile up until
the next restart of hydra-notify, when they will get delivered.
When deploying a Hydra instance other than hydra.nixos.org, one may encounter a problem
where building any job that uses IFD fails with:
May 22 19:41:07 hydra hydra-evaluator[6960]: error: "attempted to realize '/nix/store/1jm02mfiv58rpy8zrx95cpqxzsp64ssh-source.drv' during evaluation but 'allow-import-from-derivation' is false"
May 22 19:41:07 hydra hydra-evaluator[6960]: error: "attempted to realize '/nix/store/av3jr8ix4qcadq2wm3y3hplvxwzlhl4y-source.drv' during evaluation but 'allow-import-from-derivation' is false"
May 22 19:41:07 hydra hydra-evaluator[6960]: error: "attempted to realize '/nix/store/2jm02mfiv58rpy8zrx95cpqxzsp64ssh-source.drv' during evaluation but 'allow-import-from-derivation' is false"
The recent change enforced passing `--no-allow-import-from-derivation`
to `hydra-eval-job` unconditionally. This change makes it configurable and
defaults to **NOT PASSING IT** -- most of the deployments allow IFDs.
The configuration option is called `allow_import_from_derivation` and
defaults to `true`. It is interpreted as a boolean, and the only value treated as
true is `true`.
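A minimal sketch of that interpretation, in Python rather than Hydra's Perl. The function name and the dict-based config are hypothetical; only the option name, its default and the flag come from the text above.

```python
def eval_flags(config):
    """Return the extra evaluator flags implied by the configuration."""
    # Missing option means "true"; any value other than the literal
    # string "true" is treated as false.
    allow = config.get("allow_import_from_derivation", "true") == "true"
    return [] if allow else ["--no-allow-import-from-derivation"]


print(eval_flags({}))
print(eval_flags({"allow_import_from_derivation": "false"}))
print(eval_flags({"allow_import_from_derivation": "1"}))
```

Under this reading, values such as `1` or `yes` disable IFD just like `false` does.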
Taken from `Perl::Critic`:
A common idiom in perl for dealing with possible errors is to use `eval`
followed by a check of `$@`/`$EVAL_ERROR`:
    eval {
        ...
    };
    if ($EVAL_ERROR) {
        ...
    }
There's a problem with this: the value of `$EVAL_ERROR` (`$@`) can change
between the end of the `eval` and the `if` statement. The issue is object
destructors:
    package Foo;
    ...
    sub DESTROY {
        ...
        eval { ... };
        ...
    }
    package main;
    eval {
        my $foo = Foo->new();
        ...
    };
    if ($EVAL_ERROR) {
        ...
    }
Assuming there are no other references to `$foo` created, when the
`eval` block in `main` is exited, `Foo::DESTROY()` will be invoked,
regardless of whether the `eval` finished normally or not. If the `eval`
in `main` fails, but the `eval` in `Foo::DESTROY()` succeeds, then
`$EVAL_ERROR` will be empty by the time that the `if` is executed.
Additional issues arise if you depend upon the exact contents of
`$EVAL_ERROR` and both `eval`s fail, because the messages from both will
be concatenated.
Even if there isn't an `eval` directly in the `DESTROY()` method code,
it may invoke code that does use `eval` or otherwise affects
`$EVAL_ERROR`.
The solution is to ensure that, upon normal exit, an `eval` returns a
true value and to test that value:
    # Constructors are no problem.
    my $object = eval { Class->new() };
    # To cover the possibility that an operation may correctly return a
    # false value, end the block with "1":
    if ( eval { something(); 1 } ) {
        ...
    }
    eval {
        ...
        1;
    }
    or do {
        # Error handling here
    };
Unfortunately, you can't use the `defined` function to test the result;
`eval` returns an empty string on failure.
Various modules have been written to take some of the pain out of
properly localizing and checking `$@`/`$EVAL_ERROR`. For example:
    use Try::Tiny;
    try {
        ...
    } catch {
        # Error handling here;
        # The exception is in $_/$ARG, not $@/$EVAL_ERROR.
    }; # Note semicolon.
"But we don't use DESTROY() anywhere in our code!" you say. That may be
the case, but do any of the third-party modules you use have them? What
about any you may use in the future or updated versions of the ones you
already use?
The original code would return the standard "Please come back later" page when there
are only fetch errors on a newly set up declarative project. The problem is that
there are two types of errors: standard errors and fetch errors. Each is
accompanied by a corresponding field for the time of occurrence. Standard errors use
'errortime', while fetch errors have 'lastchecktime' set to the time of the
error. Unfortunately, the jobset.tt file was only using 'errortime' for displaying
the time. This would result in the following errors in logs:
Couldn't render template "date error - bad time/date string: expects 'h:m:s d:m:y' got: ''
This change includes using 'lastchecktime' when rendering the error times.
The current implementation passes all values to the `create_or_update` method. The
missing values end up as `undef` (or `NULL`) when assigned to `%update`.
Thus, for columns that are NOT NULL, an update where, for example, flakes are not
used results in a horrible:
DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed:
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (.jobsets, 118, hydra, hydra jobsets, src, hydra/jobsets.nix, null,
null, null, 1589536378, 1, 0, 0, , 3, 30, 100, null, null, 1589536379, null, null). [for Statement
"UPDATE jobsets SET checkinterval = ?, description = ?, enableemail = ?, nixexprinput = ?,
nixexprpath = ?, type = ? WHERE ( ( name = ? AND project = ? ) )" with ParamValues: 1='30',
2='hydra jobsets', 3='0', 4='src', 5='hydra/jobsets.nix', 6=undef, 7='.jobsets', 8='hydra'] at
/nix/store/lsf81ip9ybxihk5praf2n0nh14a6i9j0-hydra-0.1.19700101.DIRTY/libexec/hydra/lib/Hydra/Helper/AddBuilds.pm line 50
This change just omits adding such values to `%update`, which results in
PostgreSQL assigning the default values.
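The idea of the fix, sketched in Python with `None` standing in for Perl's `undef`. The function name and the sample values are hypothetical; the column names are taken from the error message above.

```python
def build_update(values):
    """Keep only the columns that actually have a value.

    Columns left out of the update are not touched by the statement, so
    NOT NULL columns never receive an explicit NULL.
    """
    return {column: value for column, value in values.items() if value is not None}


# `type` is missing from the declarative spec, so it arrives as None.
values = {
    "checkinterval": 30,
    "description": "hydra jobsets",
    "enableemail": 0,
    "nixexprinput": "src",
    "nixexprpath": "hydra/jobsets.nix",
    "type": None,
}
print(build_update(values))
```

A legitimate `0` (as for `enableemail`) is kept; only genuinely missing values are dropped.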
The previous code converted option values to ints when the value
contained a digit somewhere. This is too eager since it also converts
strings like `release-0.2` to an int which should not happen.
We now only convert to int when the value is an integer.
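The two rules can be compared with a short Python sketch. The regular expressions are assumptions about what "contains a digit" and "is an integer" mean here (unsigned decimal digits only); the function names are hypothetical.

```python
import re


def wants_int_old(value):
    """Old rule: any digit anywhere triggers the conversion."""
    return re.search(r"[0-9]", value) is not None


def wants_int_new(value):
    """New rule: the whole value must consist of digits."""
    return re.fullmatch(r"[0-9]+", value) is not None


def convert(value):
    """Convert option values to int only when they are integers."""
    return int(value) if wants_int_new(value) else value


for value in ["30", "release-0.2", "v2"]:
    print(value, wants_int_old(value), wants_int_new(value), repr(convert(value)))
```

Strings such as `release-0.2` match the old rule but not the new one, so they are now passed through unchanged.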
This plugin is a counterpart to GithubPulls plugin. Instead of fetching pull
requests, it will fetch all references (branches and tags) that start with a
particular prefix.
The plugin is a copy of GithubPulls plugin with appropriate changes to call the
right API and parse the config matching the need.
To quote the function's comment:
Awful hack to handle timeouts in SQLite: just retry the transaction.
DBD::SQLite *has* a 30 second retry window, but apparently it
doesn't work.
Since SQLite is now dropped entirely, this wrapper can be removed
completely.
SQLite hasn't been properly supported by Hydra for a few years now[1], but
Hydra still depends on it. Apart from a slightly bigger closure, this can
confuse users, since Hydra picks up SQLite rather than
PostgreSQL by default if HYDRA_DBI isn't configured properly[2].
[1] 78974abb69
[2] https://logs.nix.samueldr.com/nixos-dev/2020-04-10#3297342;
If we don't see a machine that supports a build step for
'max_unsupported_time' seconds, the step is aborted. The default is 0,
which is appropriate for Hydra installations that don't provision
missing machines dynamically.
(cherry picked from commit f5cdbfe21d)
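A rough sketch of the timeout rule, in Python. This is only a model of the description above, not the queue runner's C++ code: the function and parameter names other than `max_unsupported_time` are hypothetical, and the exact comparison (`>=` rather than `>`) is an assumption.

```python
def should_abort(now, last_supported, max_unsupported_time=0):
    """Abort a step once no supporting machine has been seen for
    `max_unsupported_time` seconds.

    `last_supported` is the last time (in seconds) at which some machine
    supporting the step was known.
    """
    return now - last_supported >= max_unsupported_time


# Default of 0: abort as soon as no machine supports the step.
print(should_abort(now=100, last_supported=100))
# With a grace period, the step waits for a machine to be provisioned.
print(should_abort(now=100, last_supported=50, max_unsupported_time=120))
print(should_abort(now=200, last_supported=50, max_unsupported_time=120))
```

Installations that provision machines dynamically would set a non-zero value so that steps survive the provisioning delay.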
When I browse failed builds in a jobset-eval on Hydra, I regularly
mistake temporary issues like timeouts (that probably disappear at the
next eval) for actual build-failures.
To prevent this kind of issue, I figured that using the stopsign-svg for
builds with timeouts or exceeded log-limits is a reasonable choice for
the following reasons:
* A user can now distinguish between actual build-errors (like
compilation-failures or oversized outputs) and (usually) temporary issues
(like a bloated log or a timeout).
* The stopsign is also used for aborted jobs that are shown in a
different tab and can't be confused with timeouts for that reason.
Declarative jobsets were broken by the Nix update, causing
nix cat-file to break silently.
This commit restores declarative jobsets, based on top of a commit
making it easier to see what broke.
In the past, jobsets which are automatically evaluated were evaluated
regularly, on a schedule. This schedule means a new evaluation is
created every checkInterval seconds (assuming something changed).
This model works well for architectures where our build farm can
easily keep up with demand.
This commit adds a new type of evaluation, called ONE_AT_A_TIME, which
only schedules a new evaluation if the previous evaluation of the
jobset has no unfinished builds.
This model of evaluation lets us have 'low-tier' architectures.
For example, we could now have a jobset for ARMv7l builds, where
the buildfarm only has a single, underpowered ARMv7l builder.
Configuring that jobset as ONE_AT_A_TIME will create an evaluation
and then won't schedule another evaluation until every job of
the existing evaluation is complete.
This way, the cache will have a complete collection of pre-built
software for some commits, but the underpowered architecture will
never become backlogged in ancient revisions.
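The scheduling decision can be sketched as follows. This is a simplified Python model, not the evaluator's actual code: apart from `ONE_AT_A_TIME` and the check interval, the names are hypothetical, and the "something changed" condition is left out.

```python
from enum import Enum


class Schedule(Enum):
    REGULAR = 1
    ONE_AT_A_TIME = 2


def should_evaluate(mode, now, last_checked, check_interval, unfinished_builds):
    """Decide whether a jobset is due for a new evaluation."""
    if now - last_checked < check_interval:
        return False   # not due yet under either model
    if mode is Schedule.ONE_AT_A_TIME and unfinished_builds > 0:
        return False   # previous evaluation still has work left
    return True


# An underpowered builder still working through the previous evaluation:
print(should_evaluate(Schedule.REGULAR, 1000, 0, 300, unfinished_builds=12))
print(should_evaluate(Schedule.ONE_AT_A_TIME, 1000, 0, 300, unfinished_builds=12))
print(should_evaluate(Schedule.ONE_AT_A_TIME, 1000, 0, 300, unfinished_builds=0))
```

Under the regular model the backlog keeps growing while builds are pending; under `ONE_AT_A_TIME` the next evaluation is deferred until the queue for that jobset is empty.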
A postgresql column which is non-null and unique is treated with
the same optimisations as a primary key, so we have no need to
try and recreate the `id` as the primary key.
No read paths are impacted by this change, and the database will
automatically create an ID for each insert. Thus, no code needs to
change.
hydra.nixos.org is already running this rev, and it should be safe to
apply to everyone else. If we make changes to this migration, we'll
need to write another migration anyway.
Lowercasing is due to postgresql not having case-sensitive table names.
It always technically worked before, but those table names never
existed literally.
The switch to generating from postgresql is to handle an upcoming
addition of an auto-incrementing ID to the Jobset table. Sqlite doesn't
seem to be able to handle the table having an auto-incrementing ID
field which isn't the primary key, but we can't change the primary
key trivially.
Since hydra doesn't support sqlite and hasn't for many years anyway,
it is easier to just generate from pgsql directly.
Building on macOS with the latest nixpkgs master and NixOS/nixpkgs#77147
fails. It seems some `std::experimental` types (`optional`, for instance) are
not available under `experimental`, but are in `std`. Also, `toJSON` is
missing for `atomic< unsigned long long >`.
In a NixOS container, cmdBuildDerivation doesn't work because we're
not privileged. But we also don't need it because the store already
has the derivation.
Also, don't copy from/to the store since this gives errors about
missing signatures.