When using the "build" or "sysbuild" jobset input types in conjunction
with a binary cache store, the evaluator needs to be able to fetch
store paths from the binary cache. Typical usage:
store_uri = s3://nix-test-cache?secret-key=...
eval_substituter = s3://nix-test-cache
Also, the public key of the binary cache must be added to
binary-cache-public-keys in nix.conf, otherwise the local nix-daemon
won't allow the store paths to be copied over.
Also, remove support in hydra-eval-jobs for multiple jobset input
alternatives. The web interface hasn't supported this in a long
time. Thus we can use the regular "--arg" handler.
This makes downloading/viewing build results work with binary cache
stores. For good performance, this should be used in conjunction with
ca580bec35,
i.e. you should set store_uri to something like
s3://my-cache?local-nar-cache=/tmp/nar-cache
to cache NARs between requests.
When creating a Hydra user with the `hydra-create-user` command, you can now
provide a SHA1 password hash with the `--password-hash` flag. This is useful for
the upcoming work on Fully Declarative Hydra, since the end user should not have
to specify plaintext passwords in their `configuration.nix` file.
Thus, we no longer hold the send lock while substituting missing paths
on the build machine. This is a good thing in particular for macOS
builders which have a tendency to hang forever in curl downloads.
Previously, when hydra-queue-runner was restarted, any pending "build
finished" notifications were lost. Now hydra-queue-runner marks
finished but unnotified builds in the database and uses that to run
pending notifications at startup.
The queue runner can now run up to ‘max-concurrent-notifications’ in
parallel (default is 2). This is useful when some hydra-notify
invocations can take a long time to complete (e.g. because they need
to compress a giant build log) and we don't want this to block all
other notifications.
As @dtzWill discovered, with the concurrent hydra-evaluator, there can
be multiple active transactions adding builds to the database. As a
result, builds can become visible in a non-monotonically increasing
order, breaking the queue monitor's assumption that build IDs only go
up.
The fix is to have hydra-eval-jobset provide the lowest build ID it
just added in the builds_added notification, and have the queue
monitor check from there.
Fixes#496.
This plugin will post to the build status system in BitBucket. In order
to use it you need to add to ExtraConfig
<bitbucket>
username = bitbucket_username
password = bitbucket_password
</bitbucket>
You can use an application password https://blog.bitbucket.org/2016/06/06/app-passwords-bitbucket-cloud/
This can take an excessive amount of time. For example, on
hydra.nixos.org, a call to hydra-notify takes 0.7s even if there are
no plugins. So for an eval with ~45K new builds, the calls to
hydra-notify add up to about 9 hours.
The proper fix would be to pass a list of build IDs, or an eval ID.
This can be used with declarative projects to build PRs.
The github_authorization section should contain verbatim Authorization header contents keyed by repo owner for private repos
1. From the hydra configuration file.
The configuration is loaded from the "git-input" block.
Currently only the "timeout" variable is been looked up in the file.
<git-input>
# general timeout
timeout = 400
<input-name>
# specific timeout for a particular input name
timeout = 400
</input-name>
# use quotes when the input name has spaces
<"foot with spaces">
# specific timeout for a particular input name
timeout = 400
</"foo with spaces">
</git-input>
2. As an argument in the input value after the repo url and branch (and after the deepClone if is defined)
"timeout=<value>"
The preference on which value is used:
1. input value
2. Block with the name of the input in the <git-input> block
3. "timeout" inside the <git-input> block
4. Default value of 600 seconds. (original hard-coded value)
The code is generalized for more values to be configured, it might be too much
for a single value on a single plugin.
Adding a 96-core aarch64 build machine to the build farm caused the
potential number of database connections to increase a lot, so we
started hitting the Postgres connection limit.
* The "Jobset" page now shows when evaluations are in progress (rather
than just pending).
* Restored the ability to do a single evaluation from the command line
by doing "hydra-evaluator <project> <jobset>".
* Fix some consistency issues between jobset status in PostgreSQL and
in hydra-evaluator. In particular, "lastCheckedTime" was never
updated internally.
Setting
xxx-jobset-repeats = patchelf:master:2
will cause Hydra to perform every build step in the specified jobset 2
additional times (i.e. 3 times in total). Non-determinism is not fatal
unless the derivation has the attribute "isDeterministic = true"; we
just note the lack of determinism in the Hydra database. This will
allow us to get stats about the (lack of) reproducibility of all of
Nixpkgs.
Builds can now specify the attribute "isDeterministic = true" to tell
Hydra to build with build-repeat > 0. If there is a mismatch between
rounds, the step / build fails with a suitable status.
Maybe this should be a meta attribute, but that makes it invisible to
hydra-queue-runner, and it seems reasonable to make a claim of
mandatory determinism part of the derivation (since e.g. enabling this
flag should trigger a rebuild).
We now take into account the memory necessary for compressing the NAR
being exported to the binary cache, plus xz compression overhead.
Also, we now release the memory tokens for the NAR accessor *after*
releasing the NAR accessor. Previously the memory for the NAR accessor
might still be in use while another thread does an allocation, causing
the maximum to be exceeded temporarily.
Also, use notify_all instead of notify_one to wake up memory token
waiters. This is not very nice, but not every waiter is requesting the
same number of tokens, so some might be able to proceed.
If a step is cancelled just as its builder step is starting,
doBuildStep() will return sRetry. This causes builder() to make the
step runnable again, since the queue monitor may have added new builds
referencing it. The idea is that if the latter condition is not true,
the step's reference count will drop to zero and it will be
deleted. However, if the dispatcher thread sees and locks the step
before the reference count can drop to zero in the builder thread, the
dispatcher thread will start a new builder thread for the step. Thus
the step can be kept alive for an indefinite amount of time.
The fix is for State::builder() to use a weak pointer to the step, to
ensure that the step's reference count can drop to zero *before* it's
added to the runnable queue.
This was a bad idea because pthread_cancel() is unsalvageable broken
in C++. Destructors are not allowed to throw exceptions (especially in
C++11), but pthread_cancel() can cause a __cxxabiv1::__forced_unwind
exception inside any destructor that invokes a cancellation
point. (This exception can be caught but *must* be rethrown.) So let's
just kill the builder process instead.
It was hitting
assert(reservation.unique());
Since we do want the machine reservation to be released before calling
wakeDispatcher(), let's use a different object for keeping track of
active steps.
We now kill active build steps when there are no more referring
builds. This is useful e.g. for preventing cancelled multi-hour TPC-H
benchmark runs from hogging build machines.
If two active steps of the same build failed, then the first would be
marked as "failed", but the second would end up as "orphaned", causing
it to be marked as "aborted" later on. Now it's correctly marked as
"failed".
Without this, if (failed or aborted) derivations have been
garbage-collected, there is no way to restart them, which is very
annoying. Now we set a forceEval flag in the jobset to cause it to be
re-evaluated even if none of the inputs have changed.
‘basicDrv.inputSrcs’ also contains the outputs of inputDrvs. These
don't necessarily exist in the local store, so copying them may cause
an exception. We should only copy the real inputSrcs.
Some Hydra API requests were vulnerable to XSRF attacks, e.g. you
could have a form on another website using http://hydra/logout as the
form action. So we now require POST requests to come from the same
origin.
Reported by Hans-Christian Esperer.
This rewrites the top-level loop of hydra-evaluator in C++. The Perl
stuff is moved into hydra-eval-jobset. (Rewriting the entire evaluator
would be nice but is a bit too much work.) The new version has some
advantages:
* It can run multiple jobset evaluations in parallel.
* It uses PostgreSQL notifications so it doesn't have to poll the
database. So if a jobset is triggered via the web interface or from
a GitHub / Bitbucket webhook, evaluation of the jobset will start
almost instantaneously (assuming the evaluator is not at its
concurrency limit).
* It imposes a timeout on evaluations. So if e.g. hydra-eval-jobset
hangs connecting to a Mercurial server, it will eventually be
killed.
This prevents the server from gradually filling up due to store paths
fetched by hydra-server that then get turned into a GC root by
hydra-update-gc-roots.
Dashboards can now be marked as publically visible in the user
preferences. The dashboard URL has changed from /user/<name>/dashboard
to /dashboard/<name> because /user/<name> requires being logged in as
<name> or as an admin.
This allows fully declarative project specifications. This is best
illustrated by example:
* I create a new project, setting the declarative spec file to
"spec.json" and the declarative input to a git repo pointing
at git://github.com/shlevy/declarative-hydra-example.git
* hydra creates a special ".jobsets" jobset alongside the project
* Just before evaluating the ".jobsets" jobset, hydra fetches
declarative-hydra-example.git, reads spec.json as a jobset spec,
and updates the jobset's configuration accordingly:
{
"enabled": 1,
"hidden": false,
"description": "Jobsets",
"nixexprinput": "src",
"nixexprpath": "default.nix",
"checkinterval": 300,
"schedulingshares": 100,
"enableemail": false,
"emailoverride": "",
"keepnr": 3,
"inputs": {
"src": { "type": "git", "value": "git://github.com/shlevy/declarative-hydra-example.git", "emailresponsible": false },
"nixpkgs": { "type": "git", "value": "git://github.com/NixOS/nixpkgs.git release-16.03", "emailresponsible": false }
}
}
* When the "jobsets" job of the ".jobsets" jobset completes, hydra
reads its output as a JSON representation of a dictionary of
jobset specs and creates a jobset named "master" configured
accordingly (In this example, this is the same configuration as
.jobsets itself, except using release.nix instead of default.nix):
{
"enabled": 1,
"hidden": false,
"description": "js",
"nixexprinput": "src",
"nixexprpath": "release.nix",
"checkinterval": 300,
"schedulingshares": 100,
"enableemail": false,
"emailoverride": "",
"keepnr": 3,
"inputs": {
"src": { "type": "git", "value": "git://github.com/shlevy/declarative-hydra-example.git", "emailresponsible": false },
"nixpkgs": { "type": "git", "value": "git://github.com/NixOS/nixpkgs.git release-16.03", "emailresponsible": false }
}
}
Currently, the hydra.nixos.org queue contains 1000s of Darwin builds
that all depend on a stdenv-darwin that previously failed. However,
before, first createStep() would construct a dependency graph for each
build, then getQueuedBuilds() would discover that one of the steps had
failed previously and discard all those steps. Since the graph
construction involves a lot of uncached calls to isValidPath(), this
took several seconds per build.
Now createStep() detects the previous failure right away and bails
out.
These are build steps that remain "busy" in the database even though
they have finished, because they couldn't be updated (e.g. due to a
PostgreSQL connection problem). To prevent them from showing up as
busy in the "Machine status" page, we now periodically purge them.
Previously, if the queue monitor thread encounters a build that Hydra
has previously built, it downloaded the output paths from the binary
cache, just to determine the build products and metrics. This is very
inefficient. In particular, when doing something like merging
nixpkgs:staging into nixpkgs:master, the queue monitor thread will be
locked up for a long time fetching files from S3, causing the build
farm to be mostly idle.
Of course this is entirely unnecessary, since the build
products/metrics are already in the Hydra database. So now we just
look up a previous build with the same output path, and copy the
products/metrics.
Mutliple <githubstatus> sections are possible:
* jobs: regexp for jobs to match
* inputs: the input which corresponds to the github repo/rev whose
status we want to report. Can be repeated
* authorization: Verbatim contents of the Authorization header. See
https://developer.github.com/v3/#authentication.
Otherwise, the browser may mix up HTML and JSON responses if it has
requested both. For example, hitting the back button to return to a
job metric page will show a JSON response, because that was the last
thing the browser fetched for that URL.
This requires Catalyst::Action::Rest >= 1.20.
The previous query
select count(*) from builds b left join buildsteps s on s.build = b.id where busy = 1 and finished = 0
is suddenly taking several minutes. Probably PostgreSQL decided to use
a suboptimal query plan.
The maximum output size per build step (as the sum of the NARs of each
output) can be set via hydra.conf, e.g.
max-output-size = 1000000000
The default is 2 GiB.
Also refactored the build error / status handling a bit.
When using a binary cache store, the queue runner receives NARs from
the build machines, compresses them, and uploads them to the
cache. However, keeping multiple large NARs in memory can cause the
queue runner to run out of memory. This can happen for instance when
it's processing multiple ISO images concurrently.
The fix is to use a TokenServer to prevent the builder threads to
store more than a certain total size of NARs concurrently (at the
moment, this is hard-coded at 4 GiB). Builder threads that cause the
limit to be exceeded will block until other threads have finished.
The 4 GiB limit does not include certain other allocations, such as
for xz compression or for FSAccessor::readFile(). But since these are
unlikely to be more than the size of the NARs and hydra.nixos.org has
32 GiB RAM, it should be fine.
The old page didn't scale very well if you have 150K builds in the
queue, in fact it tended to make browsers hang. The new one just
shows, for each jobset, the number of queued builds. The actual builds
can be seen by going to the corresponding jobset page and looking at
the evals.
Same problem as d744362e4a.
at /nix/store/ksvsbr7pg4z69bv6fbbc8h7x7rm2104m-gcc-4.9.3/include/c++/4.9.3/bits/predefined_ops.h:166
__last@entry=..., __comp=...) at /nix/store/ksvsbr7pg4z69bv6fbbc8h7x7rm2104m-gcc-4.9.3/include/c++/4.9.3/bits/stl_algo.h:1827
__comp=...) at /nix/store/ksvsbr7pg4z69bv6fbbc8h7x7rm2104m-gcc-4.9.3/include/c++/4.9.3/bits/stl_algo.h:4717
Respects <slack> blocks in the hydra config, with attributes:
* jobs: a regexp matching the job name (in the format project:jobset:job)
* url: The URL to a slack incoming webhook
* force: If true, always send messages. Otherwise, only when the build status changes
Multiple <slack> blocks are allowed