Commit graph

322 commits

Author SHA1 Message Date
Eelco Dolstra 2946899504
Turn hydra-notify into a daemon
It now receives notifications about started/finished builds/steps via
PostgreSQL. This gets rid of the (substantial) overhead of starting
hydra-notify for every event. It also allows other programs (even on
other machines) to listen to Hydra notifications.
2019-08-13 18:18:21 +02:00
Michael Bishop 3ad091faf3
allow using a shorter context and increase hydra-notify debug
(cherry picked from commit 1c76ad393669af2f728fd519a050f417319412a6)
2019-03-20 15:22:24 -04:00
Antoine Eiche 9a73ec6455 hydra-queue-runner: better error message if nix-store can not be started
The hydra-queue-runner opens a connection to the builder. If the
builder is 'localhost' it starts `nix-store`, otherwise it starts
'ssh'.

Currently, if the hydra-queue-runner can not start `nix-store` (not in
the PATH for instance), the error message is:

  cannot connect to ‘localhost’: error: cannot start ssh: No such file
  or directory

This is not useful since ssh is actually not started:/

With this patch the error message is now:

  cannot connect to ‘localhost’: error: cannot start nix-store: No such file
  or directory
2019-01-23 10:42:47 +01:00
Eelco Dolstra 423c0440ea
Typo 2018-12-20 12:07:02 +01:00
Eelco Dolstra 8d26144121
Fix building against nix master 2018-10-30 14:41:21 +01:00
Eelco Dolstra 4e27796eba
Allow setting GC_INITIAL_HEAP_SIZE for hydra-eval-jobs
This cannot be done in the hydra-evaluator systemd unit, since then
every other Nix process (e.g. hydra-evaluator and nix-prefetch-*) will
also allocate the specified heap size, probably leading to OOM.
2018-05-16 14:14:53 +02:00
Eelco Dolstra c0fac52872
Add some debug code 2018-03-07 10:23:43 +01:00
Eelco Dolstra 5a1f2a50e5
Handle derivations with system type 'builtin'
Fixes #540.
2018-03-07 10:22:35 +01:00
Eelco Dolstra 68afa2bf6f
Dump more system info in /queue-runner-status 2018-03-07 10:06:56 +01:00
Eelco Dolstra e9670641ec
Distinguish build step states
The web interface now shows whether a build step is connecting,
copying inputs/outputs, building, etc.
2017-12-07 15:35:31 +01:00
Eelco Dolstra 457483ba0e
Don't lock the BuildSteps table when inserting
Instead, optimistically insert a row and retry if there is a conflict.
2017-12-07 14:41:29 +01:00
Eelco Dolstra eef0d8861b
Remove test line 2017-10-19 13:17:29 +02:00
Eelco Dolstra b04dc6c76e
Fix root creation when the root already exists but is owned by another user 2017-10-19 12:28:38 +02:00
Eelco Dolstra cc64e51f75
USER -> LOGNAME for consistency
Don't remember why we use LOGNAME. Also ensure that it's set.
2017-10-18 11:23:00 +02:00
Will Dietz c81594f470 hydra-queue-runner: ensure roots directory exists
Fixes #513
2017-10-17 13:04:56 -05:00
Eelco Dolstra 45b138373b
hydra-queue-runner: Write GC roots for outputs paths
We lost this behaviour somewhere. So build outputs could be GC'ed when
running the collector with --option gc-keep-outputs false.
2017-10-12 18:55:38 +02:00
Eelco Dolstra 27103398c9
Make maxLogSize configurable 2017-09-22 15:23:58 +02:00
Eelco Dolstra b828224fee
Periodically close RemoteStore connections
This prevents an accumulation of temproots. See
89dc62c174.
2017-09-14 18:16:33 +02:00
Eelco Dolstra 6517446c34
Update to latest nixUnstable 2017-09-14 17:22:48 +02:00
Eelco Dolstra 4af97c57f5
Acquire the send lock only while actually sending
Thus, we no longer hold the send lock while substituting missing paths
on the build machine. This is a good thing in particular for macOS
builders which have a tendency to hang forever in curl downloads.
2017-09-01 16:28:49 +02:00
Eelco Dolstra 50ab80caf2
Don't wait forever to acquire the send lock 2017-09-01 15:29:06 +02:00
Eelco Dolstra 7c976d2aec
hydra-queue-runner: Make build notification more reliable
Previously, when hydra-queue-runner was restarted, any pending "build
finished" notifications were lost. Now hydra-queue-runner marks
finished but unnotified builds in the database and uses that to run
pending notifications at startup.
2017-07-26 15:17:51 +02:00
Will Dietz 719df63190 queue-monitor: never move lastBuildId forward without processing jobs. 2017-07-25 20:05:37 -05:00
Eelco Dolstra e117d85c2a
hydra-queue-runner: Set a thread title for the builder threads
This should make debugging slightly easier.
2017-07-25 15:59:41 +02:00
Eelco Dolstra e78b9fd4ee
hydra-queue-runner: Allow concurrent notifications
The queue runner can now run up to ‘max-concurrent-notifications’ in
parallel (default is 2). This is useful when some hydra-notify
invocations can take a long time to complete (e.g. because they need
to compress a giant build log) and we don't want this to block all
other notifications.
2017-07-24 16:35:34 +02:00
Eelco Dolstra bba383bf1b
hydra-queue-runner: Keep some notification statistics 2017-07-24 16:26:44 +02:00
Eelco Dolstra f46a21e16e
Slight cleanup 2017-07-21 17:22:11 +02:00
Eelco Dolstra dc5e0b120a
Fix a race that can cause hydra-queue-runner to ignore newly added builds
As @dtzWill discovered, with the concurrent hydra-evaluator, there can
be multiple active transactions adding builds to the database. As a
result, builds can become visible in a non-monotonically increasing
order, breaking the queue monitor's assumption that build IDs only go
up.

The fix is to have hydra-eval-jobset provide the lowest build ID it
just added in the builds_added notification, and have the queue
monitor check from there.

Fixes #496.
2017-07-21 14:34:48 +02:00
Eelco Dolstra 6fc851d376 Improve erorr message 2017-07-17 14:10:34 +02:00
Eelco Dolstra 66ae66024e Sync with latest Nix 2017-07-17 11:38:58 +02:00
Eelco Dolstra 1f94f03699
Fix build 2017-04-26 15:11:12 +02:00
Eelco Dolstra cc85208fe4
Fix build 2017-04-18 20:50:18 +02:00
Eelco Dolstra 426aea1236
hydra-queue-runner: Allow multiple concurrent daemon connections 2017-04-06 18:50:53 +02:00
Eelco Dolstra 5810042a3b
Periodically clear Store's path info cache
Otherwise the queue runner can consider paths as valid that have been
garbage-collected since the first time it queried them.
2017-04-06 17:20:23 +02:00
Eelco Dolstra 8364f4ec70
Upload log files to the right location
We were mixing up builds and steps. So for example

  https://cache.nixos.org/log/2w66a98iqbjdppc5s2b8qvhi3gprvy45-freecell-solver-4.8.0.drv

at the moment contains the log for
/nix/store/442r9d5ihbcpgq8q9dhijhvhlmplzp96-perl-namespace-autoclean-0.28.drv
because the latter is a step in http://hydra.nixos.org/build/51300420.
Oops.
2017-04-06 13:05:30 +02:00
Eelco Dolstra 4f11cf45dc
Fix build cancellation
We nowadays ignore SIGINT, so the sshd child process inherited this
and ignored SIGINT as well.
2017-04-05 11:01:57 +02:00
Eelco Dolstra 147ba3ca31
Set proper charset on log files 2017-03-31 18:00:08 +02:00
Eelco Dolstra 8771f7f913 Merge pull request #382 from shlevy/cached-build-notifications
Send BuildFinished notifications on cached build results.
2017-03-29 18:52:20 +02:00
Eelco Dolstra 57bc0eaead
hydra-queue-runner: Limit concurrent database connections
Adding a 96-core aarch64 build machine to the build farm caused the
potential number of database connections to increase a lot, so we
started hitting the Postgres connection limit.
2017-03-21 11:53:46 +01:00
Eelco Dolstra 150228d7de
Upload build logs to the binary cache 2017-03-15 16:59:57 +01:00
Eelco Dolstra 7e6486e694
Move log compression to a plugin 2017-03-15 16:59:57 +01:00
Eelco Dolstra d1afb42f12
Supress debug message 2017-03-15 16:59:56 +01:00
Eelco Dolstra 73900e9f5f Fix std::stoi exception 2017-03-08 15:07:52 +01:00
Eelco Dolstra edebdf33f0
hydra-queue-runner: Handle SIGINT 2017-03-03 12:41:00 +01:00
Eelco Dolstra 500c27e4d5
Add hydra.conf option "nar_buffer_size" to configure memoryTokens limit
It defaults to half the physical RAM.
2017-03-03 12:37:27 +01:00
Eelco Dolstra 53b1f7da64 Decrease memoryTokens 2017-02-03 14:44:52 +01:00
Eelco Dolstra a366f362e1 Use latest nixUnstable 2017-02-03 14:39:18 +01:00
Eelco Dolstra 8a120006f0
Fix version test 2016-12-08 16:03:50 +01:00
Eelco Dolstra 9989e6c0f4
Get exact build start/stop times from the remote 2016-12-07 16:10:21 +01:00
Eelco Dolstra f6081668dc
Allow determinism checking for entire jobsets
Setting

  xxx-jobset-repeats = patchelf:master:2

will cause Hydra to perform every build step in the specified jobset 2
additional times (i.e. 3 times in total). Non-determinism is not fatal
unless the derivation has the attribute "isDeterministic = true"; we
just note the lack of determinism in the Hydra database. This will
allow us to get stats about the (lack of) reproducibility of all of
Nixpkgs.
2016-12-07 15:57:13 +01:00
Eelco Dolstra 8bb36e79bd
Support testing build determinism
Builds can now specify the attribute "isDeterministic = true" to tell
Hydra to build with build-repeat > 0. If there is a mismatch between
rounds, the step / build fails with a suitable status.

Maybe this should be a meta attribute, but that makes it invisible to
hydra-queue-runner, and it seems reasonable to make a claim of
mandatory determinism part of the derivation (since e.g. enabling this
flag should trigger a rebuild).
2016-12-06 17:46:06 +01:00
Eelco Dolstra afb8765ae4
hydra-queue-runner: Bump memory limit to reflect more accurate accounting 2016-11-16 17:51:18 +01:00
Eelco Dolstra b4d32a3085
hydra-queue-runner: More accurate memory accounting
We now take into account the memory necessary for compressing the NAR
being exported to the binary cache, plus xz compression overhead.

Also, we now release the memory tokens for the NAR accessor *after*
releasing the NAR accessor. Previously the memory for the NAR accessor
might still be in use while another thread does an allocation, causing
the maximum to be exceeded temporarily.

Also, use notify_all instead of notify_one to wake up memory token
waiters. This is not very nice, but not every waiter is requesting the
same number of tokens, so some might be able to proceed.
2016-11-16 17:48:50 +01:00
Eelco Dolstra cb5e438a08 Bump Nix
Fixes #398.
2016-11-09 19:15:13 +01:00
Eelco Dolstra 1ecc8a4f40 hydra-queue-runner: Fix a race keeping cancelled steps alive
If a step is cancelled just as its builder step is starting,
doBuildStep() will return sRetry. This causes builder() to make the
step runnable again, since the queue monitor may have added new builds
referencing it. The idea is that if the latter condition is not true,
the step's reference count will drop to zero and it will be
deleted. However, if the dispatcher thread sees and locks the step
before the reference count can drop to zero in the builder thread, the
dispatcher thread will start a new builder thread for the step. Thus
the step can be kept alive for an indefinite amount of time.

The fix is for State::builder() to use a weak pointer to the step, to
ensure that the step's reference count can drop to zero *before* it's
added to the runnable queue.
2016-11-08 11:47:49 +01:00
Eelco Dolstra de9d7bcf25 hydra-queue-runner: Handle exceptions in the dispatcher thread
E.g. "resource unavailable" when creating new threads.
2016-11-08 11:25:43 +01:00
Eelco Dolstra 7863d2e1da Step cancellation: Don't use pthread_cancel()
This was a bad idea because pthread_cancel() is unsalvageable broken
in C++. Destructors are not allowed to throw exceptions (especially in
C++11), but pthread_cancel() can cause a __cxxabiv1::__forced_unwind
exception inside any destructor that invokes a cancellation
point. (This exception can be caught but *must* be rethrown.) So let's
just kill the builder process instead.
2016-11-07 19:38:24 +01:00
Eelco Dolstra d7453bd8be hydra-queue-runner: Fix message 2016-11-02 12:44:18 +01:00
Eelco Dolstra 4f08c85c69 hydra-queue-runner: Fix assertion failure
It was hitting

    assert(reservation.unique());

Since we do want the machine reservation to be released before calling
wakeDispatcher(), let's use a different object for keeping track of
active steps.
2016-11-02 12:41:00 +01:00
Eelco Dolstra b3169ce438 Kill active build steps when builds are cancelled
We now kill active build steps when there are no more referring
builds. This is useful e.g. for preventing cancelled multi-hour TPC-H
benchmark runs from hogging build machines.
2016-10-31 14:58:29 +01:00
Eelco Dolstra a816ef873d Warn against empty machines file 2016-10-31 11:40:36 +01:00
Eelco Dolstra 3b84d4711b Bump Nix 2016-10-26 15:10:56 +02:00
Eelco Dolstra 0b00d51baf Prevent orphaned build steps
If two active steps of the same build failed, then the first would be
marked as "failed", but the second would end up as "orphaned", causing
it to be marked as "aborted" later on. Now it's correctly marked as
"failed".
2016-10-26 14:42:28 +02:00
Eelco Dolstra 8e1d791d0c Truncate the log just before starting the remote build
This gets rid of all those remote substitution messages that were
polluting the build logs.
2016-10-26 13:41:51 +02:00
Eelco Dolstra 3fcfa20d1a Fix regression caused by ee2e9f53
‘basicDrv.inputSrcs’ also contains the outputs of inputDrvs. These
don't necessarily exist in the local store, so copying them may cause
an exception. We should only copy the real inputSrcs.
2016-10-24 16:49:11 +02:00
Eelco Dolstra a3efdcdfd9 Use std::regex 2016-10-21 18:06:26 +02:00
Eelco Dolstra e0b2921ff2 Concurrent hydra-evaluator
This rewrites the top-level loop of hydra-evaluator in C++. The Perl
stuff is moved into hydra-eval-jobset. (Rewriting the entire evaluator
would be nice but is a bit too much work.) The new version has some
advantages:

* It can run multiple jobset evaluations in parallel.

* It uses PostgreSQL notifications so it doesn't have to poll the
  database. So if a jobset is triggered via the web interface or from
  a GitHub / Bitbucket webhook, evaluation of the jobset will start
  almost instantaneously (assuming the evaluator is not at its
  concurrency limit).

* It imposes a timeout on evaluations. So if e.g. hydra-eval-jobset
  hangs connecting to a Mercurial server, it will eventually be
  killed.
2016-10-14 14:22:12 +02:00
Eelco Dolstra 16feddd5d4 Drop obsolete -laws-cpp-sdk-s3 2016-10-14 14:22:12 +02:00
Eelco Dolstra dd5af7637d Remove finally.hh 2016-10-14 14:22:12 +02:00
Eelco Dolstra ee2e9f5335 Update to reflect BinaryCacheStore changes
BinaryCacheStore no longer implements buildPaths() and ensurePath(),
so we need to use copyPath() / copyClosure().
2016-10-07 20:23:05 +02:00
Eelco Dolstra 6a313c691b hydra-queue-runner: Fix build 2016-10-06 16:58:54 +02:00
Alexander Ried 7089142fdc Add error/warnings for deprecated store specification 2016-10-06 15:10:14 +02:00
Alexander Ried a73f211bf2 Use store-api for binary cache instantiation 2016-10-06 15:09:44 +02:00
Alexander Ried 1c2f6281b9 Remove signing parameter (nix#f435f82) 2016-10-06 15:09:12 +02:00
Alexander Ried 232e6e8556 Replace buildVerbosity with verboseBuild (nix#5761827) 2016-10-06 15:08:02 +02:00
Alexander Ried 492d16074c Remove s3binarystore (moved to nix in d155d80) 2016-10-06 15:07:21 +02:00
Eelco Dolstra b1512a152a Fix build failure on GCC 5.4 2016-09-30 17:05:07 +02:00
Shea Levy 5962367ffc Send BuildFinished notifications on cached build results.
Fixes #342
2016-08-17 06:40:12 -04:00
Eelco Dolstra a55942603a Provide a plugin hook for when build steps finish
Fixes #318.
2016-05-27 14:35:32 +02:00
Eelco Dolstra b50a105ca7 S3BinaryCacheStore: Use disk cache 2016-04-20 15:29:40 +02:00
Eelco Dolstra afb86638cd Updates for negative .narinfo caching 2016-04-15 15:39:20 +02:00
Eelco Dolstra 177bf25d64 Queue monitor: Bail out earlier if a step has failed previously
Currently, the hydra.nixos.org queue contains 1000s of Darwin builds
that all depend on a stdenv-darwin that previously failed. However,
before, first createStep() would construct a dependency graph for each
build, then getQueuedBuilds() would discover that one of the steps had
failed previously and discard all those steps. Since the graph
construction involves a lot of uncached calls to isValidPath(), this
took several seconds per build.

Now createStep() detects the previous failure right away and bails
out.
2016-04-15 14:32:16 +02:00
Eelco Dolstra ef72569cc3 Merge pull request #280 from shlevy/github-status-api
Add a plugin to interact with the github status API.
2016-04-14 20:03:45 +02:00
Eelco Dolstra d6f188a01a Typo 2016-04-13 16:45:40 +02:00
Eelco Dolstra b1e36b550c max-output-size -> max_output_size
To be consistent with other Catalyst/Hydra config option names.
2016-04-13 16:30:52 +02:00
Eelco Dolstra 077ed3f571 Periodically clear orphaned build steps
These are build steps that remain "busy" in the database even though
they have finished, because they couldn't be updated (e.g. due to a
PostgreSQL connection problem). To prevent them from showing up as
busy in the "Machine status" page, we now periodically purge them.
2016-04-13 16:30:52 +02:00
Eelco Dolstra f3f661bac1 Reuse build products / metrics stored in the database
Previously, if the queue monitor thread encounters a build that Hydra
has previously built, it downloaded the output paths from the binary
cache, just to determine the build products and metrics. This is very
inefficient. In particular, when doing something like merging
nixpkgs:staging into nixpkgs:master, the queue monitor thread will be
locked up for a long time fetching files from S3, causing the build
farm to be mostly idle.

Of course this is entirely unnecessary, since the build
products/metrics are already in the Hydra database. So now we just
look up a previous build with the same output path, and copy the
products/metrics.
2016-04-13 16:30:52 +02:00
Eelco Dolstra 8c7edb1005 Fix narrowing conversion 2016-04-13 16:30:52 +02:00
Eelco Dolstra 00c78440b1 Disambiguate "marking build as succeeded" message 2016-04-13 16:30:52 +02:00
Eelco Dolstra ad834343b5 Fix build against current Nix master 2016-04-13 16:30:52 +02:00
Shea Levy 9b37cb89ae Add buildStarted plugin hook 2016-04-12 14:42:01 -04:00
Eelco Dolstra ddc9f3cc6a Temporarily disable machines on any exception, not just connection failures 2016-03-22 16:54:40 +01:00
Eelco Dolstra 0aecd65e59 /queue-runner-status: Include info about temporarily disabled machines 2016-03-22 16:54:06 +01:00
Eelco Dolstra 5535bc28ca Tweak 2016-03-10 16:46:15 +01:00
Eelco Dolstra 60e7930d2b Bump memory limit a bit 2016-03-10 16:46:01 +01:00
Eelco Dolstra 75e7b35477 Fix retry of transient failures 2016-03-10 16:44:26 +01:00
Eelco Dolstra 33da40f272 Doh 2016-03-09 17:31:57 +01:00
Eelco Dolstra 4b9c76e502 hydra-queue-runner: Ensure regular status dumps 2016-03-09 17:11:34 +01:00
Eelco Dolstra 4151be7e69 Make the output size limit configurable
The maximum output size per build step (as the sum of the NARs of each
output) can be set via hydra.conf, e.g.

  max-output-size = 1000000000

The default is 2 GiB.

Also refactored the build error / status handling a bit.
2016-03-09 17:00:09 +01:00
Eelco Dolstra dc790c5f7e Fix bad format string 2016-03-09 16:59:35 +01:00