Commit graph

3018 commits

Author SHA1 Message Date
Maximilian Bosch ee1234c15c ignoreException has been split into two
The Finally part is a destructor, so using `ignoreExceptionInDestructor`
seems to be the correct choice here.
2024-10-07 19:22:32 +02:00
Maximilian Bosch 7c7078cccf Fix build with latest Lix
Since ca1dc3f70bf98e2424b7b2666ee2180675b67451, the NAR parser has moved
the preallocate & receive steps into the file handle class to remove the
assumption that only one file can be handled at a time.
2024-10-07 19:22:32 +02:00
Ilya K f23ec71227 Add metric for builds waiting for download slot 2024-10-01 19:14:24 +03:00
Maximilian Bosch 6a88e647e7
flake.lock: Update; fix build
Flake lock file updates:

• Updated input 'lix':
    'git+https://git.lix.systems/lix-project/lix?ref=refs/heads/main&rev=278fddc317cf0cf4d3602d0ec0f24d1dd281fadb' (2024-08-17)
  → 'git+https://git.lix.systems/lix-project/lix?ref=refs/heads/main&rev=02eb07cfd539c34c080cb1baf042e5e780c1fcc2' (2024-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c3d4ac725177c030b1e289015989da2ad9d56af0' (2024-08-15)
  → 'github:NixOS/nixpkgs/6e99f2a27d600612004fbd2c3282d614bfee6421' (2024-08-30)
2024-09-02 10:53:46 +02:00
Pierre Bourdon 8d5d4942e1
queue-runner: remove unused method from State 2024-08-27 02:57:37 +02:00
Pierre Bourdon e5a8ee5c17
web: require permissions for /api/push 2024-08-27 02:57:16 +02:00
Pierre Bourdon fd7fd0ad65
treewide: clang-tidy modernize 2024-08-27 01:33:12 +02:00
Pierre Bourdon d3fcedbcf5
treewide: enable clang-tidy bugprone findings
Fix some trivial findings throughout the codebase, mostly making
implicit casts explicit.
2024-08-27 00:43:17 +02:00
Pierre Bourdon 3891ad77e3
queue-runner: change Machine object creation to work around clang bug
https://github.com/llvm/llvm-project/issues/106123
2024-08-26 22:34:48 +02:00
emily ab6d81fad4
api: fix github webhook 2024-08-26 20:26:21 +02:00
Sandro 64df0cba47
Match URIs that don't end in .git
Co-authored-by: Charlotte <lotte@chir.rs>
2024-08-26 20:26:21 +02:00
Sandro Jäckel 6179b298cb
Add gitea push hook 2024-08-26 20:26:20 +02:00
Pierre Bourdon 44b9a7b95d
queue-runner: handle broken pg pool connections in builder code
Completes 9b62c52e5c with another location
that was initially missed.
2024-08-25 22:05:13 +02:00
Maximilian Bosch 3ee51dbe58 readIntoSocket: fix with store URIs containing an &
The third argument to `open()` in `-|` mode is passed to a shell if it's
a string. In my case the store URI contains
`?secret-key=${signingKey.directory}/secret&compression=zstd`

For the `nix store cat` case this means that

* until `&` the process will be started in the background. This fails
  immediately because no path to cat is specified.
* `compression=zstd` is a variable assignment
* the `$path` argument to `store cat` is attempted to be executed as
  another command

Passing just the list solves the problem.
2024-08-18 21:41:54 +00:00
Maximilian Bosch e987f74954
doc: drop dev-notes & make update-dbix more discoverable
`dev-notes` are severely outdated. I dropped everything except one note
that I moved to hacking.md. The parts about creating users are also
covered elsewhere.

The `update-dbix` part got a just command to make it discoverable again.
2024-08-18 14:47:09 +02:00
71rd 459aa0a598 Stream files from store instead of buffering them
When an artifact is requested from hydra the output is first copied
from the nix store into memory and then sent as a response, delaying
the download and taking up significant amounts of memory.

As reported in https://github.com/NixOS/hydra/issues/1357

Instead of calling a command and blocking while reading in the entire
output, this adds read_into_socket(). the function takes a
command, starting a subprocess with that command, returning a file
descriptor attached to stdout.
This file descriptor is then by responsebuilder of Catalyst to steam
the output directly
2024-08-13 22:09:48 +02:00
eldritch horrors f1b552ecbf update flake locks, fix compile errors 2024-08-12 22:45:34 +02:00
Pierre Bourdon 4b107e6ff3
hydra-eval-jobset: pass --workers and --max-memory-size to n-e-j
Some checks failed
Test / tests (push) Has been cancelled
Lost in the h-e-j -> n-e-j migration, causing evaluation to always be
single threaded and limited to 4GiB RAM. Follow the config settings like
h-e-j used to do (via C++ code).
2024-07-22 23:16:29 +02:00
Pierre Bourdon 4b886d9c45
autotools -> meson
Some checks are pending
Test / tests (push) Waiting to run
There are some known regressions regarding local testing setups - since
everything was kinda half written with the expectation that build dir =
source dir (which should not be true anymore). But everything builds and
the test suite runs fine, after several hours spent debugging random
crashes in libpqxx with MALLOC_PERTURB_...
2024-07-22 22:30:41 +02:00
Pierre Bourdon fbb894af4e
static: de-bundle vendored dependencies
The current way this whole build works is incompatible with having a
separate build dir, or at least with having a separate build dir. To be
improved in the future - maybe minimize the dependencies a bit. But this
isn't so much data that we really have to care.
2024-07-22 16:30:13 +02:00
Niklas Hambüchen 8a984efaef
renderInputDiff: Increase git hash length 8 -> 12
Some checks failed
Test / tests (push) Has been cancelled
See investigation on lengths required to be conflict-free in practice:

https://github.com/NixOS/hydra/pull/1258#issuecomment-1321891677
2024-07-21 12:23:29 +02:00
Pierre Bourdon abc9f11417
queue runner: fix store URI args being written to the SSH hosts file
Some checks are pending
Test / tests (push) Waiting to run
2024-07-20 16:09:07 +02:00
Pierre Bourdon 9a4a5dd624
jobset-eval: fix actions not showing up sometimes for new jobs
Some checks are pending
Test / tests (push) Waiting to run
New jobs have their "new" status take precedence over them being
"failed" or "queued", which means actions that can act on "failed" or
"queued" jobs weren't shown to the user when they could only act on
"new" jobs.
2024-07-20 13:09:39 +02:00
Pierre Bourdon b0e9b4b2f9
hydra-eval-jobset: incrementally ingest eval results
Some checks failed
Test / tests (push) Has been cancelled
Test / tests (pull_request) Has been cancelled
nix-eval-jobs streams output, unlike hydra-eval-jobs. Now that we've
migrated, we can use this to:

1. Use less RAM by avoiding buffering a whole eval's worth of metadata
   into a Perl string and an array of JSON objects.
2. Make evals latency a bit lower by allowing the queue runner to start
   ingesting builds faster.
2024-07-17 12:05:41 +02:00
Pierre Bourdon 370a4bf138
treewide: start removing tests related to constituents
Some checks failed
Test / tests (push) Waiting to run
Test / tests (pull_request) Has been cancelled
The feature cannot easily be ported to nix-eval-jobs since it requires
deep integration into the evaluator, and h.n.o doesn't use it. Later
more of this will be ripped out.
2024-07-17 08:31:19 +02:00
Pierre Bourdon ed7c58708c
hydra-eval-jobs: remove, replaced by nix-eval-jobs 2024-07-17 08:31:19 +02:00
Pierre Bourdon 6d4ccff43c
hydra-eval-jobset: use nix-eval-jobs instead of hydra-eval-jobs 2024-07-17 08:31:19 +02:00
Pierre Bourdon 6195cec6a3
hydra-queue-runner: adjust for Lix generators related changes 2024-07-16 04:35:44 +02:00
Pierre Bourdon fb9e29d4d0
queue runner: fix nullptr deref on build exception after releasing a machine reservation
Some checks failed
Test / tests (push) Has been cancelled
2024-07-13 06:12:35 +02:00
Pierre Bourdon a9a2679793
hydra-evaluator: fix regression from e9d0a3 (inverted assertion)
Some checks failed
Test / tests (push) Has been cancelled
2024-06-24 21:41:40 +02:00
Pierre Bourdon e9d0a3a754
Update to latest Lix main
Some checks are pending
Test / tests (push) Waiting to run
2024-06-24 20:25:35 +02:00
leo60228 cbe527a3ee
util.hh split
Some checks failed
Test / tests (push) Has been cancelled
2024-06-11 11:27:43 -04:00
leo60228 ca98f42b39
nixexpr -> lixexpr 2024-06-11 11:13:42 -04:00
Maximilian Bosch aff354e32f
Don't send gitea status update when build is started
This was the source of a flaky test because sometimes hydra-notify was
quick enough to send out `buildStarted` and sometimes it apparently
wasn't which was quickly spottable with `nix build --rebuild`.

Removing that status update doesn't make a difference functionally,
gitea doesn't differentiate between "queued" and "running", so we send
the same status ("pending") out on both events, so we'd even safe one
avoidable request.

(cherry picked from commit 806c375c338b4e6a1d276b96994018908784bf11)
2024-06-10 17:40:02 +02:00
leo60228 a053ef8fdf
lix api changes
Some checks are pending
Test / tests (push) Waiting to run
2024-05-10 15:00:54 -04:00
leo60228 803b8ee731
Revert "Update to Nix 2.19"
This reverts commit c922e73c11.
2024-05-10 14:47:11 -04:00
Pierre Bourdon b8d03adaf4
queue runner: attempt at slightly smarter scheduling criteria
Instead of just going for "whatever is the oldest build we know of",
use the following first:

- Is the step more constrained? If so, schedule it first to avoid
  filling up "more desirable" build slots with less constrained builds.

- Does the step have more dependents? If so, schedule it first to try
  and maximize open parallelism and breadth of scheduling options.
2024-04-21 17:36:16 +02:00
Pierre Bourdon ee1a7a7813
web: serveFile: also serve a CSP putting served HTML in its own origin 2024-04-21 16:14:24 +02:00
Pierre Bourdon 5c3e508e55
queue-runner: release machine reservation while copying outputs
This allows for better builder usage when the queue runner is busy. To
avoid running into uncontrollable imbalances between builder/queue
runner, we only release the machine reservation after the local
throttler has found a slot to start copying the outputs for that build.
2024-04-21 01:55:19 +02:00
Pierre Bourdon 026e3a3103
queue-runner: switch to pseudorandom ordering of builds processing
We don't rely on sequential / monotonic build IDs processing anymore, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
2024-04-20 23:05:26 +02:00
Pierre Bourdon 6606a7f86e
queue runner: introduce some parallelism for remote paths lookup
Each output for a given step being ingested is looked up in parallel,
which should basically multiply the speed of builds ingestion by the
average number of outputs per derivation.
2024-04-20 22:28:18 +02:00
Pierre Bourdon f31b95d371
queue-runner: reduce the time between queue monitor restarts
This will induce more DB queries (though these are fairly cheap), but at
the benefit of processing bumps within 1m instead of within 10m.
2024-04-20 16:58:10 +02:00
Pierre Bourdon 54f8daf6b1
queue-runner: remove id > X from new builds query
Running the query with/without it shows that it makes no difference to
postgres, since there's an index on finished=0 already. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
2024-04-20 16:53:52 +02:00
Pierre Bourdon cc6bafe538
queue-runner: add prom metrics to allow detecting internal bottlenecks
By looking at the ratio of running vs. waiting for the dispatcher and
the queue monitor, we should get better visibility into what hydra is
currently bottlenecked on.

There are other side effects we can try to measure to get to the same
result, but having a simple way doesn't cost us much.
2024-04-20 16:48:03 +02:00
Pierre Bourdon 6189ba9c5e
web: replace 'errormsg' with 'errormsg IS NULL' in most cases
This is implement in an extremely hacky way due to poor DBIx feature
support. Ideally, what we'd need is a way to tell DBIx to ignore the
errormsg column unless explicitly requested, and to automatically add a
computed 'errormsg IS NULL' column in others. Since it does not support
that, this commit instead hacks some support via method overrides while
taking care to not break anything obvious.
2024-04-12 20:14:09 +02:00
Pierre Bourdon 258e9314a9
web: include current step status on /machines 2024-04-11 17:15:58 +02:00
Pierre Bourdon a51bd392a2
queue-runner: limit parallelism of CPU intensive operations
My current theory is that running more parallel xz than available CPU
cores is reducing our overall throughput by requiring more scheduling
overhead and more cache thrashing.
2024-04-11 16:43:01 +02:00
Maximilian Bosch a596d6c3c1 Only show stepname if it doesn't equal the name of the drv
When building e.g. nixpkgs, the "Running builds" view will mostly look
like this

    hello.x86_64-linux (Build of hello-X.Y)
    exa.x86_64-linux (Build of exa-X.Y)
    ...

This doesn't provide any useful information. Showing the step name only
makes sense if it's not a child of the job's derivation. With this
patch, that information will only be shown if the drv name (i.e. w/o
`/nix/store/` prefix, .drv ext & hash) is not equal to the drv name of
the job itself (build.nixname).
2024-03-18 18:46:01 +01:00
Maximilian Bosch 415f9f2daa Running builds view: show build step names
When using Hydra to build machine configurations, you'll often see
"nixosConfigurations.foo" five times, i.e. for each build step being
run. This isn't very helpful I think because in such a case, a single
build step can also be compiling the Linux kernel.

This change also fetches the `drvpath` and `type` from the `buildsteps`
relation. We're already joining it, so this doesn't make much difference
(confirmed via query logging that this doesn't cause extra SQL queries).

Unfortunately build steps don't have a human readable name, so I'm
deriving it from the drvpath by stripping away the hash (assuming that
it'll never contain a `-` and that `/nix/store/` is used as prefix). I
decided against using the Nix bindings for that to avoid too much
overhead due to store operations for each build step.
2024-03-18 18:46:01 +01:00
Maximilian Bosch 9b465e7a67 Make "timed out" and "log limit exceeded" builds aborted
In 73694087a0 I gave builds that failed
because of a timeout or exceeded log limit a stop sign and I stand by
that reasoning: with that it's possible to distinguish between actual
build failures and rather transient things such as timeouts.

Back then I considered it a feature that these are shown in a different
tab, but I don't think that's a good idea anymore. When using a jobset to
e.g. track the regressions from a mass rebuild (like a compiler or gcc
update), "Newly failed builds" should exclusively display regressions (and
flaky builds of course, not much I can do about that).

Also, when a bunch of builds fail in such a jobset because of e.g. a
broken connection to a builder that results in a timeout, I want to be
able to restart them all w/o rebuilding actual regressions.

To make it clear that we not only have "Aborted" builds in the tab, I
renamed the label to "Aborted / Timed out".
2024-03-16 22:10:40 +01:00