Commit graph

4193 commits

Author SHA1 Message Date
Rick van Schijndel
8858abb1a6
t/test.pl: increase event-timeout, set qvf
Only log issues/failures when something's actually up.
It has irked me for a long time that so much output came
out of running the tests, this seems to silence it.
It does hide some warnings, but I think it makes the output
so much more readable that it's worth the tradeoff.

Helps for highly parallel running of jobs, sometimes they'd not give output for a while.
Setting this timeout higher appears to help.
Not completely sure if this is the right place to do it, but it works fine for me.
2024-08-11 16:08:35 +02:00
Rick van Schijndel
ef619eca99
t: increase timeouts for slow commands with high load
We've seen many failures on ofborg; a lot of them ultimately appear to come down to
a timeout being hit, resulting in something like this:

Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif.

Hopefully this resolves it for most cases.
I've done some endurance testing and this helps a lot.
Some other commands also regularly time out under high load:

- hydra-init
- hydra-create-user
- nix-store --delete

This should address most issues with tests randomly failing.

Used the following script for endurance testing:

```
import os
import subprocess

run_counter = 0
fail_counter = 0

# Copy the environment so this process's own environment stays untouched.
env = os.environ.copy()
env["YATH_JOB_COUNT"] = "20"

while True:
    try:
        run_counter += 1
        print(f"Starting run {run_counter}")
        # Run the whole test suite once; a non-zero exit code counts as a failure.
        result = subprocess.run(["perl", "t/test.pl"], env=env)
        if result.returncode != 0:
            fail_counter += 1
        print(f"Finish run {run_counter}, total fail count: {fail_counter}")
    except KeyboardInterrupt:
        # Ctrl-C ends the endurance run and prints the totals.
        print(f"Finished {run_counter} runs with {fail_counter} fails")
        break
```

In case someone else wants to do it on their system :).
Note that YATH_JOB_COUNT may need to be changed loosely based on your
cores.
I only have 4 cores (8 threads), so for others higher numbers might
yield better results in hashing out unstable tests.
2024-08-11 16:08:09 +02:00
marius david
41dfa0e443
Document the default user and port in hacking.md 2024-08-11 16:06:08 +02:00
4b107e6ff3
hydra-eval-jobset: pass --workers and --max-memory-size to n-e-j
Lost in the h-e-j -> n-e-j migration, causing evaluation to always be
single threaded and limited to 4GiB RAM. Follow the config settings like
h-e-j used to do (via C++ code).
2024-07-22 23:16:29 +02:00
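A minimal sketch of what 4b107e6ff3 describes: forwarding the configured worker count and memory limit to nix-eval-jobs. The `--workers` and `--max-memory-size` flags are named in the commit; the config key names and the helper itself are hypothetical, and Python is used here only for illustration.

```python
def nix_eval_jobs_args(config: dict) -> list[str]:
    """Build a nix-eval-jobs argument list from a config mapping.

    The key names below are assumptions for illustration, not taken
    from the commit.
    """
    args = ["nix-eval-jobs"]
    workers = config.get("evaluator_workers")                # hypothetical key
    max_memory = config.get("evaluator_max_memory_size")     # hypothetical key
    # Only pass a flag when the setting is present, so the tool's own
    # defaults apply otherwise.
    if workers is not None:
        args += ["--workers", str(workers)]
    if max_memory is not None:
        args += ["--max-memory-size", str(max_memory)]
    return args
```

Omitting both settings leaves the evaluator on its defaults, which per the commit message is the single-threaded, 4GiB-limited behaviour.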
4b886d9c45
autotools -> meson
There are some known regressions regarding local testing setups - since
everything was kinda half written with the expectation that build dir =
source dir (which should not be true anymore). But everything builds and
the test suite runs fine, after several hours spent debugging random
crashes in libpqxx with MALLOC_PERTURB_...
2024-07-22 22:30:41 +02:00
fbb894af4e
static: de-bundle vendored dependencies
The current way this whole build works is incompatible with having a
separate build dir. To be
improved in the future - maybe minimize the dependencies a bit. But this
isn't so much data that we really have to care.
2024-07-22 16:30:13 +02:00
Niklas Hambüchen
8a984efaef
renderInputDiff: Increase git hash length 8 -> 12
See investigation on lengths required to be conflict-free in practice:

https://github.com/NixOS/hydra/pull/1258#issuecomment-1321891677
2024-07-21 12:23:29 +02:00
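As a rough illustration of why 8a984efaef lengthens the abbreviated hash, the birthday bound estimates how likely it is that at least two of `n` objects share the same hex prefix. This is a sketch that assumes uniformly distributed hashes; the object counts are made up for illustration and are not taken from the linked investigation.

```python
import math


def collision_probability(num_objects: int, hex_digits: int) -> float:
    """Birthday-bound approximation for at least one prefix collision."""
    space = 16 ** hex_digits  # number of distinct prefixes of this length
    exponent = -num_objects * (num_objects - 1) / (2 * space)
    # 1 - exp(x), computed with expm1 for accuracy when x is close to zero.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for digits in (8, 12):
        p = collision_probability(1_000_000, digits)
        print(f"{digits} hex digits, 1,000,000 objects: {p:.6f}")
```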
abc9f11417
queue runner: fix store URI args being written to the SSH hosts file 2024-07-20 16:09:07 +02:00
9a4a5dd624
jobset-eval: fix actions not showing up sometimes for new jobs
New jobs have their "new" status take precedence over them being
"failed" or "queued", which means actions that can act on "failed" or
"queued" jobs weren't shown to the user when they could only act on
"new" jobs.
2024-07-20 13:09:39 +02:00
ac406a9175
nixos-modules: hydra-queue-runner fix network-online.target eval warning 2024-07-19 09:13:32 +02:00
73616aa0d9
nixos-module: don't force Nix GC to keep outputs
This isn't actually needed (h.n.o even overrides it!).

Fix the use of deprecated `gc-keep-derivations` alias along the way.
2024-07-17 13:21:58 +02:00
d33fc08341
nixos-module: fix trusted users
- Use extra-trusted-users to avoid overriding the default set of trusted
  users and causing permission issues.
- Add hydra and hydra-www users which also need permissions.
2024-07-17 13:20:37 +02:00
b0e9b4b2f9
hydra-eval-jobset: incrementally ingest eval results
nix-eval-jobs streams output, unlike hydra-eval-jobs. Now that we've
migrated, we can use this to:

1. Use less RAM by avoiding buffering a whole eval's worth of metadata
   into a Perl string and an array of JSON objects.
2. Make eval latency a bit lower by allowing the queue runner to start
   ingesting builds faster.
2024-07-17 12:05:41 +02:00
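The incremental ingestion in b0e9b4b2f9 can be sketched as a generator that handles one job at a time instead of buffering the whole evaluation. This assumes the evaluator emits one JSON object per line; the `attr` field in the usage note is illustrative, and Python stands in for the Perl code the commit actually touches.

```python
import json
from typing import Iterable, Iterator


def iter_jobs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded job per non-empty line, as soon as it arrives."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        yield json.loads(line)
```

Passing the stdout of a running subprocess as `lines` means each job can be recorded while the evaluation is still in progress, for example `for job in iter_jobs(proc.stdout): print(job["attr"])`.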
370a4bf138
treewide: start removing tests related to constituents
The feature cannot easily be ported to nix-eval-jobs since it requires
deep integration into the evaluator, and h.n.o doesn't use it. Later
more of this will be ripped out.
2024-07-17 08:31:19 +02:00
ed7c58708c
hydra-eval-jobs: remove, replaced by nix-eval-jobs 2024-07-17 08:31:19 +02:00
6d4ccff43c
hydra-eval-jobset: use nix-eval-jobs instead of hydra-eval-jobs 2024-07-17 08:31:19 +02:00
684cc50d86
flake: add nix-eval-jobs as input 2024-07-17 08:17:32 +02:00
6195cec6a3
hydra-queue-runner: adjust for Lix generators related changes 2024-07-16 04:35:44 +02:00
1fbfed8162
flake: rename 'nix' input to 'lix'
For consistency with other Lix forks of Nix ecosystems projects, e.g.
nix-eval-jobs.
2024-07-16 03:59:38 +02:00
fb9e29d4d0
queue runner: fix nullptr deref on build exception after releasing a machine reservation 2024-07-13 06:12:35 +02:00
05d620a54f
flake.lock: Update
Flake lock file updates:

• Updated input 'nix':
    'git+https://git@git.lix.systems/lix-project/lix?ref=refs/heads/main&rev=4c3d93611f2848c56ebc69c85f2b1e18001ed3c7' (2024-06-24)
  → 'git+https://git@git.lix.systems/lix-project/lix?ref=refs/heads/main&rev=4b109ec1a8fc4550150f56f0f46f2f41d844bda8' (2024-07-11)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e4509b3a560c87a8d4cb6f9992b8915abf9e36d8' (2024-06-23)
  → 'github:NixOS/nixpkgs/a046c1202e11b62cbede5385ba64908feb7bfac4' (2024-07-11)
2024-07-13 03:08:50 +02:00
a9a2679793
hydra-evaluator: fix regression from e9d0a3 (inverted assertion) 2024-06-24 21:41:40 +02:00
e9d0a3a754
Update to latest Lix main 2024-06-24 20:25:35 +02:00
cbe527a3ee
util.hh split 2024-06-11 11:27:43 -04:00
ca98f42b39
nixexpr -> lixexpr 2024-06-11 11:13:42 -04:00
John Ericson
62bc5b54b2
Try again to ensure hydra module is usable
Nixpkgs only contains a `hydra_unstable`, not `hydra`, package, so
adjust the default accordingly, and then override it to our package in
the separate module which does that.

(cherry picked from commit e149da7b9bbc04bd0b1ca03fa0768e958cbcd40e)
2024-06-10 17:40:02 +02:00
John Ericson
c98017b823
Factor out NixOS tests, and clean up
Due to newer nixpkgs, there were a number of things that could be
cleaned up in the process.

(cherry picked from commit 743795b2b090a5cdfe8bd90120add8db7770086a)
2024-06-10 17:40:02 +02:00
John Ericson
ebae7a31fe
Remove PrometheusTiny from overlay
It has been in Nixpkgs for a good while now.

(cherry picked from commit 92155f9a07f5fe32e0778e474e7313997811e635)
2024-06-10 17:40:02 +02:00
aff354e32f
Don't send gitea status update when build is started
This was the source of a flaky test because sometimes hydra-notify was
quick enough to send out `buildStarted` and sometimes it apparently
wasn't which was quickly spottable with `nix build --rebuild`.

Removing that status update doesn't make a difference functionally,
gitea doesn't differentiate between "queued" and "running", so we send
the same status ("pending") out on both events, so we'd even save one
avoidable request.

(cherry picked from commit 806c375c33)
2024-06-10 17:40:02 +02:00
925dc7544a
flake: fix gitea integration test
This is an integration test that confirms that jobset definitions from
git repositories are correctly built and status updates pushed to the
gitea instance. The following things needed to be fixed:

* We're still on 23.05 where gitea is marked as insecure. Not going to
  update nixpkgs right now, but going for the quick fix.
* Since gitea 1.19 tokens have scopes that describe what's possible.
  Not specifying the scope in the DB appears to imply that no
  permissions are granted.
* Apparently we have three status updates now (for three status hooks,
  queued/started/finished). No idea why that was broken before, but the
  behavior still looks correct.

(cherry picked from commit ceff5c5cfe)
2024-06-10 17:40:02 +02:00
a053ef8fdf
lix api changes 2024-05-10 15:00:54 -04:00
803b8ee731
Revert "Update to Nix 2.19"
This reverts commit c922e73c11.
2024-05-10 14:47:11 -04:00
249620b49e
use lix 2024-05-10 12:49:27 -04:00
b8d03adaf4
queue runner: attempt at slightly smarter scheduling criteria
Instead of just going for "whatever is the oldest build we know of",
use the following first:

- Is the step more constrained? If so, schedule it first to avoid
  filling up "more desirable" build slots with less constrained builds.

- Does the step have more dependents? If so, schedule it first to try
  and maximize open parallelism and breadth of scheduling options.
2024-04-21 17:36:16 +02:00
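The ordering from b8d03adaf4 can be sketched as a sort key: most constrained first, then most dependents, then oldest. The commit does not say how "constrained" is measured, so the `eligible_machines` field is a hypothetical stand-in, as are the other names here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    eligible_machines: int  # hypothetical: fewer eligible machines = more constrained
    dependents: int         # number of steps waiting on this one
    build_id: int           # lower id = older build


def schedule_order(steps: list[Step]) -> list[Step]:
    """Sort steps: most constrained, then most dependents, then oldest."""
    return sorted(
        steps,
        key=lambda s: (s.eligible_machines, -s.dependents, s.build_id),
    )
```

Using a single tuple key keeps the tie-breaking explicit: the age of the build only matters when the first two criteria are equal.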
ee1a7a7813
web: serveFile: also serve a CSP putting served HTML in its own origin 2024-04-21 16:14:24 +02:00
5c3e508e55
queue-runner: release machine reservation while copying outputs
This allows for better builder usage when the queue runner is busy. To
avoid running into uncontrollable imbalances between builder/queue
runner, we only release the machine reservation after the local
throttler has found a slot to start copying the outputs for that build.
2024-04-21 01:55:19 +02:00
026e3a3103
queue-runner: switch to pseudorandom ordering of builds processing
We don't rely on sequential / monotonic build IDs processing anymore, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
2024-04-20 23:05:26 +02:00
6606a7f86e
queue runner: introduce some parallelism for remote paths lookup
Each output for a given step being ingested is looked up in parallel,
which should basically multiply the speed of builds ingestion by the
average number of outputs per derivation.
2024-04-20 22:28:18 +02:00
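A minimal sketch of the idea in 6606a7f86e: look up every output of a step concurrently rather than one after another. The `is_valid_path` callback is a hypothetical stand-in for the remote store query, and the thread pool is just one way to get the parallelism; the queue runner itself is C++.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def lookup_outputs(
    outputs: list[str],
    is_valid_path: Callable[[str], bool],  # hypothetical remote lookup
) -> dict[str, bool]:
    """Query all outputs of one step in parallel and collect the results."""
    if not outputs:
        return {}
    # One worker per output, so a step's lookups overlap instead of
    # running back to back.
    with ThreadPoolExecutor(max_workers=len(outputs)) as pool:
        results = pool.map(is_valid_path, outputs)
        return dict(zip(outputs, results))
```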
f31b95d371
queue-runner: reduce the time between queue monitor restarts
This will induce more DB queries (though these are fairly cheap), but at
the benefit of processing bumps within 1m instead of within 10m.
2024-04-20 16:58:10 +02:00
54f8daf6b1
queue-runner: remove id > X from new builds query
Running the query with/without it shows that it makes no difference to
postgres, since there's an index on finished=0 already. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
2024-04-20 16:53:52 +02:00
cc6bafe538
queue-runner: add prom metrics to allow detecting internal bottlenecks
By looking at the ratio of running vs. waiting for the dispatcher and
the queue monitor, we should get better visibility into what hydra is
currently bottlenecked on.

There are other side effects we can try to measure to get to the same
result, but having a simple way doesn't cost us much.
2024-04-20 16:48:03 +02:00
6189ba9c5e
web: replace 'errormsg' with 'errormsg IS NULL' in most cases
This is implemented in an extremely hacky way due to poor DBIx feature
support. Ideally, what we'd need is a way to tell DBIx to ignore the
errormsg column unless explicitly requested, and to automatically add a
computed 'errormsg IS NULL' column in others. Since it does not support
that, this commit instead hacks some support via method overrides while
taking care to not break anything obvious.
2024-04-12 20:14:09 +02:00
258e9314a9
web: include current step status on /machines 2024-04-11 17:15:58 +02:00
a51bd392a2
queue-runner: limit parallelism of CPU intensive operations
My current theory is that running more parallel xz than available CPU
cores is reducing our overall throughput by requiring more scheduling
overhead and more cache thrashing.
2024-04-11 16:43:01 +02:00
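The limit described in a51bd392a2 can be sketched with a semaphore sized to the number of CPU cores, so that no more compression jobs run at once than there are cores. The class and method names are hypothetical, and this is not how the C++ queue runner implements it.

```python
import os
import threading


class CpuLimiter:
    """Allow at most `max_parallel` CPU-intensive tasks to run at once."""

    def __init__(self, max_parallel=None):
        # Default to the number of available cores (at least one).
        self._slots = threading.BoundedSemaphore(max_parallel or os.cpu_count() or 1)

    def run(self, task, *args):
        # Block until a slot is free, then run the task while holding it.
        with self._slots:
            return task(*args)
```

Threads that call `limiter.run(compress, path)` queue up on the semaphore instead of competing for the same cores.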
a596d6c3c1
Only show stepname if it doesn't equal the name of the drv
When building e.g. nixpkgs, the "Running builds" view will mostly look
like this

    hello.x86_64-linux (Build of hello-X.Y)
    exa.x86_64-linux (Build of exa-X.Y)
    ...

This doesn't provide any useful information. Showing the step name only
makes sense if it's not a child of the job's derivation. With this
patch, that information will only be shown if the drv name (i.e. w/o
`/nix/store/` prefix, .drv ext & hash) is not equal to the drv name of
the job itself (build.nixname).
2024-03-18 18:46:01 +01:00
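The rule from a596d6c3c1 can be sketched as follows: strip the `/nix/store/` prefix, the hash, and the `.drv` extension from the step's drv path, and show the result only when it differs from the job's own drv name. The function names and the sample paths are hypothetical, and the sketch assumes the hash never contains a `-`.

```python
def drv_name(drvpath: str) -> str:
    """Reduce '/nix/store/<hash>-<name>.drv' to '<name>'."""
    base = drvpath.removeprefix("/nix/store/")
    _hash, _, name = base.partition("-")  # assumes the hash contains no '-'
    return name.removesuffix(".drv")


def step_label(job: str, build_nixname: str, step_drvpath: str) -> str:
    """Append the step name only when it adds information."""
    step = drv_name(step_drvpath)
    if step == build_nixname:
        return job
    return f"{job} (Build of {step})"
```

With this rule a job whose only running step is its own derivation shows just the job name, while a machine configuration that is currently compiling a kernel shows the kernel's drv name next to it.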
415f9f2daa
Running builds view: show build step names
When using Hydra to build machine configurations, you'll often see
"nixosConfigurations.foo" five times, i.e. for each build step being
run. This isn't very helpful I think because in such a case, a single
build step can also be compiling the Linux kernel.

This change also fetches the `drvpath` and `type` from the `buildsteps`
relation. We're already joining it, so this doesn't make much difference
(confirmed via query logging that this doesn't cause extra SQL queries).

Unfortunately build steps don't have a human readable name, so I'm
deriving it from the drvpath by stripping away the hash (assuming that
it'll never contain a `-` and that `/nix/store/` is used as prefix). I
decided against using the Nix bindings for that to avoid too much
overhead due to store operations for each build step.
2024-03-18 18:46:01 +01:00
9b465e7a67
Make "timed out" and "log limit exceeded" builds aborted
In 73694087a0 I gave builds that failed
because of a timeout or exceeded log limit a stop sign and I stand by
that reasoning: with that it's possible to distinguish between actual
build failures and rather transient things such as timeouts.

Back then I considered it a feature that these are shown in a different
tab, but I don't think that's a good idea anymore. When using a jobset to
e.g. track the regressions from a mass rebuild (like a compiler or gcc
update), "Newly failed builds" should exclusively display regressions (and
flaky builds of course, not much I can do about that).

Also, when a bunch of builds fail in such a jobset because of e.g. a
broken connection to a builder that results in a timeout, I want to be
able to restart them all w/o rebuilding actual regressions.

To make it clear that we not only have "Aborted" builds in the tab, I
renamed the label to "Aborted / Timed out".
2024-03-16 22:10:40 +01:00
9b62c52e5c
hydra-queue-runner: drop broken connections from pool
Closes #1336

When restarting postgresql, the connections are still reused in
`hydra-queue-runner` causing errors like this

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.

If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) infinitely. To me that seems reasonable, i.e. to
retry DB connections on a long-running process. If this doesn't work
out, the monitoring should fire anyways because the queue fills up, but
I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections) and whenever a new one is requested, an idle
slot is provided or a new one is created (when N slots are active, the
caller waits until one slot is free). The issue in the code here is that
whenever an error is encountered, the slot is released, but the same
broken connection is reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now
done whenever `pqxx::broken_connection` is caught.
2024-03-16 22:10:40 +01:00
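The pool behaviour described in 9b62c52e5c can be sketched in a few lines: a handle that was marked bad is dropped on release instead of going back to the idle list. This is an illustrative Python model, not Nix's C++ `Pool`; all class and function names below are hypothetical.

```python
import threading
from contextlib import contextmanager


class BrokenConnection(Exception):
    """Stand-in for pqxx::broken_connection."""


class Handle:
    def __init__(self, conn):
        self.conn = conn
        self.bad = False

    def mark_bad(self):
        self.bad = True


class Pool:
    def __init__(self, factory, max_size):
        self._factory = factory
        self._idle = []
        self._lock = threading.Lock()
        self._slots = threading.Semaphore(max_size)

    @contextmanager
    def get(self):
        self._slots.acquire()  # wait until one of the N slots is free
        handle = None
        try:
            with self._lock:
                conn = self._idle.pop() if self._idle else None
            # Reuse an idle connection, or create a new one.
            handle = Handle(conn if conn is not None else self._factory())
            yield handle
        finally:
            # A connection marked bad is dropped; healthy ones are reused.
            if handle is not None and not handle.bad:
                with self._lock:
                    self._idle.append(handle.conn)
            self._slots.release()


def run_query(pool, query):
    with pool.get() as handle:
        try:
            return query(handle.conn)
        except BrokenConnection:
            handle.mark_bad()  # make sure this connection is never reused
            raise
```

Without the `mark_bad` call, the broken connection would return to the idle list and be handed out again on the next request, which is the failure mode the commit message describes.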
ef6be80f54
Use submit event in login form
It's a pet peeve of mine when logging into my personal Hydra that I
always have to press the button rather than hitting Return after entering
my password.

The reason is that the form doesn't have a "submit" button; so far
it always listened for the "click" event. The "submit" event covers that,
and you can alternatively hit Return.
2024-03-16 22:10:40 +01:00
969eb3eeac
urlencode drv names when fetching logs
Otherwise names with special characters like + break things.
2024-03-16 22:10:40 +01:00