We've seen many failures on ofborg, and a lot of them ultimately appear to come down to
a timeout being hit, resulting in something like this:
Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif.
Hopefully this resolves it for most cases.
I've done some endurance testing and this helps a lot.
Some other commands also regularly time out under high load:
- hydra-init
- hydra-create-user
- nix-store --delete
This should address most issues with tests randomly failing.
Used the following script for endurance testing:
```
import os
import subprocess

run_counter = 0
fail_counter = 0

while True:
    try:
        run_counter += 1
        print(f"Starting run {run_counter}")
        # Copy the environment so we don't mutate our own os.environ.
        env = os.environ.copy()
        env["YATH_JOB_COUNT"] = "20"
        result = subprocess.run(["perl", "t/test.pl"], env=env)
        if result.returncode != 0:
            fail_counter += 1
        print(f"Finished run {run_counter}, total fail count: {fail_counter}")
    except KeyboardInterrupt:
        print(f"Finished {run_counter} runs with {fail_counter} fails")
        break
```
In case someone else wants to do it on their system :).
Note that YATH_JOB_COUNT may need to be adjusted based on your core
count. I only have 4 cores (8 threads), so for others higher numbers might
yield better results in flushing out unstable tests.
This was lost in the hydra-eval-jobs -> nix-eval-jobs migration, causing
evaluation to always be single-threaded and limited to 4GiB RAM. Follow the
config settings like hydra-eval-jobs used to do (via C++ code).
There are some known regressions regarding local testing setups - since
everything was kind of half-written with the expectation that build dir =
source dir (which should not be true anymore). But everything builds and
the test suite runs fine, after several hours spent debugging random
crashes in libpqxx with MALLOC_PERTURB_...
The current way this whole build works is incompatible with having a
separate build dir. To be improved in the future - maybe minimize the
dependencies a bit. But this isn't so much data that we really have to care.
New jobs have their "new" status take precedence over them being
"failed" or "queued", which means actions that can act on "failed" or
"queued" jobs weren't shown to the user when they could only act on
"new" jobs.
- Use extra-trusted-users to avoid overriding the default set of trusted
users and causing permission issues.
- Add hydra and hydra-www users which also need permissions.
nix-eval-jobs streams output, unlike hydra-eval-jobs. Now that we've
migrated, we can use this to:
1. Use less RAM by avoiding buffering a whole eval's worth of metadata
into a Perl string and an array of JSON objects.
2. Lower eval latency a bit by allowing the queue runner to start
ingesting builds faster (a small sketch of the streaming approach follows
this list).
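As an illustration (not Hydra's actual Perl code), here's a minimal Python sketch contrasting the two approaches. It assumes nix-eval-jobs emits one JSON object per line, and the invocation shown is just an example.
```
import json
import subprocess

cmd = ["nix-eval-jobs", "--flake", ".#hydraJobs"]  # example invocation

# Buffered: the whole eval's worth of metadata is held in memory
# before any job can be processed.
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
jobs = [json.loads(line) for line in out.splitlines() if line.strip()]

# Streamed: each job is handed off as soon as its line arrives.
with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        if line.strip():
            job = json.loads(line)
            # hand `job` to the database / queue runner here
```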
The feature cannot easily be ported to nix-eval-jobs since it requires
deep integration into the evaluator, and hydra.nixos.org doesn't use it.
More of this will be ripped out later.
Nixpkgs only contains a `hydra_unstable` package, not `hydra`, so
adjust the default accordingly, and then override it to our package in
the separate module which does that.
(cherry picked from commit e149da7b9bbc04bd0b1ca03fa0768e958cbcd40e)
Due to newer nixpkgs, there were a number of things that could be
cleaned up in the process.
(cherry picked from commit 743795b2b090a5cdfe8bd90120add8db7770086a)
This was the source of a flaky test: sometimes hydra-notify was quick
enough to send out `buildStarted` and sometimes it apparently wasn't,
which was easy to spot with `nix build --rebuild`.
Removing that status update doesn't make a difference functionally:
gitea doesn't differentiate between "queued" and "running", so we send
the same status ("pending") out on both events, and we even save one
avoidable request.
(cherry picked from commit 806c375c338b4e6a1d276b96994018908784bf11)
This is an integration test that confirms that jobset definitions from
git repositories are correctly built and that status updates are pushed to
the gitea instance. The following things needed to be fixed:
* We're still on 23.05 where gitea is marked as insecure. Not going to
update nixpkgs right now, but going for the quick fix.
* Since gitea 1.19, tokens have scopes that describe what's possible.
Not specifying the scope in the DB appears to imply that no
permissions are granted.
* Apparently we have three status updates now (for three status hooks,
queued/started/finished). No idea why that was broken before, but the
behavior still looks correct.
(cherry picked from commit ceff5c5cfeaf211691f4d1156f358a940b5ef7b4)
Instead of just going for "whatever is the oldest build we know of",
use the following criteria first (see the sketch after this list):
- Is the step more constrained? If so, schedule it first to avoid
filling up "more desirable" build slots with less constrained builds.
- Does the step have more dependents? If so, schedule it first to try
and maximize open parallelism and breadth of scheduling options.
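A rough Python sketch of that ordering as a sort key, purely illustrative
(the real scheduler is part of the C++ queue runner, and the field names
here are invented):
```
from dataclasses import dataclass

@dataclass
class Step:
    build_id: int                               # oldest build = lowest ID
    required_features: frozenset = frozenset()  # hypothetical constraint set
    dependents: int = 0                         # hypothetical dependent count

def schedule_key(step):
    # Most constrained first, then most dependents, then oldest build.
    return (-len(step.required_features), -step.dependents, step.build_id)

runnable = [
    Step(3, frozenset({"big-parallel"}), dependents=1),
    Step(1, frozenset(), dependents=10),
    Step(2, frozenset(), dependents=0),
]
print([s.build_id for s in sorted(runnable, key=schedule_key)])  # [3, 1, 2]
```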
This allows for better builder usage when the queue runner is busy. To
avoid running into uncontrollable imbalances between builder/queue
runner, we only release the machine reservation after the local
throttler has found a slot to start copying the outputs for that build.
We don't rely on sequential/monotonic processing of build IDs anymore, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
Each output for a given step being ingested is looked up in parallel,
which should basically multiply the speed of build ingestion by the
average number of outputs per derivation.
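A rough Python sketch of the idea using a thread pool (the real
implementation lives in the C++ queue runner; `lookup_output` is a
hypothetical stand-in for the per-output store/DB query):
```
from concurrent.futures import ThreadPoolExecutor

def lookup_output(output_path):
    # Hypothetical stand-in: query the store / DB for one output path.
    return {"path": output_path}

def ingest_step_outputs(output_paths):
    # Look up all outputs of one derivation concurrently instead of one by one.
    with ThreadPoolExecutor(max_workers=max(len(output_paths), 1)) as pool:
        return list(pool.map(lookup_output, output_paths))
```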
Running the query with/without it shows that it makes no difference to
postgres, since there's an index on finished=0 already. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
By looking at the ratio of time spent running vs. waiting for the
dispatcher and the queue monitor, we should get better visibility into
what Hydra is currently bottlenecked on.
There are other side effects we could try to measure to get to the same
result, but having a simple way doesn't cost us much.
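A hypothetical sketch of that kind of instrumentation (names invented;
Hydra's actual counters live in the C++ queue runner and are exposed via
its status output):
```
import time
from contextlib import contextmanager

class PhaseTimer:
    """Accumulate how long a thread spends running vs. waiting."""

    def __init__(self):
        self.seconds = {"running": 0.0, "waiting": 0.0}

    @contextmanager
    def phase(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            self.seconds[name] += time.monotonic() - start

    def busy_ratio(self):
        total = sum(self.seconds.values())
        return self.seconds["running"] / total if total else 0.0

# Usage inside a dispatcher-style loop:
#   with timer.phase("waiting"): item = work_queue.get()
#   with timer.phase("running"): process(item)
```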
This is implemented in an extremely hacky way due to poor DBIx feature
support. Ideally, what we'd need is a way to tell DBIx to ignore the
errormsg column unless explicitly requested, and to automatically add a
computed 'errormsg IS NULL' column in other queries. Since it does not
support that, this commit instead hacks in some support via method
overrides while taking care to not break anything obvious.
My current theory is that running more parallel xz processes than there
are available CPU cores is reducing our overall throughput by requiring
more scheduling overhead and more cache thrashing.
When building e.g. nixpkgs, the "Running builds" view will mostly look
like this:
hello.x86_64-linux (Build of hello-X.Y)
exa.x86_64-linux (Build of exa-X.Y)
...
This doesn't provide any useful information. Showing the step name only
makes sense if it's not a child of the job's derivation. With this
patch, that information will only be shown if the drv name (i.e. without
the `/nix/store/` prefix, the `.drv` extension and the hash) is not equal
to the drv name of the job itself (build.nixname).
When using Hydra to build machine configurations, you'll often see
"nixosConfigurations.foo" five times, i.e. once for each build step being
run. This isn't very helpful, I think, because in such a case a single
build step can also be compiling the Linux kernel.
This change also fetches the `drvpath` and `type` from the `buildsteps`
relation. We're already joining it, so this doesn't make much difference
(confirmed via query logging that this doesn't cause extra SQL queries).
Unfortunately build steps don't have a human-readable name, so I'm
deriving it from the drvpath by stripping away the hash (assuming that
it'll never contain a `-` and that `/nix/store/` is used as the prefix). I
decided against using the Nix bindings for that to avoid too much
overhead due to store operations for each build step.
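For illustration, a small Python sketch of that kind of name derivation
(Hydra's actual code is Perl; this just mirrors the string manipulation
described above):
```
def step_name_from_drvpath(drvpath):
    # /nix/store/<hash>-<name>.drv -> <name>
    # Assumes the hash never contains a `-` and the prefix is /nix/store/.
    base = drvpath.removeprefix("/nix/store/").removesuffix(".drv")
    return base.split("-", 1)[1]

print(step_name_from_drvpath("/nix/store/abc123xyz-hello-2.12.drv"))
# -> hello-2.12
```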
In 73694087a0 I gave builds that failed
because of a timeout or an exceeded log limit a stop sign, and I stand by
that reasoning: with that it's possible to distinguish between actual
build failures and rather transient things such as timeouts.
Back then I considered it a feature that these are shown in a different
tab, but I don't think that's a good idea anymore. When using a jobset to
e.g. track the regressions from a mass rebuild (like a compiler or gcc
update), "Newly failed builds" should exclusively display regressions (and
flaky builds of course, not much I can do about that).
Also, when a bunch of builds fail in such a jobset because of e.g. a
broken connection to a builder that results in a timeout, I want to be
able to restart them all w/o rebuilding actual regressions.
To make it clear that we not only have "Aborted" builds in the tab, I
renamed the label to "Aborted / Timed out".
Closes #1336
When restarting postgresql, the connections are still reused in
`hydra-queue-runner`, causing errors like this
main thread: Lost connection to the database server.
queue monitor: Lost connection to the database server.
and no more builds get processed.
`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.
If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be restarted immediately
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) indefinitely. To me that seems reasonable, i.e.
retrying DB connections in a long-running process. If this doesn't work
out, the monitoring should fire anyway because the queue fills up, but
I'm open to discussing that.
Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.
Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections) and whenever a new one is requested, an idle
slot is provided or a new one is created (when N slots are active, it
waits until one slot is free). The issue in the code here is that
whenever an error is encountered, the slot is released, but the same
broken connection will be reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now done
when `pqxx::broken_connection` is caught.
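To illustrate the idea only (Hydra's actual code is C++ using Nix's
`Pool` and libpqxx), here's a minimal Python sketch of a pool that
replaces a slot instead of returning the broken connection to the free
list:
```
import queue

class BrokenConnectionError(Exception):
    """Stand-in for pqxx::broken_connection."""

class ConnectionPool:
    def __init__(self, create_conn, size):
        self._create = create_conn        # factory for new connections
        self._idle = queue.Queue()        # idle slots
        for _ in range(size):
            self._idle.put(create_conn())

    def with_connection(self, fn):
        conn = self._idle.get()           # blocks until a slot is free
        try:
            return fn(conn)
        except BrokenConnectionError:
            # The markBad equivalent: don't reuse the broken connection;
            # replace the slot with a fresh one instead.
            conn = self._create()
            raise
        finally:
            self._idle.put(conn)
```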
It's a pet peeve of mine that, when logging into my personal Hydra, I
always have to press the button rather than hitting Return after entering
my password.
The reason for that is that the form doesn't have a "submit" button; so
far it only listened to the "click" event. A submit button does the same
and additionally lets you hit Return.