hydra

Author	SHA1	Message	Date
Pierre Bourdon	026e3a3103	queue-runner: switch to pseudorandom ordering of builds processing We don't rely on sequential / monotonic build IDs processing anymore, so randomizing actually has the advantage of mixing builds for different systems together, to avoid only one chunk of builds for a single system getting processed while builders for other systems are starved.	2024-04-20 23:05:26 +02:00
Pierre Bourdon	6606a7f86e	queue runner: introduce some parallelism for remote paths lookup Each output for a given step being ingested is looked up in parallel, which should basically multiply the speed of builds ingestion by the average number of outputs per derivation.	2024-04-20 22:28:18 +02:00
Pierre Bourdon	f31b95d371	queue-runner: reduce the time between queue monitor restarts This will induce more DB queries (though these are fairly cheap), but with the benefit of processing bumps within 1m instead of within 10m.	2024-04-20 16:58:10 +02:00
Pierre Bourdon	54f8daf6b1	queue-runner: remove id > X from new builds query Running the query with/without it shows that it makes no difference to postgres, since there's an index on finished=0 already. This allows a few simplifications, but also paves the way towards running multiple parallel monitor threads in the future.	2024-04-20 16:53:52 +02:00
Pierre Bourdon	cc6bafe538	queue-runner: add prom metrics to allow detecting internal bottlenecks By looking at the ratio of running vs. waiting for the dispatcher and the queue monitor, we should get better visibility into what hydra is currently bottlenecked on. There are other side effects we can try to measure to get to the same result, but having a simple way doesn't cost us much.	2024-04-20 16:48:03 +02:00
Pierre Bourdon	6189ba9c5e	web: replace 'errormsg' with 'errormsg IS NULL' in most cases This is implemented in an extremely hacky way due to poor DBIx feature support. Ideally, what we'd need is a way to tell DBIx to ignore the errormsg column unless explicitly requested, and to automatically add a computed 'errormsg IS NULL' column in others. Since it does not support that, this commit instead hacks some support via method overrides while taking care to not break anything obvious.	2024-04-12 20:14:09 +02:00
Pierre Bourdon	258e9314a9	web: include current step status on /machines	2024-04-11 17:15:58 +02:00
Pierre Bourdon	a51bd392a2	queue-runner: limit parallelism of CPU intensive operations My current theory is that running more parallel xz than available CPU cores is reducing our overall throughput by requiring more scheduling overhead and more cache thrashing.	2024-04-11 16:43:01 +02:00
Maximilian Bosch	a596d6c3c1	Only show stepname if it doesn't equal the name of the drv When building e.g. nixpkgs, the "Running builds" view will mostly look like this hello.x86_64-linux (Build of hello-X.Y) exa.x86_64-linux (Build of exa-X.Y) ... This doesn't provide any useful information. Showing the step name only makes sense if it's not a child of the job's derivation. With this patch, that information will only be shown if the drv name (i.e. w/o `/nix/store/` prefix, .drv ext & hash) is not equal to the drv name of the job itself (build.nixname).	2024-03-18 18:46:01 +01:00
Maximilian Bosch	415f9f2daa	Running builds view: show build step names When using Hydra to build machine configurations, you'll often see "nixosConfigurations.foo" five times, i.e. for each build step being run. This isn't very helpful I think because in such a case, a single build step can also be compiling the Linux kernel. This change also fetches the `drvpath` and `type` from the `buildsteps` relation. We're already joining it, so this doesn't make much difference (confirmed via query logging that this doesn't cause extra SQL queries). Unfortunately build steps don't have a human readable name, so I'm deriving it from the drvpath by stripping away the hash (assuming that it'll never contain a `-` and that `/nix/store/` is used as prefix). I decided against using the Nix bindings for that to avoid too much overhead due to store operations for each build step.	2024-03-18 18:46:01 +01:00
Maximilian Bosch	9b465e7a67	Make "timed out" and "log limit exceeded" builds aborted In `73694087a0` I gave builds that failed because of a timeout or exceeded log limit a stop sign and I stand by that reasoning: with that it's possible to distinguish between actual build failures and rather transient things such as timeouts. Back then I considered it a feature that these are shown in a different tab, but I don't think that's a good idea anymore. When using a jobset to e.g. track the regressions from a mass rebuild (like a compiler or gcc update), "Newly failed builds" should exclusively display regressions (and flaky builds of course, not much I can do about that). Also, when a bunch of builds fail in such a jobset because of e.g. a broken connection to a builder that results in a timeout, I want to be able to restart them all w/o rebuilding actual regressions. To make it clear that we not only have "Aborted" builds in the tab, I renamed the label to "Aborted / Timed out".	2024-03-16 22:10:40 +01:00
Maximilian Bosch	9b62c52e5c	hydra-queue-runner: drop broken connections from pool Closes #1336 When restarting postgresql, the connections are still reused in `hydra-queue-runner`, causing errors like this: main thread: Lost connection to the database server. queue monitor: Lost connection to the database server. and no more builds being processed. `hydra-evaluator` doesn't have that issue since it crashes right away. We could let it retry indefinitely as well (see below), but I don't want to change too much. If the DB is still unreachable 10s later, the process will stop with a non-zero exit code because of a missing DB connection. This however isn't such a big deal because it will be immediately restarted afterwards. With the current configuration, Hydra will never give up, but restart (and retry) infinitely. To me that seems reasonable, i.e. to retry DB connections on a long-running process. If this doesn't work out, the monitoring should fire anyway because the queue fills up, but I'm open to discussing that. Please note that this isn't reproducible with the DB and the queue runner on the same machine when using `services.hydra-dev`, because of the `Requires=` dependency `hydra-queue-runner.service` -> `hydra-init.service` -> `postgresql.service` that causes the queue runner to be restarted on `systemctl restart postgresql`. Internally, Hydra uses Nix's pool data structure: it basically has N slots (here DB connections) and whenever a new one is requested, an idle slot is provided or a new one is created (when N slots are active, the caller waits until one slot is free). The issue in the code here is that whenever an error is encountered, the slot is released, but the same broken connection will be reused the next time. By using `Pool::Handle::markBad`, Nix will drop a broken slot. This is now being done when `pqxx::broken_connection` is caught.	2024-03-16 22:10:40 +01:00
Maximilian Bosch	ef6be80f54	Use `submit` event in login form It's a pet peeve of mine when logging into my personal Hydra that I always have to press the button rather than hitting Return after entering my password. The reason is that the form doesn't have a "submit" button; so far it only listened for the "click" event. The `submit` event covers that, and you can alternatively hit Return.	2024-03-16 22:10:40 +01:00
Ilya K	969eb3eeac	urlencode drv names when fetching logs Otherwise names with special characters like + break things.	2024-03-16 22:10:40 +01:00
Pierre Bourdon	18466e8326	queue-runner: try larger pipe buffer sizes	2024-03-16 22:10:40 +01:00
ajs124	6ed21490ee	lazy-load evaluation errors Closes #1362	2024-03-16 22:10:40 +01:00
John Ericson	b503280256	Add migration to drop non-null constraints	2024-01-26 11:53:58 -05:00
John Ericson	323b556dc8	Minimal CA support This version has a worse UI, but also changes the schema less: One non-null constraint is removed, but no new columns are added. Co-Authored-By: Andrea Ciceri <andrea.ciceri@autistici.org> Co-Authored-By: regnat <rg@regnat.ovh>	2024-01-26 00:34:58 -05:00
John Ericson	fcde5908d8	More CA derivations prep Again, with care not to change the schema in any way.	2024-01-25 21:32:22 -05:00
John Ericson	083ef46c12	Merge pull request #1344 from delroth/google-popup web: disable Sign in with Google popup	2024-01-25 16:36:16 -05:00
John Ericson	c64eed7d07	Simplify `StoreConfig::getDefaultSystemFeatures` call That method is now static.	2024-01-25 15:58:07 -05:00
Pierre Bourdon	6df06b089e	web: disable Sign in with Google popup	2024-01-25 09:27:46 +01:00
John Ericson	b1fa6b3aac	Use `StoreConfig::getDefaultSystemFeatures` for default machine config We have to oddly make a `StoreConfig` subclass to get it, but https://github.com/NixOS/nix/pull/9848 will fix that. The purpose of this is to ensure that, absent an explicit config, `localhost` includes `ca-derivations` and `recursive-nix` if those experimental features are enabled. Very much the complement of #1342, the previous PR.	2024-01-24 21:37:13 -05:00
John Ericson	07cb5d1b7c	Use `nix::ParsedDerivation::getRequiredSystemFeatures()` A slight dedup, and also ensures that floating CA derivations require a `ca-derivations` experimental feature. This fixes the scheduling issue that @SuperSandro2000 found.	2024-01-24 21:04:14 -05:00
John Ericson	d45e14fd43	Merge pull request #1316 from NixOS/ca-derivations-prep Prepare for CA derivation support with lower impact changes	2024-01-24 18:12:42 -05:00
John Ericson	70e5469303	Use Nix's `Machine` type in a minimal way This is just using the fields from that type, and only where the types coincide. (There are two fields with different types, `speedFactor` most interestingly.) No code is reused, so we can be sure that no behavior is changed. Once the types are reconciled on the Nix side, then we can start carefully actually reusing code. Progress on #1164	2024-01-23 12:18:57 -05:00
John Ericson	2e6ee28f9b	`Machine` -> `::Machine` so we don't conflict with Nix's	2024-01-23 11:03:19 -05:00
John Ericson	4e8fbaa3d6	Replace `Child` with `SSHMaster::Connection` Nix defines basically an identical struct for the same purpose, so let's just use that.	2024-01-23 01:11:46 -05:00
John Ericson	588a0c5269	Merge remote-tracking branch 'upstream/master' into ca-derivations-prep	2023-12-23 19:19:54 -05:00
John Ericson	75f26f1fc4	Clean up `std::optional` dereferencing in the queue runner Instead of doing this partial operation a number of times, assert (with a comment), get a reference to the thing inside, and use that just once. (This refactor was done twice, "just once" for each time.)	2023-12-23 19:10:58 -05:00
John Ericson	6e67884ff1	One more `queryDerivationOutputMap` should use the eval store param	2023-12-11 14:05:18 -05:00
John Ericson	a6b6c5a539	Revert query -- those columns don't exist yet!	2023-12-11 12:58:54 -05:00
John Ericson	ebfefb9161	Sync up with some changes done to the main CA branch	2023-12-11 12:46:36 -05:00
John Ericson	8783dd53f6	Merge remote-tracking branch 'upstream/master' into ca-derivations-prep	2023-12-11 12:42:43 -05:00
John Ericson	831021808c	Merge pull request #1318 from obsidiansystems/use-build-result-serialiser Use factored-out `BuildResult` serializer	2023-12-08 11:25:05 -05:00
John Ericson	2ee0068fdc	Do not copy for both stores for now It has a performance cost, and as the comment says we should be doing the better solution. We want to land this preparatory change on prod while the rest is still on staging, so we should just skip it for now. Skipping it will not affect regular fixed-output and input-addressed derivations, which are the only ones prod would deal with upon getting this code. The main CA derivations support branch will revert this commit so it still works.	2023-12-07 15:05:03 -05:00
John Ericson	31ea6458ca	Merge remote-tracking branch 'upstream/master' into ca-derivations-prep	2023-12-07 15:01:35 -05:00
John Ericson	6a54ab24e2	Use factored-out `BuildResult` serializer For the record, here is the Nix 2.19 version: https://github.com/NixOS/nix/blob/2.19-maintenance/src/libstore/serve-protocol.cc, which is what we would initially use. It is a more complete version of what Hydra has today except for one thing: it always unconditionally sets the start/stop times. I think that is correct, as the other end seems to unconditionally measure them, but just to be extra careful, I reproduced the old behavior of falling back on Hydra's own measurements if `startTime` is 0. The only difference is that the fallback `stopTime` is now measured from after the entire `BuildResult` is transferred over the wire, but I think that should be negligible if it is measurable at all. (And remember, this is a fallback case I already suspect is dead code.)	2023-12-07 02:00:22 -05:00
John Ericson	86cd5e9076	`copyClosureTo`: Use `SubstituteFlag` instead of `bool` This matches Nix (in the same serialization logic in `src/libstore/legacy-ssh-store.cc`) and adds clarity.	2023-12-07 00:18:50 -05:00
John Ericson	11f8030b0f	Add comment from GitHub about adding to store as code comment	2023-12-06 17:59:25 -05:00
John Ericson	3df8feb3a2	Add TODO about setting `null` instead of empty string in JSON An empty string is a sneaky way to avoid hard failures --- things that expect strings still get strings, but it does conversely open the door up to soft failures (spooky-action-at-a-distance ones because the string did not have the expected invariants). "Fail fast" with null will ultimately make the system more robust, but force us to fix more things up front, and I don't want to change this without also fixing those things up front, especially as this commit is for now just part of the preparatory PR for which this is dead code.	2023-12-05 11:31:06 -05:00
John Ericson	069b7775c5	hydra-eval-jobs: Ensure we have output path if ca-derivations is disabled Brought up by @thufschmitt in https://github.com/NixOS/hydra/pull/1316#discussion_r1415111329 . This makes this closer to what was originally there --- which just dispatched off the experimental feature rather than the presence/absence of the output, too.	2023-12-05 11:26:26 -05:00
John Ericson	e3443cd22a	Put back nicer `copyClosure` instead of manual closure + copy It looks like we accidentally got the old code back, probably after a merge conflict resolution.	2023-12-04 17:41:11 -05:00
John Ericson	8046ec2668	Remove unused `outputHashes` variable This looks like a stray copy paste.	2023-12-04 16:21:56 -05:00
John Ericson	9ba4417940	Prepare for CA derivation support with lower impact changes This is just C++ changes without any Perl / Frontend / SQL Schema changes. The idea is that it should be possible to redeploy Hydra with these changes with (a) no schema migration and also (b) no regressions. We should be able to much more safely deploy these to a staging server and then production `hydra.nixos.org`. Extracted from #875 Co-Authored-By: Théophane Hufschmitt <theophane.hufschmitt@tweag.io> Co-Authored-By: Alexander Sosedkin <monk@unboiled.info> Co-Authored-By: Andrea Ciceri <andrea.ciceri@autistici.org> Co-Authored-By: Charlotte 🦝 Delenk <lotte@chir.rs> Co-Authored-By: Sandro Jäckel <sandro.jaeckel@gmail.com>	2023-12-04 16:14:47 -05:00
John Ericson	a5d44b60ea	Merge pull request #1313 from obsidiansystems/split-buildRemote Split the `buildRemote` function, take 2	2023-12-04 11:37:36 -05:00
John Ericson	363604846a	Again, use `const` in for loop As requested by @teh. Was lost in merge with master, now added back.	2023-12-04 11:31:05 -05:00
John Ericson	162b538912	Remove unused `thisArrow` variable	2023-12-04 11:27:39 -05:00
John Ericson	104baef503	Document the connection initialization process	2023-12-04 09:42:04 -05:00
Janne Heß	874fcae1e8	Merge pull request #1301 from delroth/queue-runner-perf queue-runner: only re-sort runnables by prio once per dispatch cycle	2023-12-04 15:27:14 +01:00
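
Commits 415f9f2daa and a596d6c3c1 above describe deriving a human-readable step name from a `drvpath` by stripping the `/nix/store/` prefix, the hash, and the `.drv` extension, and showing it only when it differs from the job's own name (`build.nixname`). The following is a minimal Python sketch of that idea, not Hydra's actual code. The function names `step_name` and `should_show_step_name` are hypothetical, and the sketch relies on the same assumptions the commit message states: the hash never contains a `-` and the store prefix is `/nix/store/`.

```python
STORE_PREFIX = "/nix/store/"  # assumption from the commit message: default store prefix
DRV_EXT = ".drv"


def step_name(drvpath: str) -> str:
    """Derive a readable name from a derivation path (hypothetical helper).

    Strips the store prefix, the hash (everything up to the first '-'),
    and the '.drv' extension.
    """
    base = drvpath
    if base.startswith(STORE_PREFIX):
        base = base[len(STORE_PREFIX):]
    # The hash is assumed to contain no '-', so the first '-' ends it.
    _hash, sep, name = base.partition("-")
    if not sep:
        # No '-' found: fall back to the whole basename.
        name = base
    if name.endswith(DRV_EXT):
        name = name[: -len(DRV_EXT)]
    return name


def should_show_step_name(drvpath: str, nixname: str) -> bool:
    """Show the step name only if it differs from the job's own drv name."""
    return step_name(drvpath) != nixname
```

With this sketch, a step that builds the job's own derivation is not labelled again, while a step that builds a dependency (for example a kernel build inside a machine configuration job) is labelled with the dependency's name.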


2979 commits