Commit graph

173 commits

Author SHA1 Message Date
db46b01ae9 feat(monitoring): add pyroscope to the infrastructure
Vendored for the time being.
See https://cl.forkos.org/c/nixpkgs/+/181 for upstreaming properly.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-23 20:43:00 +02:00
c380f29937 fix(grafana): remove the global pgsql module dependency for now
We should re-introduce it once things are a bit scoped out.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-23 20:43:00 +02:00
5dc6165c2e feat(gerrit): add git in the environment to perform git-native clones
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-23 20:43:00 +02:00
58c0dd3d2e feat(public): add listmonk instance on news.forkos.org
To prepare for public communications and updates.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-21 16:45:12 +02:00
8c35dfa8e0 fix(gerrit): tinker a bit with gerrit defaults for transfer & caching
We had some issues in the past with too many packfiles and timeouts
during transfers; let's try to provide a bit of relief in bad scenarios.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-21 16:31:16 +02:00
cfc24abfe1 adjust hydra-gc numbers
for the new ssds
2024-08-20 12:08:49 +02:00
f938fcb24e hydra: increase git operations timeout 2024-08-16 17:44:45 +02:00
d3e053809c hydra: log_prefix needs to be / terminated 2024-08-16 09:25:46 +02:00
e2a990c982 hydra: listen on 127.0.0.1 instead of localhost
For some cursed reasons, the latter doesn't work on build-coord:

Aug 16 07:06:22 build-coord hydra-server[109560]: Resolved [localhost]:3000 to [::1]:3000, IPv6
Aug 16 07:06:22 build-coord hydra-server[109560]: Resolved [localhost]:3000 to [127.0.0.1]:3000, IPv4
Aug 16 07:06:22 build-coord hydra-server[109560]: Binding to TCP port 3000 on host ::1 with IPv6
Aug 16 07:06:22 build-coord hydra-server[109560]: Binding to TCP port 3000 on host 127.0.0.1 with IPv4
Aug 16 07:06:22 build-coord hydra-server[109560]: 2024/08/16-07:06:22 Can't connect to TCP port 3000 on 127.0.0.1 [Invalid argument]
2024-08-16 09:20:49 +02:00
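The log above shows `localhost` resolving to both `::1` and `127.0.0.1`. A minimal Python sketch, not taken from this repository, lists every address the resolver returns for a name; the helper name `resolve_all` is hypothetical, and the port 3000 comes from the log.

```python
import socket


def resolve_all(host: str, port: int) -> list[tuple[str, str]]:
    """Return each distinct (family, address) pair the resolver offers for host."""
    results: list[tuple[str, str]] = []
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        name = "IPv6" if family == socket.AF_INET6 else "IPv4"
        pair = (name, sockaddr[0])
        if pair not in results:
            results.append(pair)
    return results


if __name__ == "__main__":
    # A name such as "localhost" may yield several entries, depending on the host.
    print(resolve_all("localhost", 3000))
    # A literal IPv4 address always yields exactly one IPv4 entry.
    print(resolve_all("127.0.0.1", 3000))
```

Binding to a literal address avoids depending on how a given machine resolves `localhost`.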
c33326f836 hydra: switch to using mTLS instead of local peer auth 2024-08-16 08:19:18 +02:00
0dd333c573 postgres: add mTLS support
New client certs can be minted via the provided script, which is meant
to be run on the postgres server (where the CA private key is
conveniently deployed).
2024-08-16 07:59:12 +02:00
29babfc5c4 Revert "Partial revert "Add Grapevine Matrix server and matrix-hookshot""
This reverts commit 17c342b33e.

Grapevine's use of IFD was fixed upstream.
2024-08-15 16:22:22 +02:00
90325344a3 Reserve builder-11 for build coordination, rename to build-coord 2024-08-13 19:12:36 +02:00
17c342b33e Partial revert "Add Grapevine Matrix server and matrix-hookshot"
This partially reverts commit d2f3ca5624.

Said commit requires IFD to eval, which is generally unwanted, and is
currently forbidden on Hydra (imo: rightfully so, we should try to
properly separate evals from builds).

The services/ file for grapevine is kept but will not work without the
flake.nix change reapplied.
2024-08-13 00:35:10 +02:00
84efd0976d feat(alerts): add a sync failed too often alert
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-09 16:25:34 +02:00
e2f5a7b0e4 feat(alerts): add basic postgresql alerts
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-09 16:06:34 +02:00
7388de79c4 feat(alerts): add some basic "host & hardware" alerts
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-08-09 16:06:34 +02:00
f8cad42b5c Set up alertmanager-hookshot-adapter 2024-08-09 14:03:56 +00:00
9ad279a505 Set up admins + DNS for hookshot 2024-08-09 14:03:56 +00:00
d2f3ca5624 Add Grapevine Matrix server and matrix-hookshot
It doesn't want to work.
2024-08-09 14:03:56 +00:00
b6375b8294 add staging sync services 2024-08-08 15:16:04 +02:00
420e6915df You have divergent branches and need to specify how to reconcile them 2024-08-08 10:39:00 +02:00
dbb4e03292 Revert "builders: direct buildbot to /mnt store via ForceCommand"
This reverts commit dfd48f2179.
2024-08-08 10:37:42 +02:00
cd0621ba55 builders/netboot: add separate firmware_part output 2024-08-06 13:26:51 +02:00
dfd48f2179 builders: direct buildbot to /mnt store via ForceCommand 2024-08-06 13:26:35 +02:00
77ff556583 builders: fix provisioning of ssh hostkeys 2024-08-05 08:18:20 +02:00
fe3cb577c1 fix eval 2024-08-05 07:20:59 +02:00
20fc4c8f96 builders: move provisioning of ssh hostkeys to a systemd service
at first activation it does not yet have a working network setup
2024-08-05 07:17:45 +02:00
bce44930b1 builders: provision ssh hostkeys on boot 2024-08-04 18:12:02 +02:00
79dea0686b add 'notipxe' netboot loader based on systemd-initrd + u-root 2024-08-03 20:28:57 +02:00
aeb8102ae4 builders: do not mount / and /boot on netboot systems 2024-08-03 20:01:39 +02:00
830dcbf6bc builders: do not mount / and /boot on netboot systems 2024-08-03 18:41:01 +02:00
93822775a9 baremetal-builders: do not create swapfile on rootfs when netbooting 2024-08-03 18:10:59 +02:00
dd028656ac builders: fix serial console 2024-08-02 13:21:04 +02:00
88317d099c attempt to fix netboot hydra jobs 2024-08-02 01:05:20 +02:00
1cbf286f18 build netboot files from hydra 2024-08-01 22:47:25 +02:00
6dc424dd43 wob01: serve an ipxe over iusb-spoof 2024-08-01 22:16:48 +02:00
504a443acc adjust hydra-gc numbers
we want to see how garbage collection would behave on a 480GB drive
2024-07-31 23:44:08 +02:00
96d58bbd41 forgejo: disable users explore page
This was requested and should make it a decent bit more difficult to get
a somewhat complete list of users on this instance.

We are, however, aware of other endpoints that can be used to get a
similar result. Those just aren't as convenient or obvious.

https://forgejo.org/docs/latest/admin/config-cheat-sheet/#service---explore-serviceexplore
2024-07-31 01:42:05 +02:00
5154906aac fix eval in assignments.nix 2024-07-30 17:23:54 +02:00
f3828368e6 hydra: set reasonable max-jobs and cores 2024-07-30 17:03:12 +02:00
4e2d21930f baremetal-builders: detect percent_filled for the correct partition 2024-07-30 13:59:46 +02:00
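As an illustration of measuring fill level for a specific partition, here is a minimal Python sketch. It is not the repository's implementation: the function name mirrors the commit subject, and the path argument is an assumption about how the partition is selected.

```python
import shutil


def percent_filled(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Pass a path on the partition of interest (for example the store mount),
    # not "/" by default, so the right filesystem is measured.
    print(f"{percent_filled('/'):.1f}% used")
```

Note that this computes used/total, which can differ slightly from the `Use%` column of `df` on filesystems with reserved blocks.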
99259356f2 make buildbot-signing-key accessible to buildbot-worker 2024-07-28 23:30:38 +02:00
5474832b07 baremetal builders: filesystem optimizations 2024-07-28 19:20:23 +02:00
15a684c5d7 baremetal-builders: more 'intelligent' gc 2024-07-26 12:17:27 +02:00
74e06ac6d0 hydra gc every 20h
metrics analysis has shown that this is unlikely to fill up the builders
2024-07-24 09:35:18 +02:00
e5a3ce2283 buildbot fixes (#76)
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
Signed-off-by: Yureka <yureka@forkos.org>
Co-authored-by: raito <raito@noreply.git.lix.systems>
Co-committed-by: raito <raito@noreply.git.lix.systems>
2024-07-24 06:44:25 +00:00
bebc7f2586 We have nothing to hide 2024-07-23 18:09:49 +03:00
608c0e5973 hydra: bump to 16 evaluation workers, we have enough RAM and cores to afford it 2024-07-22 23:13:33 +02:00
62ccc0282b fix(ows): per-job runtime directories + proper local refspec
The local refspec was weird and exploited an edge case for the nixpkgs
jobs where local and from were the same.

We are more explicit now, which fixes the sandbox jobs.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-22 15:41:47 +02:00
d84a43b781 builders: run gc 3x per day
We can still adjust it if the disks fill up, but currently it is too frequent
2024-07-21 19:49:21 +02:00
2dc5899660 baremetal: run hydra store gc as builder user 2024-07-20 17:00:39 +02:00
adaf4b0aef baremetal: tmp on the same filesystem as hydra store 2024-07-20 17:00:39 +02:00
5bde7e2358 use dedicated store partition for hydra builds 2024-07-20 15:14:00 +02:00
d9809e1e78 gerrit-one-way-sync: disallow auto-merging a staging iteration into master 2024-07-20 15:14:00 +02:00
3fa4a25d87 gerrit-one-way-sync: set git user info 2024-07-20 15:14:00 +02:00
0ff5eea4ed gerrit-one-way-sync: merge instead of rebase 2024-07-20 15:14:00 +02:00
80c4757571 gerrit01: add a one-way-sync service
It's basic and does not handle conflicts, which need to be resolved
manually.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-19 17:52:44 +02:00
d1e64b6610 Fix eval warning here too 2024-07-19 12:06:03 +03:00
766dc4c383 Mimir also wants network-online.target
Thank you helpful eval warning
2024-07-19 12:03:55 +03:00
65b07a936b Make sure Mimir starts after network is up 2024-07-19 12:00:52 +03:00
8afcf249d6 buildbot: upgrade to local machine specifications
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-18 12:18:02 +02:00
4473717e9f gerrit: introduce buildbot checks plugin
It's a modified version of @puck's Lix buildbot checks for
gerrit.lix.systems with a slight generalization in the configuration for
many repositories.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-18 10:56:46 +02:00
da7175303c buildbot: add support for remote builders via baremetal machines
For now, only builder-3 is used.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-17 18:28:26 +02:00
7789e9ce75 services/buildbot: init
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-17 18:00:51 +02:00
fda59ee6c0 gerrit: factor more configuration in the NixOS module for external consumption
Other modules may require information to configure themselves from the
Gerrit module.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-17 15:43:35 +02:00
95b58de737 forgejo: use redis as cache and session provider 2024-07-16 20:09:15 +02:00
8b9d33d70c forgejo: disable registrations, enable auto-registration for SSO 2024-07-16 17:14:23 +02:00
dd069c40d7 forgejo: init service 2024-07-16 15:44:06 +02:00
e3e60a5e72 services/monitoring: add scraping of Gerrit's internal metrics 2024-07-15 11:02:54 +00:00
2e86babc8a services/gerrit: add metrics-prometheus-exporter 2024-07-15 11:02:54 +00:00
7a937e837a Unlimit Mimir max series 2024-07-13 15:52:46 +03:00
7d9461808c builders: configure a swapfile + zswap 2024-07-13 04:40:51 +02:00
293bc52ace hydra: reduce number of parallel builds per builder to limit RAM consumption 2024-07-13 04:38:24 +02:00
756341ea4c builders: tune sshd MaxStartups to avoid rate limiting Hydra 2024-07-12 21:57:04 +02:00
e6ead602f0 builders get a special treatment for dns64 2024-07-11 02:05:58 +02:00
b14f155d55 add ipmitool on vpn-gw and builders 2024-07-10 20:49:17 +02:00
d2336262fb hydra: set allowed URIs in restricted mode for flake inputs 2024-07-10 18:52:22 +02:00
411d514ab9 hydra: user hydra-www needs nix-daemon access too 2024-07-10 17:36:39 +02:00
f74d1ca0f6 hydra: start signing paths 2024-07-10 17:34:57 +02:00
e84b362b7a Allow 12 hours of backfill for metrics
This is somewhat experimental and may explode, but we'll see, I guess
2024-07-10 14:59:09 +03:00
9e7e6d42ab Make nginx/loki/mimir go fast 2024-07-10 14:55:28 +03:00
f2c2bc5ab6 hydra: output machine host key as base64 in the generated machines.conf 2024-07-10 02:16:45 +02:00
f214da9228 hydra: add hydra to nix trusted-users 2024-07-10 02:03:33 +02:00
82db8f7f1e gerrit01: some more tuning
* flip off proxy_buffering again
* enable REVWALK_USE_PRIORITY_QUEUE
* enable delta compression, because that's not a bottleneck and it's
  nicer on bandwidth
2024-07-10 00:27:36 +01:00
9988811be5 hydra: unplug the EPYC
thank you for your testing services

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-10 01:13:10 +02:00
2308870aa5 builders: add a nice tag to deploy all of them at once
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-10 00:59:31 +02:00
645ad7d062 builders: add builder user
currently hardcoded to hydra's coordinator public key

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-10 00:55:25 +02:00
a30c1f7d78 hydra: wire up new builders
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-10 00:45:02 +02:00
eb21cb6916 add baremetal builders 2024-07-10 00:35:01 +02:00
3828721e4f services/netbox: enable OIDC via Lix SSO
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
2024-07-09 02:45:58 +02:00
8a9ff8c40d services/gerrit: migrate to Gerrit from the-distro/nix-gerrit flake 2024-07-08 23:30:59 +01:00
7f46e5d9a4 services: add ofborg, currently running rabbitmq only 2024-07-08 23:55:11 +02:00
b55475c12e Fix up the rest of the dashboards 2024-07-08 11:43:57 +03:00
9f0e601d84 Scrape grafana/loki/mimir own metrics 2024-07-08 10:25:15 +03:00
209f71c63a Update node_exporter dashboard for new metrics structure 2024-07-08 10:16:37 +03:00
563e0685d4 Metrics fixups
- fix grafana-agent config format
- rekey metrics-push-password for fodwatch
2024-07-08 10:01:25 +03:00
8d2a367e92 grafana-agent: make bagel.monitoring.grafana-agent.exporters an attrset
This allows us to use multiple jobs, one for each additional exporter,
and set their `job_name` accordingly.

`job_name` is exported as the `job` label on the resulting metrics.
This lets us quickly see which metrics of an exporter are actually
available by simply filtering all metrics by
`{job="$jobname"}`.
2024-07-08 09:34:26 +03:00
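To illustrate the effect of a `{job="$jobname"}` filter, the following Python sketch selects samples by their `job` label. The sample data, metric names, and the helper `filter_by_job` are hypothetical; a real query would run against the metrics backend rather than an in-memory list.

```python
def filter_by_job(samples: list[dict], job: str) -> list[dict]:
    """Keep only the samples whose `job` label equals `job`."""
    return [s for s in samples if s["labels"].get("job") == job]


if __name__ == "__main__":
    # Hypothetical samples, one per exporter job.
    samples = [
        {"name": "node_load1", "labels": {"job": "node"}},
        {"name": "pg_up", "labels": {"job": "postgres"}},
        {"name": "node_memory_MemFree_bytes", "labels": {"job": "node"}},
    ]
    for sample in filter_by_job(samples, "node"):
        print(sample["name"])
```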
db8c831c2f grafana-agent: set hostname label on all metrics
This is handy for quickly seeing all metrics exported by a node, without
having to mangle the already existing `instance` label.

`hostname` is essentially a variant of `instance` but without ports.
2024-07-08 09:34:26 +03:00
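The relationship between `instance` and `hostname` described above can be sketched in Python as stripping a trailing port. This is only an illustration under the assumption that `instance` has the usual `host:port` form; the function name is hypothetical and the commit itself may set the label differently.

```python
def hostname_from_instance(instance: str) -> str:
    """Drop a trailing :port from an instance label, if one is present."""
    # Bracketed IPv6 form, e.g. "[::1]:9100".
    if instance.startswith("[") and "]" in instance:
        return instance[1:instance.index("]")]
    host, sep, port = instance.rpartition(":")
    # Only strip when there is exactly one colon followed by digits;
    # otherwise (no port, or a bare IPv6 address) return the value unchanged.
    if sep and port.isdigit() and ":" not in host:
        return host
    return instance


if __name__ == "__main__":
    print(hostname_from_instance("build-coord:9100"))
```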
ba0d50624d Switch to push metrics with Grafana Agent 2024-07-08 09:34:24 +03:00