VM workloads for hypervisors #271

Merged
raito merged 7 commits from vm-workloads into main 2025-08-26 21:44:22 +00:00
Owner

This implements an ad-hoc mechanism to provision declarative VMs on our hypervisors based on microvm.nix, cloud-hypervisor and ZFS.

Simple modus operandi:

  • Drop a NixOS definition in `vm/$hypervisor_attr_name_in_flake_dot_nix/$vm_name/default.nix`.
  • Configure `hardware.vm` as shown by `test01`.

To connect to a VM from the host, you can use `vmsh`, a poor man's console access that relies on screen and virtio-console (TODO: add patches for screen resizing from SpectrumOS).
If you need to exchange files, `/run/microvm/<vm name>/xchg` exists and is mounted on both sides.

VMs are as stateless as they can be: they boot from the **host Nix store** and have access to it, which deduplicates OS data across all VMs. Volumes are usually meant ONLY for `/var`; as a side effect, `/etc/` gets reset on each reboot. An exception has been made for `sshd`, which is also mounted on the host, under the host's `/var`. This enables preprovisioning of the SSH host keys for secret provisioning.

A custom kernel is used and enables very fast booting, at the cost of playing whack-a-mole with whatever is broken or missing in the Kconfig.

One thing is very broken:

  • `/var/lib/microvms/%i/sshd` directory creation: systemd-tmpfiles does NOT always kick in. If you manually remove the directory and re-switch to the configuration, tmpfiles will not run because there is no configuration change. A better provisioning technique than systemd-tmpfiles should be adopted here.

Other than that, the approach can be improved a lot. Perhaps we don't need a separate XFS journal on a different block device, and it is premature optimization.

A useful real-world example is provided with n64gw01, a NAT64 gateway built on jool. It is perhaps a bit complicated and contains some hacks due to the networking aspects.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
NixOS vanilla kernels contain many good things but are not optimized
for "cloud" instances, read: VM instances on a hypervisor implementing
modern features, e.g. cloud-hypervisor.

As a result, they cause long boot times for no good reason.

With this commit, we ship a minimal KVM kernel config. The remaining
job will be to find a way to maintain it sanely.

[root@test01:~]# systemd-analyze time
Startup finished in 206ms (kernel) + 3.371s (initrd) + 2.307s (userspace) = 5.885s
multi-user.target reached after 2.294s in userspace.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
This is a NAT64 node.

Signed-off-by: Raito Bezarius <masterancpp@gmail.com>
delroth left a comment
Owner

Quick initial pass without looking too much into the details.

@ -4,0 +38,4 @@
# TODO(Raito): replace me by a `vmDefinitionsPath` rather.
readVMs = hypervisorName:
mapAttrs (n: _: mkVM ../../../vm/${hypervisorName}/${n}
Owner

I'm not entirely convinced by the idea of treating vms differently from other hosts (which are in hosts/). Do you have arguments pro/con?

Author
Owner

The way I see it, something like cp $vm_1 ../$hyp/$vm_1 or mv should be the simplest way to handle things, aside from some unavoidable state issues. That works fine for (internal) VMs, but not for baremetal, which can't just come online by moving files around and prepping some state.

The whole point here is to get automatic VM loading based on the directory tree, where the hypervisor is implied. That breaks if we go into hosts/.

Maybe this is overkill and we can revisit it later. The upside is (limited for now) simplicity in managing VMs. The downside is that this adds compute in a weird way, it's not your typical VPS, and it's not baremetal either.
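
The directory-driven loading described in this reply could look roughly like the following. This is only a sketch under assumptions: `vmRoot` and `listVMDirs` are hypothetical names, and the actual `readVMs` / `mkVM` in this PR may be structured differently.

```nix
# Sketch: discover VM definitions from the directory tree.
# `vmRoot` and `listVMDirs` are hypothetical names, not taken from the PR.
{ lib, vmRoot }:
let
  # Returns { <vm name> = <path to the VM directory>; } for one hypervisor.
  listVMDirs = hypervisorName:
    let
      dir = vmRoot + "/${hypervisorName}";
      # A hypervisor without a directory simply has no VMs.
      entries = if builtins.pathExists dir then builtins.readDir dir else { };
      # Keep only sub-directories; each one is expected to hold a default.nix.
      vmDirs = lib.filterAttrs (_: type: type == "directory") entries;
    in
    lib.mapAttrs (name: _: dir + "/${name}") vmDirs;
in
listVMDirs
```

With this shape, the hypervisor is implied by the parent directory, so moving a VM directory from one hypervisor's folder to another is what changes where it is evaluated (state aside).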

delroth marked this conversation as resolved
@ -0,0 +287,4 @@
let
systemd-openbao = import inputs.systemd-openbao { };
in
[
Owner

This will inevitably drift and lead to confusion. Any idea on how we could avoid this?

Author
Owner

We would need to extract the modules used by colmena's hive instantiation function and apply them here. I think it's feasible, but I can't think of a way to do it off the top of my head.

It's safe to have the full colmena modules because we don't use any that possess computational meaning.
Author
Owner

I implemented it in https://git.lix.systems/the-distro/infra/pulls/279/commits/dbba80616f2349b05ed23d0855bc21f78cb2187d because I obviously got hit by it.
raito marked this conversation as resolved
@ -0,0 +393,4 @@
networking.nftables.enable = true;
services.dbus.implementation = "broker";
systemd.services.systemd-oomd = {
requires = [ "userborn.service" ];
Owner

Can you document why?

Author
Owner

IIRC, this is an upstream bug; I need to double-check. systemd-oomd depends on its user being ready, but it doesn't order itself correctly.
Author
Owner
Not necessary anymore since https://github.com/NixOS/nixpkgs/pull/424035#pullrequestreview-3010253359 which we do have.
raito marked this conversation as resolved
@ -0,0 +419,4 @@
systemd.network.enable = true;
systemd.network.networks = mapAttrs' mkNetworks cfg.interfaces;
# Otherwise, it's really annoying at redeployment time.
Owner

Any reason to not do it globally for all monitoring agent proms then?

Author
Owner

No good reason.

raito marked this conversation as resolved
@ -4,0 +28,4 @@
vmOptions = {
options = {
evalModule = mkOption {
Author
Owner

Some good feedback from pennae: just don't do that.
raito marked this conversation as resolved
@ -14,0 +67,4 @@
environment.systemPackages = [
(pkgs.writeShellScriptBin "vmsh" ''
NAME=$1
[[ -d /var/lib/microvms/$NAME ]] || (echo "No such VM '$NAME'"; exit 1)
Author
Owner

broken parsing
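
The comment does not say which part is broken. One pitfall visible in the quoted line is that `exit 1` inside `( ... )` only exits the subshell, so the script keeps running after printing the error. A minimal sketch of a stricter check follows; the usage message is an assumption, and the console attachment from the PR is left out.

```nix
{ pkgs, ... }:
{
  environment.systemPackages = [
    (pkgs.writeShellScriptBin "vmsh" ''
      NAME="$1"
      if [[ -z "$NAME" ]]; then
        echo "usage: vmsh <vm name>" >&2
        exit 1
      fi
      # `exit` here runs in the script's own shell, not in a `( ... )`
      # subshell, so the script really stops when the VM is unknown.
      if [[ ! -d "/var/lib/microvms/$NAME" ]]; then
        echo "No such VM '$NAME'" >&2
        exit 1
      fi
      # The console attachment from the PR is omitted in this sketch.
    '')
  ];
}
```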

raito marked this conversation as resolved
@ -0,0 +1,111 @@
{ lib, pkgs, ... }:
{
microvm.vsock.cid = 5;
Author
Owner

automatic numbering by hash of name is better
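
One possible way to do this, sketched with builtins only. `vsockCidFor` is a hypothetical helper (not part of microvm.nix), `"test01"` is just an example name, and truncating the hash means two names can collide, so a real implementation would want a uniqueness assertion across the VMs of a hypervisor.

```nix
# Sketch: derive a stable vsock CID from the VM name.
# `vsockCidFor` is a hypothetical helper, not part of microvm.nix.
let
  vsockCidFor = name:
    let
      # First 7 hex digits of the SHA-256 of the name: a 28-bit value.
      hex = builtins.substring 0 7 (builtins.hashString "sha256" name);
      # builtins offers no direct hex-to-int conversion; TOML integer
      # literals accept a 0x prefix, so parse the digits through fromTOML.
      n = (builtins.fromTOML "v = 0x${hex}").v;
    in
    # CIDs 0, 1 and 2 are reserved, so shift the result past them.
    n + 3;
in
{
  microvm.vsock.cid = vsockCidFor "test01";
}
```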

raito marked this conversation as resolved
@ -0,0 +369,4 @@
};
boot.kernelPackages = pkgs.linuxPackages_custom {
version = "6.6.100";
src = pkgs.fetchurl {
Author
Owner

use nixpkgs source and infer the ver from there instead
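
A sketch of what that could look like, assuming the 6.6 series from nixpkgs (`pkgs.linux_6_6`) is the intended base. The `configfile` path is a hypothetical placeholder for the minimal KVM config shipped by this PR.

```nix
{ pkgs, ... }:
let
  # Assumption: follow whichever 6.6 LTS kernel nixpkgs currently ships.
  baseKernel = pkgs.linux_6_6;
in
{
  boot.kernelPackages = pkgs.linuxPackages_custom {
    # Reuse the version and tarball of the nixpkgs kernel rather than
    # pinning "6.6.100" and a fetchurl hash by hand.
    inherit (baseKernel) version src;
    # Hypothetical path to the minimal KVM kernel config.
    configfile = ./kvm-minimal.config;
  };
}
```

The kernel version then moves with the nixpkgs pin, so the custom config has to be re-checked whenever that pin is bumped.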

raito marked this conversation as resolved
@ -0,0 +4,4 @@
# This is critical to ensure that the host sends IPv4 packets directly to this VM's IPv4 interface.
microvm.binScripts.tap-up = ''
${lib.getExe' pkgs.iproute2 "ip"} replace 57.129.18.76 dev vm-n64gw01-v4 scope link
Author
Owner

this is absolutely wrong i think and doesn't work.

raito marked this conversation as resolved
raito force-pushed vm-workloads from e1fad9d828 to c19639a1ad 2025-08-26 18:14:41 +00:00 Compare
raito changed title from WIP: VM workloads for hypervisors to VM workloads for hypervisors 2025-08-26 18:17:15 +00:00
raito force-pushed vm-workloads from c19639a1ad to cb4a70bb44 2025-08-26 18:29:41 +00:00 Compare
raito force-pushed vm-workloads from cb4a70bb44 to 05b60ba890 2025-08-26 18:38:24 +00:00 Compare
raito force-pushed vm-workloads from 05b60ba890 to bee1fd09bf 2025-08-26 19:23:34 +00:00 Compare
raito force-pushed vm-workloads from bee1fd09bf to e98152f82e 2025-08-26 19:26:15 +00:00 Compare
requested review from delroth 2025-08-26 20:02:07 +00:00
delroth approved these changes 2025-08-26 21:04:08 +00:00
@ -0,0 +126,4 @@
mkCreateScript = name: { path, pool, size, properties, ... }:
let
max = a: b: if a <= b then b else a;
# journal size: 10MB if <1GB
Owner

16

raito marked this conversation as resolved
raito force-pushed vm-workloads from e98152f82e to 21e37eeef1 2025-08-26 21:44:16 +00:00 Compare
raito merged commit 21e37eeef1 into main 2025-08-26 21:44:22 +00:00
raito deleted branch vm-workloads 2025-08-26 21:44:23 +00:00
2 participants
Reference: the-distro/infra#271