Generic eval service #4

Open
ma27 wants to merge 71 commits from ma27/evolive:generic-eval-service into main
Owner

The goal is to have something that can substitute `hydra-eval-jobset` shelling out to `nix-eval-jobs`, including a way to push drvs into a remote store (S3 only at the moment).
ma27 added 71 commits 2026-01-04 13:17:26 +00:00
This essentially mimics the traditional Hydra way of evaluating stuff: a
"jobset" consists of a number of inputs and the file to evaluate is a
path relative to a selected input. This is the first step towards making
this useful for Hydra.

So far, the following differences exist:

* We accept only libfetchers-style URLs for git, not Hydra-style URLs
  (i.e. `git-url branch`). The former can be trivially constructed given
  the latter though.

* Only git inputs are accepted so far. Also, the libfetchers support is
  far from complete; e.g. submodules are not implemented yet.

* Hydra determines `revCount` et al. and passes them to the evaluation,
  which is important e.g. for nixpkgs. This is still missing here.

Before, the identifier of an evaluation was the nixpkgs revision, but this
is not unique anymore. Hence, we generate a fingerprint that is a hash
of all input URLs, input revisions, names and the input to evaluate.
This hash is used for the log-file and in all messages to identify the
evaluation in the server logs.
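The fingerprint described above could look like this minimal sketch (the helper name and the `name`/`url`/`rev` field layout are illustrative assumptions, not evolive's actual code):

```python
import hashlib

def evaluation_fingerprint(inputs: list[dict[str, str]], file_to_evaluate: str) -> str:
    """Derive a stable identifier from all inputs plus the evaluated file.

    Sorting by input name keeps the hash independent of input ordering.
    """
    h = hashlib.sha256()
    for inp in sorted(inputs, key=lambda i: i["name"]):
        for field in ("name", "url", "rev"):
            h.update(inp[field].encode())
            h.update(b"\0")  # separator so "ab"+"c" != "a"+"bc"
    h.update(file_to_evaluate.encode())
    return h.hexdigest()
```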
There's no place where two consecutive control messages are supposed to
be returned, so it doesn't make sense to wrap those in a list.
The old build system was dated, both in terms of nixpkgs tooling and in
terms of Python tooling. Since it's only me working on it, let's turn it
into another playground for uv+uv2nix to find out whether this is
something I want to do in more projects.
* A bit more reformat than I would like since I fixed pre-commit after
  the first commit.

* Fix parquet (doesn't seem to work with dicts, though this is just a
  temporary measure)

But on the bright side, we have a single entrypoint for everything now.
Yay!
This is a fatal error, i.e. nothing that we could cache (we cannot even
fingerprint the evaluation), so let's stop early in this case.
Apparently, doing a git fetch of a ref doesn't make the ref available in
the checkout. Hence, we now resolve it on the remote side and then only
work with the rev instead of the ref.
Introduce a function `execute_git_command` that's not part of `GitRepo`,
since it's needed e.g. for resolving refs to revs on the remote
repository, which happens before any `GitRepo` has been instantiated.

Also, use `create_subprocess_exec`: we don't have to worry about
shell injection if we don't use a shell.
This isn't as easy as I hoped and I'm wondering how bad this really is.
We not only need `git`, but also other stuff such as plain Nix
expressions, dicts (I figured that's a nicer option than Nix code in a
lot of cases) and further VCS inputs in the future.

This is essentially the git/fetching part abstracted away from before.
Usually, you don't want to run too many evaluations at once.
Since there's no standard way of implementing a file-backed semaphore
shared between multiple processes, we'll just allow one evaluation
per worker.
This should be reconsidered eventually.
After a bit of research, I don't think there's a general-purpose way
of obtaining metadata such as rev count for shallow clones.

I think it's still valuable to keep shallow=false here since the target
audience is nixpkgs evaluations, but you may still want to have full
metadata, e.g. for "release" evaluations. I'm not sure if it's a wise
choice to use the same repository for this, but we'll see about that.
The idea is to have a websocket connection that is left open while
streaming the response from evolive's evaluation endpoint. Whenever you
get a new derivation, you request its transfer on this websocket
endpoint.

This is long-lived to avoid retransmissions as much as possible: if a
path or a .drv was already transferred, don't do this again.

Internally, this does the following:

* compute the drv's closure via the daemon socket: hopefully this gets
  replaced by an RPC call soon, because this implementation is pretty
  awful. Still better than shelling out for it, though.

* Dump a path to a NAR by hand. This is done this way on purpose:
  streaming NARs from the daemon is a bad idea (it requires parsing the
  NAR to know when the stream ends), and we support a local store
  anyway. While we could do chroot stores under a different prefix in
  the future, experience has shown that evaluation on any more or less
  remote store is incredibly slow and thus irrelevant for this project.
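The session-level deduplication could be sketched like this (class and method names are invented for illustration; the actual websocket plumbing and NAR upload are elided):

```python
class TransferSession:
    """One instance lives as long as the websocket session, so paths
    transferred earlier in the session are never sent again."""

    def __init__(self) -> None:
        self._done: set[str] = set()

    async def request_transfer(self, drv_path: str, closure_of) -> list[str]:
        # `closure_of` stands in for the daemon-socket closure query;
        # only paths this session hasn't transferred yet get uploaded.
        todo = [p for p in await closure_of(drv_path) if p not in self._done]
        self._done.update(todo)
        # ... the actual NAR dump + upload of `todo` would happen here ...
        return todo
```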
That way information like the store-path is included and import is
hopefully easier.
Yielding from within keeps it alive for the rest of the session.
If something goes wrong very badly, error out early instead of waiting
for the timeout when the task finally gets awaited. The latter is still
useful as a last resort, but it's better to give this feedback as early
as possible.
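The fail-fast behaviour can be sketched with `asyncio.wait(..., return_when=FIRST_EXCEPTION)`; this is an illustrative pattern, not the actual code:

```python
import asyncio

async def run_until_first_error(coros) -> None:
    """Run all coroutines, but surface the first exception immediately
    instead of only noticing it when each task is awaited later."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for t in pending:
        t.cancel()
    for t in done:
        t.result()  # re-raises the first failure, if any
```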
Next step will be to add tasks for all the drvs to serve. However, we
still have to send some messages continuously, so we lock here.
Upload performance now matches the Nix CLI!
The overhead is negligible: it costs an additional 2s on a full nixpkgs
evaluation and far less on smaller ones.
* No locking: the synchronisation has a major overhead (I've seen
  slowdowns of up to 100% compared to the time it takes now). If we do
  upload one thing twice, that's most likely less bad than the overall
  overhead of synchronisation.

* Remove the semaphore: on my first attempt yesterday I observed the
  upload jobs hanging and suspected this might be a problem with too
  many tasks, hence the semaphore. As it turns out, this was probably
  wrong; uploading still works fine.
The problem with running `nix copy` itself is that shelling out for
each store path would become pretty slow over time. Also, the entire
logic is handled on the client side of Nix, so we cannot leverage a
worker operation for it. Hence, this now essentially does the following
for each instantiated job J:

* computeFSClosure(J).
* upload paths in topological generations such that the S3 store remains
  consistent.

Additionally, a few small improvements were added:

* Cap the number of concurrent uploads to S3 itself and the number of
  instantiated jobs being uploaded. Each job gets its own Nix daemon
  client, hence the locking isn't necessary anymore.

* Cache existence of paths in a local Redis.

* Use uvloop as runtime.

Also cache machine id per evaluation now.
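The "topological generations" upload order might look like the following sketch (the function name and the shape of the `refs` mapping are assumptions):

```python
def topological_generations(refs: dict[str, set[str]]) -> list[list[str]]:
    """Group store paths so that each generation only depends on earlier
    ones; generations are uploaded one after another, with the paths
    inside a generation going up concurrently, so the S3 store never
    contains a path before its references."""
    # only consider references inside the closure; ignore self-references
    remaining = {p: (set(r) & set(refs)) - {p} for p, r in refs.items()}
    generations: list[list[str]] = []
    while remaining:
        ready = [p for p, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("reference cycle in closure")
        generations.append(ready)
        for p in ready:
            del remaining[p]
        for deps in remaining.values():
            deps.difference_update(ready)
    return generations
```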
@ -16,0 +86,4 @@
shallow clones. `shallow` must be explicitly turned off.
- Paths are marked as uploaded before the actual job starts to prevent duplicated uploads. Also,
the in-memory cache isn't flushed immediately into Rest. This has two implications:
Owner

Redis
@ -0,0 +20,4 @@
'';
};
lixEvalJobsPackage = lib.mkOption {
Owner

interpreterPackages (like kernelPackages) ?
@ -47,0 +297,4 @@
def write_parquet() -> None:
df.write_parquet(
write_side,
metadata={
Owner

probably architecture/OS makes sense as well
raito approved these changes 2026-01-22 23:54:10 +00:00

Reference
the-distro/evolive!4