Better assignment of jobs to builders than required/supported features #604

Open
opened 2024-12-19 18:13:33 +00:00 by lunnova · 2 comments

Is your feature request related to a problem? Please describe.

I have multiple remote builders configured.

Routing jobs to a specific builder is currently very difficult.
In theory it's possible with supported and required features, but in practice these are unusable: changing the features a package requires changes its hash and, by default, makes it unbuildable for anyone who doesn't have your custom routing features set up.
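To illustrate the problem, here is a toy sketch in Python. It is not Nix's real hashing algorithm; `toy_drv_hash`, `can_build`, the attribute set, and the `my-big-memory-box` feature name are all made up for illustration. It only assumes what is described above: the required features are part of what gets hashed, and a builder can only take a job if it supports every required feature.

```python
import hashlib
import json


def toy_drv_hash(attrs: dict) -> str:
    """Hash every attribute of a toy derivation (NOT Nix's real algorithm)."""
    canonical = json.dumps(attrs, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def can_build(supported: set, attrs: dict) -> bool:
    """A builder qualifies only if it supports every required feature."""
    return set(attrs.get("requiredSystemFeatures", [])) <= supported


base = {"name": "rocm-example", "builder": "/bin/sh", "system": "x86_64-linux"}
# Hypothetical private tag added purely to steer the job to one machine.
routed = {**base, "requiredSystemFeatures": ["my-big-memory-box"]}

# The routing tag is part of the hashed input, so the hash changes.
print(toy_drv_hash(base) == toy_drv_hash(routed))  # False

# Someone whose builders lack the private tag cannot build the routed variant.
someone_else = {"kvm", "big-parallel"}  # assumed feature set, for illustration
print(can_build(someone_else, base))    # True
print(can_build(someone_else, routed))  # False
```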

Describe the solution you'd like

I'd like to be able to route jobs based on more dynamic things like available memory, disk space and load.
I'd also like to be able to change job routing without affecting the package hash, with appropriate guarantees that the routing information doesn't leak into the derivation, for example as env vars visible to the build. I don't know how this would work.
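As a rough sketch of what I mean (in Python, purely hypothetical, nothing like this exists today): `Builder`, `JobHints`, and `pick_builder` are invented names, and the hints are assumed to live outside the derivation so they can't affect its hash. The selection rule (filter by memory and scratch disk, then prefer the lowest load per core) is just one possible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Builder:
    name: str
    cores: int
    free_mem_gb: float
    free_disk_gb: float
    load: float  # e.g. 1-minute load average, as reported by the builder


@dataclass
class JobHints:
    # Hypothetical out-of-band hints, kept outside the derivation.
    mem_gb_per_core: float = 0.0
    scratch_disk_gb: float = 0.0


def pick_builder(builders: List[Builder], hints: JobHints) -> Optional[Builder]:
    """Return the least-loaded builder that satisfies the hints, or None."""

    def fits(b: Builder) -> bool:
        enough_mem = b.free_mem_gb >= hints.mem_gb_per_core * b.cores
        enough_disk = b.free_disk_gb >= hints.scratch_disk_gb
        return enough_mem and enough_disk

    candidates = [b for b in builders if fits(b)]
    return min(candidates, key=lambda b: b.load / b.cores, default=None)


builders = [
    Builder("small", cores=16, free_mem_gb=64, free_disk_gb=100, load=0.5),
    Builder("big", cores=32, free_mem_gb=512, free_disk_gb=1000, load=8.0),
]

heavy = JobHints(mem_gb_per_core=8, scratch_disk_gb=200)
print(pick_builder(builders, heavy).name)       # big
print(pick_builder(builders, JobHints()).name)  # small
```

A job with no hints goes to the least-loaded machine, while the heavy job is only ever offered to a builder that can actually finish it.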

Context

I'm working on packaging ROCm 6.3. Several of its packages have multi-hour builds, need over 8GB of memory per core given to the build, or have conditions like needing to spew 200GB of assembly into /build. This makes it easy for a package to land on a builder where it is guaranteed to fail an hour in.
Before working on this project I occasionally ran into issues with build routing, but they were usually just slightly frustrating.

Owner

There are two problems in this one question.

(1) A better scheduler for Nix remote builds: we are fully with you on this and want the same. It requires architectural changes, which pennae is developing, so that we can someday reach this goal.
(2) Adding the *SystemFeatures fields to the hash derivation modulo list, which raises the question: "can system features cause major differences in the output?"

The answer is yes in practice, because CPU microarchitectural differences cause impurities in badly written derivations. This can result in worse issues: a single derivation can produce many different binary outputs if you route it to Zen 1, Zen 2 and Zen 3 machines while passing -march=native. Nix cannot know what you are doing with your compiler flags IMHO.

It feels to me that (2) would be a hack and (1) would be the real solution to the problem.
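To make the concern with (2) concrete, here is a toy Python sketch. It is not how Nix computes hashes or runs builds: `toy_hash_modulo`, `toy_build`, the `IGNORED` list, the `cflags` attribute, and the feature names are all assumptions for illustration, with `requiredSystemFeatures` standing in for the `*SystemFeatures` fields. It shows two derivations that hash identically once the feature field is ignored, yet produce different outputs because `-march=native` resolves to whatever host they were routed to.

```python
import hashlib
import json

# Hypothetical ignore list standing in for the *SystemFeatures fields.
IGNORED = ("requiredSystemFeatures",)


def toy_hash_modulo(attrs: dict) -> str:
    """Hash a toy derivation while skipping the ignored fields."""
    kept = {k: v for k, v in attrs.items() if k not in IGNORED}
    return hashlib.sha256(json.dumps(kept, sort_keys=True).encode()).hexdigest()


def toy_build(attrs: dict, host_march: str) -> str:
    """Stand-in for a compiler: -march=native picks up the host's CPU."""
    native = "-march=native" in attrs["cflags"]
    target = host_march if native else "x86-64"
    return f"{attrs['name']} compiled for {target}"


drv_a = {
    "name": "demo",
    "cflags": "-O2 -march=native",
    "requiredSystemFeatures": ["zen1"],  # made-up routing feature
}
drv_b = {**drv_a, "requiredSystemFeatures": ["zen3"]}

# Same hash once the routing field is ignored...
print(toy_hash_modulo(drv_a) == toy_hash_modulo(drv_b))  # True

# ...but different outputs, depending on where each one was routed.
print(toy_build(drv_a, "znver1"))  # demo compiled for znver1
print(toy_build(drv_b, "znver3"))  # demo compiled for znver3
```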

Author

> The answer is yes in practice, because CPU microarchitectural differences cause impurities in badly written derivations. This can result in worse issues: a single derivation can produce many different binary outputs if you route it to Zen 1, Zen 2 and Zen 3 machines while passing -march=native. Nix cannot know what you are doing with your compiler flags IMHO.

This already happens without adding routing features: something uses -march=native, was never tested except with local builds, and then someone with builders configured uses it.

I wonder if there's a not-too-inelegant way to add an extension point that would allow hacky routing solutions in the short term without causing a maintenance burden.

Reference: lix-project/lix#604