A single build failure should (optionally?) not kill other running builds #878

Open
opened 2025-06-25 18:04:54 +00:00 by k900 · 10 comments
Member

We've all been there. You're building your system config, your Chromium build that took five hours is almost done, and oops, farts.conf failed to validate, and you forgot to --keep-going, and now the entire build tree is dead and so are five hours of your life. We should have a mode that's kind of like --keep-going, except it stops scheduling new builds, but waits for existing ones to complete before failing, and we should have that be the default.

Member

Suggestions for name:

  • --finish-builds
  • --complete-builds
  • --complete-builds-on-failure is probably too long

Another option:

Enhance --keep-going with an =$MODE argument (only the = form accepted), where an unqualified --keep-going resolves to the mode equivalent to the current behaviour (i.e. exhaust every build). So e.g. --keep-going=only-complete-builds, or some more concise mode name.

Member

If we turn this around then the flag could be --fail-fast, inspired by various CI systems.

Author
Member

So --fail-fast and --fail-faster? xp

Member

One issue I'd have is that --fail-fast sounds like the current strategy. That is, it aborts all builds ASAP on the first failure.

--fail-fast would make sense with a (not very) breaking change where the default is to finish current builds, and --fail-fast becomes the flag to get the current behaviour.

Author
Member

My non-joke proposal would be:

  • default is the new behavior
  • --fail-fast becomes the old keep-going = false behavior
  • --keep-going keeps the old keep-going = true behavior
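To make the difference between the three behaviors concrete, here is a toy simulation. It is only a sketch under stated assumptions: the `Policy` enum, the `simulate` function, and the result labels are hypothetical names invented for illustration, builds are treated as independent of each other, and none of this reflects how Lix actually schedules builds.

```python
import heapq
from enum import Enum


class Policy(Enum):
    # Hypothetical names for the three behaviors in the proposal above.
    FAIL_FAST = "fail-fast"        # old keep-going = false: kill running builds
    FINISH_RUNNING = "default"     # proposed default: stop scheduling, let running builds finish
    KEEP_GOING = "keep-going"      # old keep-going = true: build everything that can be built


def simulate(builds, max_jobs, policy):
    """Simulate independent builds given as (name, duration, succeeds) tuples.

    Returns a dict mapping each name to 'ok', 'failed', 'killed' or 'not started'.
    """
    result = {name: "not started" for name, _, _ in builds}
    queue = list(builds)   # builds waiting for a free job slot, in order
    running = []           # min-heap of (finish_time, name, succeeds)
    now = 0
    failed = False
    while queue or running:
        # New builds may start only if nothing failed yet, or under KEEP_GOING.
        allow_new = not failed or policy is Policy.KEEP_GOING
        while allow_new and queue and len(running) < max_jobs:
            name, duration, succeeds = queue.pop(0)
            heapq.heappush(running, (now + duration, name, succeeds))
        if not running:
            break  # nothing is running and nothing more may be scheduled
        now, name, succeeds = heapq.heappop(running)
        result[name] = "ok" if succeeds else "failed"
        if not succeeds:
            failed = True
            if policy is Policy.FAIL_FAST:
                # Everything still running is torn down immediately.
                for _, other, _ in running:
                    result[other] = "killed"
                break
    return result


builds = [("farts.conf", 1, False), ("chromium", 5, True), ("hello", 1, True)]
for policy in Policy:
    print(policy.value, simulate(builds, max_jobs=2, policy=policy))
```

With two job slots, `chromium` is killed under `FAIL_FAST`, finishes under the proposed default while `hello` is never started, and both finish under `KEEP_GOING`.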
Member

Changing the default also makes sense from a "principle of least astonishment" point of view: builds that have not yet failed, and could still succeed, should be allowed to complete. After all, we know that they did not depend on the build that failed, so it's likely enough that they will not be affected by whatever change fixes the current failure.

Owner

Let’s do it.

Owner

I'm going to implement this: --fail-fast is retained. In general, we don't like contradictory flags, so this might be held up by concerns about whether we should transition to a proper enum to represent states perfectly.

What needs to be done:

  • Introduce a mechanism to prevent scheduling of new builds and gate it behind a general flag that is controlled via the settings
  • Use failFast instead of keepGoing to let exceptions bubble
  • Use keepGoing to set the "prevent new builds" flag

This should happen for builds and copies as well.
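The plan above boils down to deriving two internal switches from the two user-facing settings. A minimal sketch of that mapping, assuming hypothetical names (`FailureHandling`, `failure_handling`) and assuming that the contradictory combination is rejected outright, which the thread does not decide:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureHandling:
    # Hypothetical internal switches, named after the plan above.
    abort_running: bool       # let the first error bubble up immediately
    prevent_new_builds: bool  # stop scheduling new builds after the first error


def failure_handling(fail_fast: bool, keep_going: bool) -> FailureHandling:
    """Map the failFast / keepGoing settings to the two internal switches."""
    if fail_fast and keep_going:
        # Assumption: contradictory flags are an error rather than one silently winning.
        raise ValueError("fail-fast and keep-going contradict each other")
    return FailureHandling(
        abort_running=fail_fast,             # failFast lets exceptions bubble
        prevent_new_builds=not keep_going,   # keepGoing controls the scheduling gate
    )
```

With neither setting enabled, this yields the proposed default: running builds are not aborted, but no new ones are scheduled after a failure. A single enum, as raised in the comment above, would make the contradictory state unrepresentable in the first place.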

A major caveat is that the remote build protocol relies on keepGoing flags; as long as we cannot extend it, this feature cannot travel to the remote builders. It's local-only, which might make it impossible to implement until we solve this (if we want to remain consistent).

Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/4673 (https://gerrit.lix.systems/c/lix/+/4673): "libstore/build: make build gracefully stops after first error"
Owner

> A major caveat is that the remote build protocol relies on keepGoing flags; as long as we cannot extend it, this feature cannot travel to the remote builders. It's local-only, which might make it impossible to implement until we solve this (if we want to remain consistent).

this is not fully true. keepGoing is transferred as an unsigned, not a boolean, so currently all values not equal to 0 are handled the same (and --keep-going passes 1 to the remote). we could abuse this to transport an enum or bitmask as well, with cppnix and older lix versions simply treating them all the same. this is an excellent source of confusion though, and we should not do it
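A small sketch of why smuggling an enum through that field would be confusing. Everything here is an assumption for illustration: the `KeepGoing` values, the function names, and the 64-bit little-endian encoding stand in for the real protocol code, which is not reproduced here.

```python
import struct
from enum import IntEnum


class KeepGoing(IntEnum):
    # Hypothetical values; ALL stays 1 because --keep-going sends 1 today.
    NEVER = 0
    ALL = 1
    SCHEDULED = 2


def encode(mode: KeepGoing) -> bytes:
    # Assumption: the field is a 64-bit little-endian unsigned integer.
    return struct.pack("<Q", int(mode))


def old_receiver(payload: bytes) -> bool:
    """An older daemon: any non-zero value simply means 'keep going'."""
    (value,) = struct.unpack("<Q", payload)
    return value != 0


def new_receiver(payload: bytes) -> KeepGoing:
    """A hypothetical newer daemon that understands all three values."""
    (value,) = struct.unpack("<Q", payload)
    return KeepGoing(value)


# The same bytes mean different things depending on who reads them.
payload = encode(KeepGoing.SCHEDULED)
print(old_receiver(payload))   # True: indistinguishable from ALL
print(new_receiver(payload))   # KeepGoing.SCHEDULED
```

An old remote would silently upgrade `SCHEDULED` to full keep-going behavior, so the same command line would behave differently depending on the builder's version.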

this is also another case of the cli needing an enum argument that defaults to something so that we can retain compatibility with previous versions: --keep-going [ never | scheduled | all ] or something (with never being the plain behavior we have now, all what we can currently opt into, and scheduled the new behavior)
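For illustration, an optional-valued flag of this shape can be sketched with Python's `argparse`. This only demonstrates the CLI idea and is not Lix's argument parser; `parse_keep_going` is a hypothetical helper, and the defaults chosen here (absent means `never`, bare `--keep-going` means `all`) are one reading of the compatibility requirement.

```python
import argparse


def parse_keep_going(argv):
    """Return the keep-going mode selected by the given argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--keep-going",
        nargs="?",                         # the value is optional
        choices=["never", "scheduled", "all"],
        const="all",                       # bare --keep-going keeps its current meaning
        default="never",                   # flag absent: today's default behavior
    )
    return parser.parse_args(argv).keep_going


print(parse_keep_going([]))                              # never
print(parse_keep_going(["--keep-going"]))                # all
print(parse_keep_going(["--keep-going", "scheduled"]))   # scheduled
print(parse_keep_going(["--keep-going=scheduled"]))      # scheduled
```

One design caveat: with an optional value in the space-separated form, the parser will try to consume a following positional argument as the mode, which is an argument for accepting only the `=` form as suggested earlier in the thread.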

pennae added this to the 2.97 milestone 2025-12-01 14:51:11 +00:00
pennae modified the milestone from 2.97 to 2.95 2025-12-01 14:52:00 +00:00
Reference
lix-project/lix#878