Test failure of Functional2 in CI (and potentially elsewhere) #1017

Open
opened 2025-10-18 21:12:10 +00:00 by commentatorforall · 1 comment

## Describe the bug

Darwin builders sometimes struggle with the timeout tests and emit an unexpected error message, which makes the test fail.
See [Exhibit A](https://buildkite.com/lix-project/lix/builds/5259#0199e27a-7937-4593-844b-dee89bdc6e50/27-2836) and [Exhibit B](https://buildkite.com/lix-project/lix/builds/5254#0199e238-e0c8-4b17-a629-c2f33b027853/27-2787).

It is currently unknown whether this behavior was already present in F1 and merely got copied over, or whether it was introduced by the migration to F2.

## Steps To Reproduce

1. Look at long relation chains and spot CI failures on (lowdown)-aarch64-darwin machines
2. Check whether the build failure was due to an assertion error or a Lix crash (if the latter, see #1016)

## Expected behavior

The CI should not fail: either the underlying issue gets fixed or, if this is expected behavior, the currently failing message is allowed.

## Additional context

See the functional2 room on Matrix.
Owner

we would guess this is kill-order dependent: during sandbox teardown we kill the process group, but we think it's possible that in a process tree `nix-daemon < sandbox parent < sandbox child` the child gets killed first and the parent propagates the child's exit code to the daemon before the parent itself gets killed. if that's the case (which should be easy to verify for folks running darwin), there's nothing we can do about this and it is not an error. (we can't rely on a signal exit being reported as `128 + n`, because that's a shell-specific convention)
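
To illustrate the last point, here is a minimal Python sketch, not taken from Lix, showing how the same SIGKILL death is reported differently depending on who observes it. The helper names `status_seen_by_direct_parent` and `status_seen_through_shell` are hypothetical, and the sketch assumes a POSIX system with `sh` on the `PATH`. The negative return code is Python's own encoding of "terminated by signal", while the value above 128 is what a relaying shell chooses to exit with.

```python
import signal
import subprocess
import sys


def status_seen_by_direct_parent() -> int:
    """Kill a child with SIGKILL and return the status its direct parent observes."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)
    # Python's subprocess module encodes "killed by signal N" as -N.
    return child.wait()


def status_seen_through_shell() -> int:
    """Return the status of the same kind of death after an intermediate sh relays it."""
    # The inner sh kills itself with SIGKILL. The outer sh survives, reads the
    # inner status from $?, and exits normally with that number.
    relay = subprocess.run(["sh", "-c", "sh -c 'kill -KILL $$'; exit $?"])
    return relay.returncode


if __name__ == "__main__":
    print("direct parent sees:", status_seen_by_direct_parent())
    print("via relaying shell:", status_seen_through_shell())
```

The direct parent sees a signal termination, whereas the process behind the relaying shell only sees an ordinary exit with a number greater than 128 (commonly `128 + n`, though the exact value is up to the shell). A test that matches on one specific form can therefore fail depending on which process gets to report first.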