nix-eval-jobs flakey test #703
Reference: lix-project/lix#703
No description provided.
https://buildkite.com/lix-project/lix/builds/301#01954565-28ac-4f84-a4f9-9c20d31e0172
This issue was mentioned on Gerrit on the following CLs:
cc @ma27 as owner of the n-e-j integration, though this is probably a gremlins grade bug
execution speed of the test is also incredibly variable: it runs in ~30 seconds on most platforms, while on darwin it regularly needs five minutes or more
Oh come on... I ran the test quite a few times while preparing the CL and now that it landed, I managed to trigger this 🙈
I'll try to work something out this weekend.
So, regarding the flaky test: when n-e-j notices that the pipe is broken, it will look at what happened to the worker:
The second case is what happens once in a while (I need a few hundred runs of the test on my machine to hit this).
I essentially ran the test in a loop and had a script print out the time whenever the testcase failed. The timestamp and the last entry in coredumpctl tell me that the process still segfaulted (rather dumb approach, I know!). So what probably happened is that
and given that the read must happen in between, that'd also explain why it took me a while to trigger this.
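The race the timestamps point at can be reproduced in miniature: a worker can close its end of the pipe (or crash mid-write) before it has actually exited, so the coordinator sees EOF while a non-blocking waitpid still reports the child as running. A minimal standalone sketch (illustrative only, not n-e-j's actual code):

```python
import os
import time

r, w = os.pipe()
pid = os.fork()
if pid == 0:        # child, playing the "worker"
    os.close(r)
    os.close(w)     # pipe breaks here...
    time.sleep(0.5) # ...but the process lingers before exiting
    os._exit(0)

os.close(w)                    # parent keeps only the read end
data = os.read(r, 1)           # returns b'' (EOF) as soon as the child closes w
reaped, status = os.waitpid(pid, os.WNOHANG)
print(data, reaped)            # EOF observed, yet the child is not reaped (reaped == 0)
os.waitpid(pid, 0)             # eventually the child does exit normally
```

In this window the coordinator has a broken pipe and a worker that, as far as waitpid can tell, is still alive.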
Unless I'm missing something, the only real option is to wait in the handler, since we can't really know whether the pipe got closed by some other issue or because the worker itself died. I don't really think that's reasonable here, since a broken pipe with a still-running worker is also a case we have to catch.
Given that this is only triggered under certain circumstances (recursion that doesn't get caught by the evaluator), it happens rather seldom, and I'd expect people to use nix eval instead of nix-eval-jobs for the actual debugging of that. So I'm wondering if the right call is to just skip the test in CI (I think I'd prefer to keep the test itself, however, as it might be helpful for debugging this) or to use something like pytest-retry (although I don't really like these band-aid things in general). wdyt? @jade @pennae
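For context, the retry approach could also be hand-rolled. This sketch shows the mechanism a plugin like pytest-retry implements; the decorator name and parameters here are illustrative, not pytest-retry's actual API:

```python
import functools
import time

def flaky(retries=2, delay=0.0):
    """Re-run a failing test a bounded number of times before
    reporting failure (hypothetical stand-in for a retry plugin)."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError:
                    if attempt == retries:
                        raise       # out of retries: surface the failure
                    time.sleep(delay)
        return wrapper
    return deco

attempts = 0

@flaky(retries=3)
def test_sometimes_fails():
    global attempts
    attempts += 1
    assert attempts >= 3   # fails twice, passes on the third run

test_sometimes_fails()
print(attempts)  # 3
```

The downside is exactly the "band-aid" concern above: retries hide the failure rate instead of fixing the race.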
Yeah, can confirm.
Interestingly, the worst offenders are all testcases that involve flakes:
Since you work more on Lix core than I do, any idea why that's the case?
Otherwise, I'll try to narrow it down more ~tomorrowish.
wait, case 2 is the bug? the coordinator notices a pipe closing before the process is actually gone? we really shouldn't be treating that as a bug on the first time around without applying some kind of wait timeout, exiting is a racy process (and nix code makes it even racier). only a process that hasn't done the assigned work to the coordinator's satisfaction ought to trigger a bug. killing the worker seems fine, if maybe a bit excessively eager
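The "wait timeout" idea above can be sketched as a grace-period reap loop: after the coordinator sees the pipe break, poll waitpid non-blockingly until a deadline, and only then treat a still-running worker as stuck. All names here are illustrative; this is not n-e-j's implementation:

```python
import os
import signal
import time

def reap_with_grace(pid, timeout=5.0, poll=0.05):
    """After observing a broken pipe, give the worker a grace period
    to finish exiting before killing it (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return status          # worker exited within the grace period
        time.sleep(poll)
    os.kill(pid, signal.SIGKILL)   # worker really is stuck: kill and reap
    return os.waitpid(pid, 0)[1]

# Demo: a worker that lingers briefly after "closing its pipe".
pid = os.fork()
if pid == 0:
    time.sleep(0.2)
    os._exit(0)
status = reap_with_grace(pid)
print(os.WEXITSTATUS(status))  # 0: exited cleanly within the grace period
```

With something like this, only a worker that neither produced its results nor exits within the timeout would count as a bug.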
test is disabled for now because it was flaking so much:
cl/2725