nix-eval-jobs flakey test #703

Closed
opened 2025-02-27 06:00:29 +00:00 by jade · 7 comments
Owner

https://buildkite.com/lix-project/lix/builds/301#01954565-28ac-4f84-a4f9-9c20d31e0172

nix-eval-jobs-tests> >           assert "possible infinite recursion" in stderr
nix-eval-jobs-tests> E           assert 'possible infinite recursion' in "fetching path input 'path:/build/source/tests/assets'\nfetching path input 'path:/nix/store/k06gswns3afji80ypdjh78015...ding result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running?\n"
nix-eval-jobs-tests> tests/test_eval.py:142: AssertionError
nix-eval-jobs-tests> fetching path input 'path:/build/source/tests/assets'
nix-eval-jobs-tests> fetching path input 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source'
nix-eval-jobs-tests> warning: not writing modified lock file of flake 'path:/build/source/tests/assets':
nix-eval-jobs-tests> • Updated input 'nixpkgs':
nix-eval-jobs-tests>     'github:NixOS/nixpkgs/2873a73123077953f3e6f34964466018876d87c4' (2023-12-10)
nix-eval-jobs-tests>   → 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source?lastModified=1&narHash=sha256-b4JrUmqT0vFNx42aEN9LTWOHomkTKL/ayLopflVf81U%3D' (1970-01-01)
nix-eval-jobs-tests> error: BUG: while reading result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running?
nix-eval-jobs-tests> fetching path input 'path:/build/source/tests/assets'
nix-eval-jobs-tests> fetching path input 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source'
nix-eval-jobs-tests> warning: not writing modified lock file of flake 'path:/build/source/tests/assets':
nix-eval-jobs-tests> • Updated input 'nixpkgs':
nix-eval-jobs-tests>     'github:NixOS/nixpkgs/2873a73123077953f3e6f34964466018876d87c4' (2023-12-10)
nix-eval-jobs-tests>   → 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source?lastModified=1&narHash=sha256-b4JrUmqT0vFNx42aEN9LTWOHomkTKL/ayLopflVf81U%3D' (1970-01-01)
nix-eval-jobs-tests> error: BUG: while reading result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running?
https://buildkite.com/lix-project/lix/builds/301#01954565-28ac-4f84-a4f9-9c20d31e0172 ``` nix-eval-jobs-tests> > assert "possible infinite recursion" in stderr nix-eval-jobs-tests> E assert 'possible infinite recursion' in "fetching path input 'path:/build/source/tests/assets'\nfetching path input 'path:/nix/store/k06gswns3afji80ypdjh78015...ding result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running?\n" nix-eval-jobs-tests> tests/test_eval.py:142: AssertionError ``` ``` nix-eval-jobs-tests> fetching path input 'path:/build/source/tests/assets' nix-eval-jobs-tests> fetching path input 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source' nix-eval-jobs-tests> warning: not writing modified lock file of flake 'path:/build/source/tests/assets': nix-eval-jobs-tests> • Updated input 'nixpkgs': nix-eval-jobs-tests> 'github:NixOS/nixpkgs/2873a73123077953f3e6f34964466018876d87c4' (2023-12-10) nix-eval-jobs-tests> → 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source?lastModified=1&narHash=sha256-b4JrUmqT0vFNx42aEN9LTWOHomkTKL/ayLopflVf81U%3D' (1970-01-01) nix-eval-jobs-tests> error: BUG: while reading result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running? nix-eval-jobs-tests> fetching path input 'path:/build/source/tests/assets' nix-eval-jobs-tests> fetching path input 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source' nix-eval-jobs-tests> warning: not writing modified lock file of flake 'path:/build/source/tests/assets': nix-eval-jobs-tests> • Updated input 'nixpkgs': nix-eval-jobs-tests> 'github:NixOS/nixpkgs/2873a73123077953f3e6f34964466018876d87c4' (2023-12-10) nix-eval-jobs-tests> → 'path:/nix/store/k06gswns3afji80ypdjh7801514dx9an-lszshlfg2dsvdap2xwnksqin9i6h2zdg-source?lastModified=1&narHash=sha256-b4JrUmqT0vFNx42aEN9LTWOHomkTKL/ayLopflVf81U%3D' (1970-01-01) nix-eval-jobs-tests> error: BUG: while reading result for attrPath 'packageWithInfiniteRecursion', worker pipe got closed but evaluation worker still running? ```
Member

This issue was mentioned on Gerrit on the following CLs:

  • comment in cl/2607 ("feat: add a --priority flag to nix-env --install")
  • commit message in cl/2798 ("nix-eval-jobs: reenable tests, rework waitpid handling")
<!-- GERRIT_LINKBOT: {"cls": [{"backlink": "https://gerrit.lix.systems/c/lix/+/2607", "number": 2607, "kind": "comment"}, {"backlink": "https://gerrit.lix.systems/c/lix/+/2798", "number": 2798, "kind": "commit message"}], "cl_meta": {"2607": {"change_title": "feat: add a --priority flag to nix-env --install"}, "2798": {"change_title": "nix-eval-jobs: reenable tests, rework waitpid handling"}}} --> This issue was mentioned on Gerrit on the following CLs: * comment in [cl/2607](https://gerrit.lix.systems/c/lix/+/2607) ("feat: add a --priority flag to nix-env --install") * commit message in [cl/2798](https://gerrit.lix.systems/c/lix/+/2798) ("nix-eval-jobs: reenable tests, rework waitpid handling")
Author
Owner

cc @ma27 as owner of the n-e-j integration, though this is probably a gremlins grade bug

cc @ma27 as owner of the n-e-j integration, though this is probably a gremlins grade bug
Owner

execution speed of the test is also incredibly variable, running in ~30 seconds on most platforms while on darwin it regularly needs five minutes or more

execution speed of the test is also incredibly variable, running in ~30 seconds on most platforms while on darwin it regularly needs five minutes or more
Member

Oh come on... I ran the test quite a few times while preparing the CL and now that it landed, I managed to trigger this 🙈
I'll try to work something out this weekend.

Oh come on... I ran the test quite a few times while preparing the CL and now that it landed, I managed to trigger this 🙈 I'll try to work something out this weekend.
Member

So, regarding the flaky test: when n-e-j notices that the pipe is broken, it will look at what happened to the worker:

  • if it's a SIGSEGV, we assume an infinite recursion (that's what the test is for)
  • if it's still running, we assume that only the pipe is broken and SIGKILL the worker.
  • there are a few other checks, but those are not relevant for the issue.

The second case is what happens once in a while (I need a few hundred runs of the test on my machine to hit this).

I essentially let a script print out the time when the testcase failed (and ran the test in a loop). The date and the last entry in coredumpctl tell me that the process still segfaulted (rather dumb approach, I know!). So what probably happened is that

  • the fds in the worker are closed
  • n-e-j tries to read from the pipe
  • the process exited

and given that the read must happen in between, that'd also explain why it took me a while to trigger this.

Unless I'm missing something, the only real option is to wait in the handler since we can't really know if the pipe got closed by whatever issue or if the worker itself died and I don't really think that's reasonable here since a broken pipe with a running worker is also a case we have to catch.

Given that this is only triggered under certain circumstances (recursion that doesn't get caught by the evaluator), it happens rather seldomly and I'd expect people to use nix eval instead of nix-eval-jobs for the actual debugging of that, I'm wondering if the right call is to just skip the test in CI (I think I'd prefer to keep it however as it might be helpful for debugging this) or to use something like pytest-retry (although I don't really like these band-aid things in general).

wdyt? @jade @pennae


execution speed of the test is also incredibly variable, running in ~30 seconds on most platforms while on darwin it regularly needs five minutes or more

Yeah, can confirm.
Interestingly, the worst offenders are all testcases that involve flakes:

nix-eval-jobs-tests-aarch64-darwin> 163.19s call     tests/test_eval.py::test_query_cache_status
nix-eval-jobs-tests-aarch64-darwin> 113.14s call     tests/test_eval.py::test_flake
nix-eval-jobs-tests-aarch64-darwin> 34.15s call     tests/test_eval.py::test_eval_error
nix-eval-jobs-tests-aarch64-darwin> 20.62s call     tests/test_eval.py::test_recursion_error
nix-eval-jobs-tests-aarch64-darwin> 3.48s call     tests/test_eval.py::test_expression

Since you work more on Lix core than I do, any idea why that's the case?
Otherwise, I'll try to narrow it down more ~tomorrowish.

So, regarding the flaky test: when n-e-j notices that the pipe is broken, it will look at what happened to the worker: * if it's a SIGSEGV, we assume an infinite recursion (that's what the test is for) * if it's still running, we assume that only the pipe is broken and SIGKILL the worker. * there are a few other checks, but those are not relevant for the issue. The second case is what happens once in a while (I need a few hundred runs of the test on my machine to hit this). I essentially let a script print out the time when the testcase failed (and ran the test in a loop). The date and the last entry in coredumpctl tell me that the process still segfaulted (rather dumb approach, I know!). So what probably happened is that * the fds in the worker are closed * n-e-j tries to read from the pipe * the process exited and given that the read must happen in between, that'd also explain why it took me a while to trigger this. Unless I'm missing something, the only real option is to wait in the handler since we can't really know if the pipe got closed by whatever issue or if the worker itself died and I don't really think that's reasonable here since a broken pipe with a running worker is also a case we have to catch. Given that this is only triggered under certain circumstances (recursion that doesn't get caught by the evaluator), it happens rather seldomly and I'd expect people to use `nix eval` instead of `nix-eval-jobs` for the actual debugging of that, I'm wondering if the right call is to just skip the test in CI (I think I'd prefer to keep it however as it might be helpful for debugging this) or to use something like pytest-retry (although I don't really like these band-aid things in general). wdyt? @jade @pennae --- > execution speed of the test is also incredibly variable, running in ~30 seconds on most platforms while on darwin it regularly needs five minutes or more Yeah, can confirm. Interestingly, the worst offenders are all testcases that involve flakes: ``` nix-eval-jobs-tests-aarch64-darwin> 163.19s call tests/test_eval.py::test_query_cache_status nix-eval-jobs-tests-aarch64-darwin> 113.14s call tests/test_eval.py::test_flake nix-eval-jobs-tests-aarch64-darwin> 34.15s call tests/test_eval.py::test_eval_error nix-eval-jobs-tests-aarch64-darwin> 20.62s call tests/test_eval.py::test_recursion_error nix-eval-jobs-tests-aarch64-darwin> 3.48s call tests/test_eval.py::test_expression ``` Since you work more on Lix core than I do, any idea why that's the case? Otherwise, I'll try to narrow it down more ~tomorrowish.
Owner

wdyt?

wait, case 2 is the bug? the coordinator notices a pipe closing before the process is actually gone? we really shouldn't be treating that as a bug on the first time around without applying some kind of wait timeout, exiting is a racy process (and nix code makes it even racier). only a process that hasn't done the assigned work to the coordinator's satisfaction ought to trigger a bug. killing the worker seems fine, if maybe a bit excessively eager

> wdyt? wait, case 2 is the bug? the coordinator notices a pipe closing before the process is actually gone? we really shouldn't be treating that as a bug on the first time around without applying some kind of wait timeout, exiting is a racy process (and nix code makes it even racier). only a process that hasn't done the assigned work to the coordinator's satisfaction ought to trigger a bug. killing the worker seems fine, if maybe a bit excessively eager
Owner

test is disabled for now because it was flaking so much: cl/2725

test is disabled for now because it was flaking so much: cl/2725
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lix-project/lix#703
No description provided.