CI jobs timing out due to plausible daemon bug? #549

Closed
opened 2024-10-15 21:16:35 +00:00 by jade · 3 comments
Owner

Various CI runs time out after 3 hours due to what appears to be a deadlock or broken async code. I have a core dump of a bad build, but we don't have debuginfo on kj, so it's not practical to figure out the async state that led to it getting busted.

Generally the jobs that have this problem have been nixos tests. The CI cluster has a bunch of interesting properties that might be relevant: it is building with something like:

401863 nix build -L --option keep-going true --max-jobs 0 --max-silent-time 1200 --timeout 7200 --store ssh-ng://nix@epyc.infra.newtype.fr?ssh-key=/run/agenix/buildbot-remote-builder-key&base64-ssh-public-host-key=c3NoLWVkMjU1MTkgQUFBQUMzTnphQzFsWkRJMU5URTVBQUFBSU9YVDlJbml0MU1oS3Q0cmpCQU5McTB0MGJQd3cvV1FaOTZ1QjRBRURybWw= --eval-store ssh-ng://localhost --out-link result-tests.remoteBuilds_local_2_3 /nix/store/357fgw5lrpr8z3fkrvcr5rl4g0gnl33j-vm-test-run-remoteBuilds_local_2_3.drv^*

The client side has been observed to be waiting for stderr, I believe, somewhere around here: ee0c195eba/src/libstore/remote-store.cc (L902)

So it's almost certainly stuck on the daemon side.

Anyway here is a stack trace of the daemon from the core dump I pulled:

~/lix » NIX_DEBUG_INFO_DIRS=/nix/store/v43xfm90rl4048p9wwyp352qdwm73vp5-lix-2.92.0-devpre20241009_9865eba-debug/lib/debug gdb --core nix-haunting.874300 /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/bin/nix-daemon
Reading symbols from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/bin/nix-daemon...
Reading symbols from /nix/store/v43xfm90rl4048p9wwyp352qdwm73vp5-lix-2.92.0-devpre20241009_9865eba-debug/lib/debug/.build-id/65/25b582d3304854f3418e7db123f22f0ba2ea92.debug...
[New LWP 874300]
[New LWP 874354]
[New LWP 874301]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1".
Core was generated by `nix-daemon'.
#0  0x00007f351431951e in epoll_wait ()
   from /nix/store/5adwdl39g3k9a2j0qadvirnliv4r7pwd-glibc-2.39-52/lib/libc.so.6
=> 0x00007f351431951e <epoll_wait+110>:	48 3d 00 f0 ff ff  	cmp    $0xfffffffffffff000,%rax
[Current thread is 1 (Thread 0x7f3512a49580 (LWP 874300))]
(gdb) bt
#0  0x00007f351431951e in epoll_wait ()
   from /nix/store/5adwdl39g3k9a2j0qadvirnliv4r7pwd-glibc-2.39-52/lib/libc.so.6
#1  0x00007f35151e3096 in kj::UnixEventPort::wait() ()
   from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so
#2  0x00007f35151d01d6 in kj::EventLoop::wait() ()
   from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so
#3  0x00007f35151d1fd4 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode, kj::_::PromiseDisposer>&&, kj::_::ExceptionOrValue&, kj::WaitScope&, kj::SourceLocation) ()
   from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so
#4  0x00007f351519e242 in kj::Promise<boost::outcome_v2::basic_result<std::map<std::shared_ptr<nix::Goal>, nix::Goal::WorkResult, std::less<std::shared_ptr<nix::Goal> >, std::allocator<std::pair<std::shared_ptr<nix::Goal> const, nix::Goal::WorkResult> > >, std::__exception_ptr::exception_ptr, boost::outcome_v2::policy::exception_ptr_rethrow<std::map<std::shared_ptr<nix::Goal>, nix::Goal::WorkResult, std::less<std::shared_ptr<nix::Goal> >, std::allocator<std::pair<std::shared_ptr<nix::Goal> const, nix::Goal::WorkResult> > >, std::__exception_ptr::exception_ptr, void> > >::wait (this=0xfffffffffffffffc, this@entry=0x7ffc6a940ef8, waitScope=..., location=...)
    at /nix/store/q59xanm47akjc5adhmlrd0r9fjxjbsry-capnproto-1.0.2/include/kj/async-inl.h:1357
#5  0x00007f351519343d in nix::Worker::run (this=0x7ffc6a941178, req=...) at src/libstore/build/worker.cc:255
#6  0x00007f351514948e in nix::runWorker<nix::Store::buildPathsWithResults(std::vector<nix::DerivedPath, std::allocator<nix::DerivedPath> > const&, nix::BuildMode, std::shared_ptr<nix::Store>)::$_0>(nix::Worker&, nix::Store::buildPathsWithResults(std::vector<nix::DerivedPath, std::allocator<nix::DerivedPath> > const&, nix::BuildMode, std::shared_ptr<nix::Store>)::$_0) (worker=..., mkGoals=...) at src/libstore/build/entry-points.cc:11
#7  nix::Store::buildPathsWithResults (this=0x5569a0af9e90, reqs=std::vector of length 1, capacity 1 = {...},
    buildMode=nix::bmNormal, evalStore=std::shared_ptr<nix::Store> (empty) = {...})
    at src/libstore/build/entry-points.cc:62
#8  0x00007f3514fc27fc in nix::daemon::performOp (logger=logger@entry=0x5569a0addd70, store=...,
    trusted=<optimized out>, recursive=<optimized out>, clientVersion=291, from=..., to=...,
    op=nix::WorkerProto::Op::BuildPathsWithResults) at src/libstore/daemon.cc:571
#9  0x00007f3514fbf0b5 in nix::daemon::processConnection (store=..., from=..., to=..., trusted=<optimized out>,
    recursive=<optimized out>) at src/libstore/daemon.cc:1075
Various CI runs time out after 3 hours due to what appears to be a deadlock or broken async code. I have a core dump of a bad build, but we don't have debuginfo on kj, so it's not practical to figure out the async state that led to it getting busted. Generally the jobs that have this problem have been nixos tests. The CI cluster has a bunch of interesting properties that might be relevant: it is building with something like: ``` 401863 nix build -L --option keep-going true --max-jobs 0 --max-silent-time 1200 --timeout 7200 --store ssh-ng://nix@epyc.infra.newtype.fr?ssh-key=/run/agenix/buildbot-remote-builder-key&base64-ssh-public-host-key=c3NoLWVkMjU1MTkgQUFBQUMzTnphQzFsWkRJMU5URTVBQUFBSU9YVDlJbml0MU1oS3Q0cmpCQU5McTB0MGJQd3cvV1FaOTZ1QjRBRURybWw= --eval-store ssh-ng://localhost --out-link result-tests.remoteBuilds_local_2_3 /nix/store/357fgw5lrpr8z3fkrvcr5rl4g0gnl33j-vm-test-run-remoteBuilds_local_2_3.drv^* ``` The client side has been observed to be waiting for stderr, I believe, somewhere around here: https://git.lix.systems/lix-project/lix/src/ee0c195eba7d16b796fd9883e3fe88c0d64ff0bf/src/libstore/remote-store.cc#L902 So it's almost certainly stuck on the daemon side. Anyway here is a stack trace of the daemon from the core dump I pulled: ``` ~/lix » NIX_DEBUG_INFO_DIRS=/nix/store/v43xfm90rl4048p9wwyp352qdwm73vp5-lix-2.92.0-devpre20241009_9865eba-debug/lib/debug gdb --core nix-haunting.874300 /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/bin/nix-daemon Reading symbols from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/bin/nix-daemon... Reading symbols from /nix/store/v43xfm90rl4048p9wwyp352qdwm73vp5-lix-2.92.0-devpre20241009_9865eba-debug/lib/debug/.build-id/65/25b582d3304854f3418e7db123f22f0ba2ea92.debug... [New LWP 874300] [New LWP 874354] [New LWP 874301] This GDB supports auto-downloading debuginfo from the following URLs: <https://debuginfod.archlinux.org> Debuginfod has been disabled. To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit. [Thread debugging using libthread_db enabled] Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1". Core was generated by `nix-daemon'. #0 0x00007f351431951e in epoll_wait () from /nix/store/5adwdl39g3k9a2j0qadvirnliv4r7pwd-glibc-2.39-52/lib/libc.so.6 => 0x00007f351431951e <epoll_wait+110>: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax [Current thread is 1 (Thread 0x7f3512a49580 (LWP 874300))] (gdb) bt #0 0x00007f351431951e in epoll_wait () from /nix/store/5adwdl39g3k9a2j0qadvirnliv4r7pwd-glibc-2.39-52/lib/libc.so.6 #1 0x00007f35151e3096 in kj::UnixEventPort::wait() () from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so #2 0x00007f35151d01d6 in kj::EventLoop::wait() () from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so #3 0x00007f35151d1fd4 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode, kj::_::PromiseDisposer>&&, kj::_::ExceptionOrValue&, kj::WaitScope&, kj::SourceLocation) () from /nix/store/qa8skf32hczz2xp1d1dk393brqx6vlga-lix-2.92.0-devpre20241009_9865eba/lib/liblixstore.so #4 0x00007f351519e242 in kj::Promise<boost::outcome_v2::basic_result<std::map<std::shared_ptr<nix::Goal>, nix::Goal::WorkResult, std::less<std::shared_ptr<nix::Goal> >, std::allocator<std::pair<std::shared_ptr<nix::Goal> const, nix::Goal::WorkResult> > >, std::__exception_ptr::exception_ptr, boost::outcome_v2::policy::exception_ptr_rethrow<std::map<std::shared_ptr<nix::Goal>, nix::Goal::WorkResult, std::less<std::shared_ptr<nix::Goal> >, std::allocator<std::pair<std::shared_ptr<nix::Goal> const, nix::Goal::WorkResult> > >, std::__exception_ptr::exception_ptr, void> > >::wait (this=0xfffffffffffffffc, this@entry=0x7ffc6a940ef8, waitScope=..., location=...) at /nix/store/q59xanm47akjc5adhmlrd0r9fjxjbsry-capnproto-1.0.2/include/kj/async-inl.h:1357 #5 0x00007f351519343d in nix::Worker::run (this=0x7ffc6a941178, req=...) at src/libstore/build/worker.cc:255 #6 0x00007f351514948e in nix::runWorker<nix::Store::buildPathsWithResults(std::vector<nix::DerivedPath, std::allocator<nix::DerivedPath> > const&, nix::BuildMode, std::shared_ptr<nix::Store>)::$_0>(nix::Worker&, nix::Store::buildPathsWithResults(std::vector<nix::DerivedPath, std::allocator<nix::DerivedPath> > const&, nix::BuildMode, std::shared_ptr<nix::Store>)::$_0) (worker=..., mkGoals=...) at src/libstore/build/entry-points.cc:11 #7 nix::Store::buildPathsWithResults (this=0x5569a0af9e90, reqs=std::vector of length 1, capacity 1 = {...}, buildMode=nix::bmNormal, evalStore=std::shared_ptr<nix::Store> (empty) = {...}) at src/libstore/build/entry-points.cc:62 #8 0x00007f3514fc27fc in nix::daemon::performOp (logger=logger@entry=0x5569a0addd70, store=..., trusted=<optimized out>, recursive=<optimized out>, clientVersion=291, from=..., to=..., op=nix::WorkerProto::Op::BuildPathsWithResults) at src/libstore/daemon.cc:571 #9 0x00007f3514fbf0b5 in nix::daemon::processConnection (store=..., from=..., to=..., trusted=<optimized out>, recursive=<optimized out>) at src/libstore/daemon.cc:1075 ```
jade added the
devx
bug
E/hard
labels 2024-10-15 21:16:35 +00:00
Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/2074 ("packaging: use in-tree capnproto derivation")
  • commit message in cl/2087 ("libstore: always release build/substitution slot tokens")
<!-- GERRIT_LINKBOT: {"cls": [{"backlink": "https://gerrit.lix.systems/c/lix/+/2074", "number": 2074, "kind": "commit message"}, {"backlink": "https://gerrit.lix.systems/c/lix/+/2087", "number": 2087, "kind": "commit message"}], "cl_meta": {"2074": {"change_title": "packaging: use in-tree capnproto derivation"}, "2087": {"change_title": "libstore: always release build/substitution slot tokens"}}} --> This issue was mentioned on Gerrit on the following CLs: * commit message in [cl/2074](https://gerrit.lix.systems/c/lix/+/2074) ("packaging: use in-tree capnproto derivation") * commit message in [cl/2087](https://gerrit.lix.systems/c/lix/+/2087) ("libstore: always release build/substitution slot tokens")
Owner

the CI builder has an ssh://localhost remote builder configured. if it ever chooses this builder to build a derivation it'll deadlock on the derivation lock (which is at that point held by the original daemon instance, and the faux ssh store tries to acquire it as well)

in fact, haven't we seen things like this break before already? only perhaps less frequently for some reason, which could totally be related to the scheduling algorithm changing

the CI builder has an `ssh://localhost` remote builder configured. if it ever chooses this builder to build a derivation it'll deadlock on the derivation lock (which is at that point held by the original daemon instance, and the faux ssh store tries to acquire it as well) in fact, haven't we seen things like this break before already? only perhaps less frequently for some reason, which could *totally* be related to the scheduling algorithm changing
Author
Owner

cc @raito

cc @raito
Sign in to join this conversation.
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lix-project/lix#549
No description provided.