<C-C>
(Control-C/Ctrl-C) no longer interrupts builds on HEAD #729
Labels
No labels
Affects/CppNix
Affects/Nightly
Affects/Only nightly
Affects/Stable
Area/build-packaging
Area/cli
Area/evaluator
Area/fetching
Area/flakes
Area/language
Area/lix ci
Area/nix-eval-jobs
Area/profiles
Area/protocol
Area/releng
Area/remote-builds
Area/repl
Area/repl/debugger
Area/store
bug
Context
contributors
Context
drive-by
Context
maintainers
Context
RFD
crash 💥
Cross Compilation
devx
docs
Downstream Dependents
E/easy
E/hard
E/help wanted
E/reproducible
E/requires rearchitecture
imported
Language/Bash
Language/C++
Language/NixLang
Language/Python
Language/Rust
Needs Langver
OS/Linux
OS/macOS
performance
regression
release-blocker
stability
Status
blocked
Status
invalid
Status
postponed
Status
wontfix
testing
testing/flakey
Topic/Large Scale Installations
ux
No milestone
No project
No assignees
4 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: lix-project/lix#729
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Reported elsewhere by @jade. Current head seems to fail to interrupt builds properly when getting interrupted with
<C-C>
. (As of85a140accb
)cannot reproduce that so far, do you have a specific example that doesn't work?
I caused this on: 2.93.0-dev-pre20250324-4181796 on macOS.
Setup: running
NIX_CONFIG="builders = @/path/to/machines" nix flake check --eval-system x86_64-linux -Lv
in the lix-module repo when having to build a lix, then ctrl-c'ing the build of the lix.Symptom: Ctrl-C of the client does not cause the daemon process to stop. Further, the ssh process held by the daemon stays alive. Further, the
nix-daemon --stdio
on the remote builder's side remains alive even after killing the Mac's daemon.macOS daemon back trace:
Remote builder end:
we cannot reproduce this on linux with a darwin eval-system. could be a platform issue of some sort?
likely. it happened yesterday on my Mac.
then we can't be of any assistance here :(
This issue was mentioned on Gerrit on the following CLs:
2.91 used to just terminate the daemon by it trying to write to a hung-up socket which caused it to terminate. This used to happen:
This was probably a bug that MonitorFdHup wasn't catching it? I don't know why the broken pipe isn't causing the 2.93 daemon to throw an error and terminate on that basis.
Interestingly, it seems that it needs to get to a sufficiently complex build to actually have the problem to begin with; the above does not repro it. This is wild.
Okay. So it only happens if you
nix eval
on the same connection first.... WHAT.Reproducer (thanks to horrors pointing to the right direction):
Here's the minimal debug log of it. I've disabled build-hook and removed the queryPaths invocations in
nix-build
, so it's just AddToStore, AddToStore, then BuildPaths.I had the hypothesis that kqueue and poll on the same fd were causing the relevant event to disappear, since it has been observed to not be delivered to the monitor fd hup thread.
However:
So I still don't know what is going wrong here, but a trivial program doesn't do a bad thing, so presumably there's more to it than that. Time for more dtrace.
This is my test setup patch on Lix:
The stuff in nar.cc is unnecessary; I was hoping to repro without needing an evaluator but it turned out that the evaluator isn't that bad since I can do it without nixpkgs.
The repro above inconsistently creates the problem. I have dtruss output, but it's not of much help because the kqueue calls aren't dumped usefully.
I think the first step I should do dtrace-wise is to dump kqueue calls correctly.
This seems to disprove it being kqueue misuse, since the kqueue is not touching that fd. However, it doesn't explain why it is happening. I wonder if we are doing some stupid shit like not CLOEXEC'ing it and having it fall into a child process? Beats me, but it is progress!
The client is undeniably closing it and its side is getting DEALLOC'd. That's not quite the problem.
Nobody's calling knote. We're not dispatching SOF_KNOTE (and this is probably fine). We might be dispatching SB_KNOTE, but, hm.
The sb is the peer receiving it right?
flags=0x144 is SB_UNIX | SB_KNOTE | SB_RECV
flags=0x104 is SB_UNIX | SB_RECV
HEY why is sb_waiters == 0????! OH, no, it's blocking vs knote!! That makes sense, that's fine.
So it's supposed to be delivered as a knote.
HEY WAIT why is sowakeup not getting called on the close sometimes???!
That's:
I think this is just sketchy stack traces... It doesn't make any sense that the stacks diverge at soclose_locked.
Probably is calling into sodisconnectlocked and then pru_disconnect is uipc_disconnect?
Either way, we have our smoke here. sowakeup isn't getting called when the bad behaviour happens. Now, where's the fire...
I have an update: namely, that Apple's poll implementation is really busted lmao. What it's doing is that it's receiving a kevent of some kind, which is one-shot, which is NOT the EOF, and then it is unregistering from notifying about the EOF. It is not even delivering even a spurious notification to the nix daemon (which, if it DID do that, it would be fine and would retry!!). Instead it is
not returning from poll. Hilarious!!
This hot mess is what I've been using to diagnose this:
Things that are the case by disconnect time in the fault case:
On the client side socket:
WELL THERES YOUR PROBLEM! There's not a knote attached to it. How did that happen??
On the good client side socket:
To recap the problem:
uipc_disconnect
's call tree isn't callingsowakeup
fromsoisdisconnected
of the client's socket because daemon_side_sock.so_rcv.sb_flags does not haveSB_KNOTE
set on it.Contextual note: uipc_disconnect calls soisdisconnected on the facing socket first: that is, the daemon's socket.
Sidebar on defunct
I think that defunct isn't relevant; it seems to be some kind of idle kill thing?
BUT WHY
Catch a load of this: you call poll() on the daemon and then it unregisters the bloody event without returning anything to userland! lmfao!!!!!
knote->kn_flags = 0x1011: EV_ADD | EV_ONESHOT | EV_FLAG0
Call stack:
And then it gets dropped here:
8d741a5de7/bsd/kern/kern_event.c (L4448-L4457)
Trace:
I think that my advice for us is that we write a native kqueue based implementation of MonitorFdHup because Apple apparently did not write a poll implementation that doesn't deadlock, and it appears that the problem is in poll receiving spurious events internally and then turning itself off. Fortunately writing such a kqueue usage isn't exactly hard.
The other option is that we add enough spurious things to the poll event description that it delivers the events anyway.
While writing this comment I realized what the actual problem is: if the daemon isn't blockingly waiting on the socket, the kqueue of
poll
gets notified about it becoming readable:sb_waiters == 0 (ie no blocking waiting) always precedes an instance of non-delivered EOF event.
Indeed poll(2) is using EVFILT_READ:
However, it seems like kqueue_scan is not returning inside of poll either, which is .. more baffling. Clearly it got the event and deleted it but it still didn't return?
Diagnosing this bug was fun.
Aha!
kec_process_noutputs
is not getting incremented becauserevents
remains zero due to the lack of matchingevents
being requested and thuskqueue_process
is not returning early as intended, I believe!8d741a5de7/bsd/kern/kern_event.c (L7870-L7873)
That is:
The
poll_callback
is called here:8d741a5de7/bsd/kern/kern_event.c (L4465-L4467)
Okay, so, yeah. The Apple poll implementation is broken. We could maybe bisect why it wasn't happening on Lix 2.91; I have two hypotheses:
The other alternative to writing a correct implementation based on kqueue directly is that we set a poll timeout of the maximum time that we tolerate stale daemons lying around, since I think that the timeout will still probably be respected, and poll should be fine being level-triggered.
https://openradar.appspot.com/FB17447257
Copy pasted report
poll(2) fails to deliver HUP events when socket becomes readable before HUP, with the event filter POLLHUP
First, let's start out with a POSIX violation: you are not supposed to require POLLHUP to deliver HUP events, but meh, XNU does a POSIX violation, sure, that's rdar://37537852.
Not this bug.
Our upstream bug URL with lots of kernel debugging details: #729 (comment)
This bug was found in Lix, which has a thread listening for HUP events on client connections and a main thread performing blocking IO on the client connection.
The HUP events thread calls poll(2) on two FDs, the client connection one with events = POLLHUP, and a terminate wakeup pipe with events = POLLIN.
The terminate wakeup pipe is used to knock the HUP events thread out of sleep when the main thread wants to terminate it.
We observed that if the following situation happens, the kernel loses the HUP event:
Then the call to poll(2) does not terminate.
Expected behaviour:
poll(2) returns when an FD with events=POLLHUP receives a HUP event.
Actual behaviour:
poll(2) does not return.
Kernel bug diagnosis:
This is because the XNU kernel translates poll(2) with events = POLLHUP to kqueue with a filter of EVFILT_READ and using the flag EV_ONESHOT:
8d741a5de7/bsd/kern/sys_generic.c (L1749-L1757)
The kernel kevent
poll_callback
receives the event, changes no flags since it's not an EOF so it doesn't match the poll filter, then doesn't increment kec_process_noutputs.Since it's EV_ONESHOT, the event subscription is deleted so no further events arrive on that fd:
8d741a5de7/bsd/kern/kern_event.c (L7870-L7873)
Incrementing kec_process_noutputs is required to cause the call to terminate:
8d741a5de7/bsd/kern/kern_event.c (L7870-L7873)
This also only gets stuck if you have two fds going into poll, since I am guessing that listening on no kevents causes waiting for kevents to terminate.
That is:
This is a bug.
I have attached the janky dtrace script I used to diagnose this (n.b. requires process name ==
nix
) as well as a reproducer that will terminate when built with Clang on Linux, but will block forever on macOS.Build the reproducer with
clang++ test.cc -o test
.This is the symptom as observed by the dtrace script
wip.d
that led me to realize what the problem was:sb_waiters == 0 (ie no blocking waiting) always precedes an instance of non-delivered EOF event.
test.cc
wip.d