darwin: work around PROC_PIDLISTFDS on processes with no fds

This has been causing various seemingly spurious CI failures, as well as
failures for people running tests on beta builds.

lix> ++(nix-collect-garbage-dry-run.sh:20) nix-store --gc --print-dead
lix> ++(nix-collect-garbage-dry-run.sh:20) wc -l
lix> finding garbage collector roots...
lix> error: Listing pid 87261 file descriptors: Undefined error: 0

There is no real way to write a proper test for this, other than to
start a process like the following:

#include <unistd.h>

int main(void) {
    /* Close every fd we may have inherited, so the process has none open. */
    for (int i = 0; i < 1000; ++i) {
        close(i);
    }
    sleep(10000);
}

and then let Lix's gc look at it.

I have relatively high confidence this *will* fix the problem, since I
have manually confirmed that the behaviour of the libproc call is
as-unexpected, and it would perfectly explain the observed symptom.
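
For illustration only, here is a portable sketch of the failure mode and of
the caller-side check. It does not call libproc: `RawCall`, `wrappedListFds`,
`listFdBytes` and the three `raw*` functions are hypothetical stand-ins, and
the wrapper is assumed to do nothing more than map a raw -1 result to 0.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical stand-in for the raw system call: returns a byte count,
// or -1 with errno set on failure.
using RawCall = int (*)(int pid);

// Hypothetical stand-in for the library wrapper. It turns the raw -1 error
// result into 0, so a genuine 0 ("no fds") looks the same as a failure.
int wrappedListFds(RawCall raw, int pid) {
    int ret = raw(pid);
    return ret == -1 ? 0 : ret;
}

// Caller-side check: clear errno before the call, then treat a result <= 0
// with errno still 0 as "no file descriptors" instead of an error.
int listFdBytes(RawCall raw, int pid) {
    assert(raw != nullptr);
    errno = 0;
    int ret = wrappedListFds(raw, pid);
    if (ret <= 0) {
        int savedErrno = errno; // capture before anything else can touch it
        if (savedErrno == 0) {
            return 0; // the call succeeded; there was simply nothing to list
        }
        throw std::system_error(
            savedErrno, std::generic_category(),
            "Listing pid " + std::to_string(pid) + " file descriptors");
    }
    return ret;
}

// Three simulated raw-call outcomes (all hypothetical).
int rawSomeFds(int) { return 64; }                       // some fds listed
int rawNoFds(int) { return 0; }                          // success, zero fds
int rawNoSuchProcess(int) { errno = ESRCH; return -1; }  // real failure
```

With these stand-ins, `listFdBytes(rawNoFds, pid)` returns 0 rather than
throwing, while `listFdBytes(rawNoSuchProcess, pid)` still reports the error.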

Fixes: lix-project/lix#446
Change-Id: I67669b98377af17895644b3bafdf42fc33abd076
jade 2024-08-07 02:00:50 -07:00
parent 529eed74c4
commit 1437d3df15
2 changed files with 31 additions and 1 deletions


@@ -0,0 +1,15 @@
---
synopsis: "Fix unexpectedly-successful GC failures on macOS"
cls: 1723
issues: fj#446
credits: jade
category: Fixes
---
Has the following happened to you on macOS? This failure has been successfully eliminated, thanks to our successful deployment of advanced successful-failure detection technology (it's just `if (failed && errno == 0)`. Patent pending<sup>not really</sup>):
```
$ nix-store --gc --print-dead
finding garbage collector roots...
error: Listing pid 87261 file descriptors: Undefined error: 0
```


@@ -56,12 +56,27 @@ void DarwinLocalStore::findPlatformRoots(UncheckedRoots & unchecked)
 while (fdBufSize > fds.size() * sizeof(struct proc_fdinfo)) {
     // Reserve some extra size so we don't fail too much
     fds.resize((fdBufSize + fdBufSize / 8) / sizeof(struct proc_fdinfo));
+    errno = 0;
     fdBufSize = proc_pidinfo(
         pid, PROC_PIDLISTFDS, 0, fds.data(), fds.size() * sizeof(struct proc_fdinfo)
     );
+    // errno == 0???! Yes, seriously. This is because macOS has a
+    // broken syscall wrapper for proc_pidinfo that has no way of
+    // dealing with the system call successfully returning 0. It
+    // takes the -1 error result from the errno-setting syscall
+    // wrapper and turns it into a 0 result. But what if the system
+    // call actually returns 0? Then you get an errno of success.
+    //
+    // https://github.com/apple-opensource/xnu/blob/4f43d4276fc6a87f2461a3ab18287e4a2e5a1cc0/libsyscall/wrappers/libproc/libproc.c#L100-L110
+    // https://git.lix.systems/lix-project/lix/issues/446#issuecomment-5483
+    // FB14695751
     if (fdBufSize <= 0) {
-        throw SysError("Listing pid %1% file descriptors", pid);
+        if (errno == 0) {
+            break;
+        } else {
+            throw SysError("Listing pid %1% file descriptors", pid);
+        }
     }
 }
 fds.resize(fdBufSize / sizeof(struct proc_fdinfo));