helpers don't report errno back to caller (gc-socket can fail if unix-bind-connect helper gets used) #1184

Closed
opened 2026-04-18 22:05:02 +00:00 by lunaphied · 6 comments
Owner

This is related to #1113 and occurs whenever the `TMPDIR` path is long enough that the resulting `$NIX_DATA_DIR/gc-socket/socket` path grows long enough for the `unix-bind-connect` helper to be used. The issue should exist in every version since that helper was added, up to at least 15c95b95d609a15bc66835f487056dc831162f42, which is the most recent commit I've tested as of posting.
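For context on why path length matters at all: `sockaddr_un::sun_path` is a small fixed-size buffer (108 bytes on Linux, 104 on macOS), so a sufficiently long socket path cannot be passed to `bind`/`connect` directly. A minimal sketch of such a length check follows. The function name `fitsInSunPath` is hypothetical, and it is an assumption that the helper is selected on exactly this condition.

```cpp
#include <cassert>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>

// Hypothetical check, not Lix's actual code: reports whether `path`,
// including its NUL terminator, fits into sockaddr_un::sun_path.
bool fitsInSunPath(const std::string & path)
{
    return path.size() + 1 <= sizeof(sockaddr_un::sun_path);
}
```

With a check like this, a short path such as `/tmp/gc-socket/socket` fits, while the same suffix under a 200-character `TMPDIR` does not.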

The primary reproducer is the `gc-non-blocking` functional test. You can easily reproduce this by running the test with `TMPDIR` set to a very long path (note that it needs to be an empty directory to satisfy the testing harness, for whatever reason). The error reported is a failure to chdir into the path, but that only happens because the `gc-socket` directory as a whole hasn't been created yet.

Normally in this situation the build part of the test tries to add the temp root, fails to grab the global GC lock because the GC is in progress, and then loops, trying either to grab the lock or to connect to the GC server's socket and ask it to add the root itself. This relies on catching the normal `SysError` that a failed connect would throw and then checking its `errno` to make sure it's one of the expected failure modes (`e.errNo == ECONNREFUSED || e.errNo == ENOENT`).
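The retry behaviour described above can be sketched roughly as follows. This is not Lix's implementation: `SysErrorSketch`, `connectUnix` and `connectWithRetry` are hypothetical names, the lock-acquisition half of the loop is left out, and the attempt limit is only there to keep the example bounded.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Stand-in for an exception type that carries errno (illustrative only).
struct SysErrorSketch : std::runtime_error
{
    int errNo;
    SysErrorSketch(int errNo, const std::string & what)
        : std::runtime_error(what + ": " + std::strerror(errNo)), errNo(errNo)
    {
    }
};

// Connects to a Unix domain socket and returns the fd; throws on failure.
int connectUnix(const std::string & path)
{
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    if (path.size() + 1 > sizeof(addr.sun_path))
        throw SysErrorSketch(ENAMETOOLONG, "socket path too long");
    std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1)
        throw SysErrorSketch(errno, "creating socket");
    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == -1) {
        int saved = errno; // close() may clobber errno
        close(fd);
        throw SysErrorSketch(saved, "connecting to " + path);
    }
    return fd;
}

// Retries while the failure is one of the expected "server not ready" errors.
// Returns the fd, or -1 if every attempt failed with an expected error.
int connectWithRetry(const std::string & path, int maxAttempts)
{
    for (int i = 0; i < maxAttempts; ++i) {
        try {
            return connectUnix(path);
        } catch (const SysErrorSketch & e) {
            if (e.errNo != ECONNREFUSED && e.errNo != ENOENT)
                throw; // unexpected failure: propagate
        }
    }
    return -1;
}
```

The point of the sketch is that the retry decision depends on a numeric `errno` reaching the caller; a helper that only reports a string or dies cannot feed this kind of loop.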

But using the connection helper doesn't produce the same failure reporting; it simply dies when it fails to connect. In this test's case the initial failure happens because the directory doesn't exist yet when the connect helper tries to connect: the test holds the GC process in a state where it has grabbed the lock but not yet started the GC itself, which includes building the path to the socket and binding it into existence. This failure to connect is treated as fatal and no retry is attempted.

This is likely a larger general issue with the helpers, and we need to find a way to bubble up these errors, likely by reporting `errno` upstream in some form instead of only string errors. Exit codes probably won't work, since they technically can't hold a full `int`, which is how `errno` is defined.
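To illustrate the exit-code limitation: with `waitpid`, the parent only sees the low 8 bits of the child's exit value. A minimal sketch (the function name `observedExitStatus` is made up for this example):

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that exits with `code` and returns the status the parent
// observes through waitpid(), or -1 if fork/waitpid fails or the child
// did not exit normally.
int observedExitStatus(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code); // child: exit immediately with the requested value

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Here `observedExitStatus(2)` yields 2, but `observedExitStatus(300)` yields 44 (300 modulo 256), so an arbitrary `int` does not survive the round trip.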

Owner

does cl/5490 fix this to your satisfaction?

Owner

I think you meant https://gerrit.lix.systems/c/lix/+/5479 here.

Author
Owner

So I'm a little confused why I wasn't added as a reviewer on that CL, but that's not *quite* what I had in mind, since theoretically this issue exists for all helper scripts. What I was planning is to have `DIE_UNLESS_SYS` itself forward errors through a pipe.

I also wanted to see a test that exercises this failure more generically, rather than it only being caught by randomly long `TMPDIR`s in testing or a weird daemon tempdir.
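A rough sketch of what forwarding `errno` through a pipe could look like in general. This does not reproduce `DIE_UNLESS_SYS` or the actual helper protocol; `runHelperReportingErrno` is a hypothetical name, and a forked child doing a single `chdir` stands in for a real helper.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs a stand-in "helper" child that attempts chdir(dir) and writes the
// resulting errno (0 on success) to a pipe as a full int.
// Returns that value, or -1 if the plumbing itself failed.
int runHelperReportingErrno(const char * dir)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: perform the syscall and forward errno instead of only dying.
        close(fds[0]);
        int err = 0;
        if (chdir(dir) == -1)
            err = errno;
        ssize_t n = write(fds[1], &err, sizeof(err));
        _exit(n == static_cast<ssize_t>(sizeof(err)) && err == 0 ? 0 : 1);
    }

    // Parent: read the forwarded errno, then reap the child.
    close(fds[1]);
    int err = -1;
    ssize_t n = read(fds[0], &err, sizeof(err));
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == static_cast<ssize_t>(sizeof(err)) ? err : -1;
}
```

Under these assumptions the caller gets back a real `errno` value such as `ENOENT` and can decide whether to retry, which is what the string-only reporting prevents.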

Owner

> that's not quite what I had in mind, since theoretically this issue exists for all helper scripts. What I was planning is to have `DIE_UNLESS_SYS` itself forward errors through a pipe

that's intentional. for most helpers the syscall errors are not actionable and it makes little sense to incur the infrastructure overhead to forward them. `unix-bind-connect` is the only one where it does marginally make sense, and even that only because it's still a terrible mashup of things; ideally we'd move the *entire* bind/connect thing into the helper and have useful reporting from there instead of transporting errnos :/

Author
Owner

Alright I'm good with that, do you want me to open an issue for eventually migrating the rest of the logic into the bind/connect helper?

Owner

yeah, sure :) that would also make for a good thing to port to rust because it's pretty much disconnected from everything else, and it'd give us some valuable information about just how feasible porting all the libexec helpers actually is

Reference
lix-project/lix#1184