helpers don't report errno back to caller (gc-socket can fail if unix-bind-connect helper gets used) #1184

Open
opened 2026-04-18 22:05:02 +00:00 by lunaphied · 2 comments
Owner

This is related to #1113 and occurs whenever the `TMPDIR` path is long enough that the `$NIX_DATA_DIR/gc-socket/socket` path becomes long enough that the `unix-bind-connect` helper gets used. This issue should exist in every version since that helper was added, up to at least 15c95b95d609a15bc66835f487056dc831162f42, which is the most recent commit I've tested as of posting.
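
For illustration, the length condition can be sketched as below. This assumes the helper is chosen because the socket path no longer fits in `sockaddr_un::sun_path` (commonly 108 bytes on Linux); the function name `needsSocketHelper` is hypothetical and not taken from the Lix source.

```cpp
#include <string>
#include <sys/socket.h>
#include <sys/un.h>

// Hypothetical check, not the actual Lix code: true when `path` plus its
// terminating NUL does not fit in sockaddr_un::sun_path, so a plain
// bind()/connect() on the full path is not possible.
bool needsSocketHelper(const std::string & path)
{
    return path.size() + 1 > sizeof(sockaddr_un::sun_path);
}
```

A short path such as `/tmp/gc-socket/socket` fits, while a path of a few hundred bytes does not, which is the situation a very long `TMPDIR` produces.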

The primary reproducer is the `gc-non-blocking` functional test. You can easily reproduce this by running the test with `TMPDIR` set to a very long path (note that it needs to be an empty directory to satisfy the testing harness for whatever reason). The error given will be that it fails to chdir into the path, but this is only because the `gc-socket` directory hasn't been created yet.

Normally in this situation the build part of the test will try to add the temp root and fail to grab the global GC lock since the GC is in progress. That code should then loop, trying either to grab the lock or to connect to the GC server's socket to tell it to add the root itself. This relies on catching the normal `SysError` that a failed connect would throw and then checking its `errno` to make sure it's one of the expected failure modes (`e.errNo == ECONNREFUSED || e.errNo == ENOENT`).
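
A minimal sketch of that retry logic, to make the dependency on `errno` concrete. The `SysError` here is a stand-in for the real class, and `connectWithRetry`, the callback, and the attempt limit are all hypothetical; the actual loop also alternates with trying to grab the GC lock, which is left out.

```cpp
#include <cerrno>
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for Lix's SysError: an exception that remembers errno.
struct SysError : std::runtime_error
{
    int errNo;
    SysError(int errNo, const std::string & what)
        : std::runtime_error(what), errNo(errNo) {}
};

// Calls `tryConnect` until it succeeds, retrying only on the two errno
// values that mean "the GC socket is not there yet". Any other SysError
// propagates. Returns the number of attempts used.
int connectWithRetry(const std::function<void()> & tryConnect, int maxAttempts)
{
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            tryConnect();
            return attempt;
        } catch (SysError & e) {
            if (e.errNo != ECONNREFUSED && e.errNo != ENOENT)
                throw; // unexpected failure mode: do not retry
        }
    }
    // The attempt limit is an assumption of this sketch.
    throw SysError(ETIMEDOUT, "gave up waiting for the GC socket");
}
```

The retry decision is made purely from `e.errNo`, so a failure that arrives without an errno (for example, as a plain error string) cannot take the retry path.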

But the connection helper doesn't produce the same failure reporting; it simply dies when it fails to connect. In this test's case the initial failure is due to the directory not existing yet when the connect helper tries to connect. This is because the test holds the GC process in a state where it has grabbed the lock but not yet started the GC proper, which includes building the path to the socket and binding it into existence. This failure to connect is treated as fatal and a retry is not attempted.

This is likely a larger general issue with the helpers, and we need to find a way to bubble up these errors, likely through some kind of reporting of `errno` back to the caller instead of just string errors. Exit codes probably won't work, as they technically can't hold a full `int`, which is how `errno` is defined.
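
One possible shape for this, sketched under assumptions: the child writes the raw `errno` value into a pipe and the parent reads it back. This is not the actual helper protocol; `runHelperReportingErrno` and `observedExitStatus` are hypothetical names, and a plain `fork()` stands in for however the real helper is spawned. The second function shows the exit-code limitation: only the low 8 bits of the status reach the parent.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `work` in a forked child. `work` returns 0 on success or an errno
// value on failure. On failure the child writes that value to a pipe, and
// the parent returns it; on success the parent returns 0.
int runHelperReportingErrno(int (*work)())
{
    int fds[2];
    if (pipe(fds) == -1) return errno;

    pid_t pid = fork();
    if (pid == -1) {
        int err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }
    if (pid == 0) {
        close(fds[0]);
        int err = work();
        if (err != 0) {
            ssize_t written = write(fds[1], &err, sizeof(err));
            (void) written; // nothing useful to do if this fails
        }
        _exit(err == 0 ? 0 : 1);
    }

    close(fds[1]);
    int err = 0;
    // Sketch only: a robust version would handle EINTR and short reads.
    ssize_t n = read(fds[0], &err, sizeof(err));
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == static_cast<ssize_t>(sizeof(err)) ? err : 0;
}

// Forks a child that exits with `code` and returns what the parent sees.
int observedExitStatus(int code)
{
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) _exit(code);
    int status = 0;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status); // only the low 8 bits of `code` survive
}
```

With something like this, the caller could rebuild a `SysError` from the returned value and keep the existing `ECONNREFUSED`/`ENOENT` check unchanged.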

Owner

does cl/5490 fix this to your satisfaction?
Owner

I think you meant https://gerrit.lix.systems/c/lix/+/5479 here.
Reference
lix-project/lix#1184