helpers don't report errno back to caller (gc-socket can fail if unix-bind-connect helper gets used) #1184
Reference
lix-project/lix#1184
This is related to #1113 and occurs whenever the `TMPDIR` path is long enough that the `$NIX_DATA_DIR/gc-socket/socket` path becomes long enough for the `unix-bind-connect` helper to be used. This issue should exist in every version since that helper was added, up to at least `15c95b95d6`, which is the most recent commit I've tested on as of posting.

The primary reproducer is the `gc-non-blocking` functional test. You can easily reproduce this by running the test with `TMPDIR` set to an incredibly long path (note it needs to be an empty directory to satisfy the testing harness for whatever reason). The error given will be that it fails to chdir into the path, but this is because the gc-socket in general hasn't been created yet.

Normally in this situation the build part of the test will try to add the temp root, fail to grab the global GC lock since the GC is in progress, and then loop on trying to either grab the lock or connect to the GC server's socket to tell it to add the root itself. This relies on being able to catch the normal `SysError` that the failure to connect would throw and then checking the `errno` of the result to make sure it's one of the expected failure modes (`e.errNo == ECONNREFUSED || e.errNo == ENOENT`).

But using the connection helper doesn't produce the same failure reporting; it simply dies when it fails to connect. In this test's case the initial failure is due to the directory not existing yet when the connect helper tries to connect: the test holds the GC process in a state where it has grabbed the lock but not yet started the GC process, which includes building the path to the socket along with binding it into existence. This failure to connect is treated as fatal and a retry is not attempted.

This is likely a larger general issue with the helpers, and we need to find a way to bubble up these errors, likely through some kind of upstream reporting of `errno` instead of just string errors. Exit codes probably won't work, as they technically can't hold the full range of the `int` that `errno` is defined as.

does cl/5490 fix this to your satisfaction?

I think you meant https://gerrit.lix.systems/c/lix/+/5479 here.
So I'm a little confused why I wasn't added as a reviewer on that CL, but that's not quite what I had in mind, since theoretically this issue exists for all helper scripts. What I was planning is to have `DIE_UNLESS_SYS` itself forward errors through a pipe.

I also wanted to see a test for this failure more generically than just it being caught by randomly long `TMPDIR`s in testing or a weird daemon tempdir.

that's intentional. for most helpers the syscall errors are not actionable, and it makes little sense to incur the infrastructure overhead to forward them. `unix-bind-connect` is the only one where it does marginally make sense, and even that only because it's still a terrible mashup of things; ideally we'd move the entire bind/connect thing into the helper and have useful reporting from there instead of transporting errnos :/

Alright I'm good with that, do you want me to open an issue for eventually migrating the rest of the logic into the bind/connect helper?
yeah, sure :) that would also make for a good thing to port to rust because it's pretty much disconnected from everything else, and it'd give us some valuable information about just how feasible porting all the libexec helpers actually is
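
For illustration only, here is a minimal sketch of the "forward `errno` through a pipe" idea discussed in this thread. It is not how Lix's `unix-bind-connect` helper or `DIE_UNLESS_SYS` is implemented. A forked child stands in for the helper, and the names `connectInChild` and `isRetryable` are hypothetical. It assumes the parent owns the socket, the child connects it and then writes its `errno` as a full `int` to a pipe, since a process exit status only carries 8 bits. The sketch does not work around the `sun_path` length limit, which is the reason the real helper exists.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a connect helper: a forked child connects the
// parent's socket and reports its errno back over a pipe.
// Returns 0 on success, otherwise the errno observed by the child (or parent).
int connectInChild(int sock, const std::string & path)
{
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    // This sketch does not handle over-long paths.
    if (path.size() >= sizeof(addr.sun_path)) return ENAMETOOLONG;
    std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);

    int fds[2];
    if (pipe(fds) == -1) return errno;

    pid_t pid = fork();
    if (pid == -1) {
        int saved = errno;
        close(fds[0]);
        close(fds[1]);
        return saved;
    }

    if (pid == 0) {
        // Child: attempt the connect on the inherited socket. The socket is
        // shared with the parent, so a successful connect is visible there.
        close(fds[0]);
        int err = 0;
        if (connect(sock, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == -1)
            err = errno;
        // Forward the complete int instead of squeezing it into an exit code.
        ssize_t n = write(fds[1], &err, sizeof(err));
        _exit(n == static_cast<ssize_t>(sizeof(err)) ? 0 : 1);
    }

    // Parent: read the forwarded errno, then reap the child.
    close(fds[1]);
    int err = 0;
    ssize_t n = read(fds[0], &err, sizeof(err));
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    // If the child died before reporting, fall back to a generic error.
    if (n != static_cast<ssize_t>(sizeof(err))) return EIO;
    return err;
}

// The two failure modes the issue says the retry loop expects.
bool isRetryable(int err)
{
    return err == ECONNREFUSED || err == ENOENT;
}
```

With something along these lines, a caller could keep retrying while `isRetryable(connectInChild(sock, path))` holds and treat any other non-zero result as fatal, which is the same check the non-helper path performs on `SysError`.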