helpers don't report errno back to caller (gc-socket can fail if unix-bind-connect helper gets used) #1184
This is related to #1113 and occurs whenever the `TMPDIR` path is long enough that the `$NIX_DATA_DIR/gc-socket/socket` path used becomes long enough that the `unix-bind-connect` helper gets used. This issue should exist in every version since that helper was added, up to at least `15c95b95d6`, which is the most recent commit I've tested on as of posting.

The primary reproducer is the `gc-non-blocking` functional test. You can easily see this reproduce by running the test with `TMPDIR` set to an incredibly long path (note it needs to be an empty directory to satisfy the testing harness for whatever reason). The error given will be that it fails to chdir into the path, but this is because the gc-socket in general hasn't been created yet.

Normally in this situation the build part of the test will try to add the temp root, fail to grab the global GC lock since the GC is in progress, and then loop on trying to either grab the lock or connect to the GC server's socket to ask it to add the root itself. This relies on being able to catch the normal `SysError` that the failure to connect would throw and then checking the `errno` of the result to make sure it's one of the expected failure modes (`e.errNo == ECONNREFUSED || e.errNo == ENOENT`).

But using the connection helper doesn't produce the same failure reporting; it simply dies when it fails to connect. In this test's case the initial failure is due to the directory not existing yet when the connect helper tries to connect: the test holds the GC process in a state of having grabbed the lock but not yet started the GC proper, which includes creating the path to the socket and binding the socket into existence. This failure to connect is treated as fatal and a retry is not attempted.
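To make the expected behaviour concrete, here is a minimal sketch of the errno-based retry described above. It is not the Lix implementation: `connectWithRetry`, `ConnectFn`, and the attempt/delay parameters are hypothetical names for illustration, and `std::system_error` stands in for Lix's `SysError` with its `errNo` field. The real loop also retries grabbing the global GC lock, which is omitted here.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <system_error>
#include <thread>

// Hypothetical stand-in for the connect step. On failure it throws
// std::system_error carrying the errno value of the failed call.
using ConnectFn = std::function<int()>;

// Keep retrying while the failure is one of the two "GC server is not
// ready yet" cases (ECONNREFUSED or ENOENT). Any other error, or running
// out of attempts, is rethrown to the caller.
int connectWithRetry(const ConnectFn & tryConnect, int maxAttempts,
                     std::chrono::milliseconds delay)
{
    assert(maxAttempts >= 1);
    for (int attempt = 1;; ++attempt) {
        try {
            return tryConnect(); // returns the connected fd on success
        } catch (const std::system_error & e) {
            const int err = e.code().value();
            const bool transient = err == ECONNREFUSED || err == ENOENT;
            if (!transient || attempt >= maxAttempts) throw;
            std::this_thread::sleep_for(delay);
        }
    }
}
```

The point of the sketch is that the retry decision depends on the numeric `errno`. If the connect step only reports a string, or just exits, the caller has nothing to make that decision with, which matches the failure described here.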
This is likely a larger general issue with the helpers, and we need to find a way to bubble up these errors, likely through some kind of upstream reporting of `errno` instead of just string errors. Exit codes probably won't work, as they technically can't hold a full `int`, which is what `errno` is defined as.

Does cl/5490 fix this to your satisfaction?

I think you meant https://gerrit.lix.systems/c/lix/+/5479 here.
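On the exit-code concern raised in the issue description, the following sketch shows why an exit status alone can't carry an arbitrary `int`, and one possible way a helper could report `errno` out of band. `runFakeHelper` and `HelperResult` are hypothetical names, the "helper" is just a forked child that pretends to fail, and none of this is claimed to be how the linked change works.

```cpp
#include <cassert>
#include <cerrno>
#include <optional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct HelperResult {
    int exitStatus = -1;              // what WEXITSTATUS reports (low 8 bits only)
    std::optional<int> reportedErrno; // full int sent back through a pipe
};

// Fork a child that pretends its connect() failed with `childErrno`.
// The child reports the value twice: through a pipe and as its exit status.
HelperResult runFakeHelper(int childErrno)
{
    HelperResult result;
    int fds[2];
    if (pipe(fds) != 0) return result;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (pid == 0) {
        // Child: write the full int to the pipe, then exit with the same value.
        close(fds[0]);
        ssize_t written = write(fds[1], &childErrno, sizeof childErrno);
        (void) written;
        _exit(childErrno); // only the low 8 bits reach the parent
    }

    // Parent: read the reported errno, then reap the child.
    close(fds[1]);
    int reported = 0;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    if (n == static_cast<ssize_t>(sizeof reported)) result.reportedErrno = reported;

    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result.exitStatus = WEXITSTATUS(status);
    return result;
}
```

With a value such as 300, the pipe delivers 300 while the exit status comes back as 44 (300 modulo 256). Small values like `ENOENT` survive either route, which is why exit codes only fail to work "technically".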