Race condition with gc in remote builds causes paths to vanish #505
Labels
No labels
Area/build-packaging
Area/cli
Area/evaluator
Area/fetching
Area/flakes
Area/language
Area/profiles
Area/protocol
Area/releng
Area/remote-builds
Area/repl
Area/store
bug
crash 💥
Cross Compilation
devx
docs
Downstream Dependents
E/easy
E/hard
E/help wanted
E/reproducible
E/requires rearchitecture
imported
Needs Langver
OS/Linux
OS/macOS
performance
regression
release-blocker
RFD
stability
Status
blocked
Status
invalid
Status
postponed
Status
wontfix
testing
testing/flakey
ux
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: lix-project/lix#505
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This is the root cause for #490
The way that local builds don't get gc'd is through
addTempRoot
, which adds a root that lives for the lifetime of the LocalStore (realistically, the lifetime of the Nix process) or of the active gc process, whichever is longer. It achieves this by putting the path to it in a file with the name of the pid of the Nix process [1], which is then read on gc startup. If there is an active gc, it will open a socket to it and transmit the path to be kept.What actually calls
addTempRoot
?LocalStore::addToStore
(NAR loading)LocalStore::addToStoreFromDump
(NAR loading, also; have not looked into the difference)LocalStore::addTextToStore
DerivationGoal::DerivationGoal
(for temp dirs,.chroot
ostensibly, and probably also the drv file itself)DerivationGoal::loadDerivation
(suspicious correctness, but safe)DerivationGoal::haveDerivation
(output paths)PathSubstitutionGoal::init
(resulting path)[1]: note: this is probably also broken since it assumes that you only have one LocalStore with a given store dir per nix process, which I think could theoretically explode if you use
nix copy
with the same origin and destination?I suspect that this might leak gc roots pretty hard if we modify the way that the daemon works to not fork for each connection, so I am tagging this as "protocol".
It's plausible that temp roots simply don't work over the
ssh
store type:b247ef72dc/src/libstore/gc-store.hh (L94-L98)
. However this does not matter for #490 since that issue is on ssh-ng.@puck inspected the remote build code looking for race conditions and observed:
addToStore actually adds a temp GC root
..but it does so after trying to acquire the global GC lock. it should do so before
and really it should try always connecting to the gcroot server
like, write to the temproots file for that pid, etc first
jade note: this is in reference to
370ac940dd/src/libstore/gc.cc (L118-L182)
. however, i am actually not sure this is a bug? either you don't get the lock (so server is def running, though it could vanish), or you did get the lock (in which case the server is definitely not going to be running). maybe puck can expand on this.My own suspicion as to the cause of this is that it's a result of the following mode of operation of build-remote (
370ac940dd/src/build-remote/build-remote.cc (L294-L347)
):addToStore
was called as above or if it was substituted. It skips copying paths that already exist on the destination (!), which means that those don't get added as temp roots)buildDerivation
(if trusted. [2]) or if not trusted,copyClosure
[3] thenbuildPathsWithResults
.This procedure is broken because it does not add roots for paths that exist on the remote and thus aren't copied (and because they are not copied, they are not added as roots). I think the fix here might be to make
copyPaths
add temp roots for all paths to be copied, since it seems ridiculous to copy paths then have them vanish, right??[2]: this seems badly named because it fails if you are not trusted user; furthermore that code is fucking suspicious; there's some dead paths in it implying untrusted users could previously call it
[3]: why? we just copied the input paths there already, right? or is it that you need to copy the build closure if untrusted?
!!!! Excellent work jade!