Race condition with gc in remote builds causes paths to vanish #505

Open
opened 2024-09-09 03:16:15 +00:00 by jade · 1 comment
Owner

This is the root cause for #490

The way that local builds don't get gc'd is through addTempRoot, which adds a root that lives for the lifetime of the LocalStore (realistically, the lifetime of the Nix process) or of the active gc process, whichever is longer. It achieves this by putting the path to it in a file with the name of the pid of the Nix process [1], which is then read on gc startup. If there is an active gc, it will open a socket to it and transmit the path to be kept.

What actually calls addTempRoot?

  • LocalStore::addToStore (NAR loading)
  • LocalStore::addToStoreFromDump (NAR loading, also; have not looked into the difference)
  • LocalStore::addTextToStore
  • DerivationGoal::DerivationGoal (for temp dirs, .chroot ostensibly, and probably also the drv file itself)
  • DerivationGoal::loadDerivation (suspicious correctness, but safe)
  • DerivationGoal::haveDerivation (output paths)
  • PathSubstitutionGoal::init (resulting path)

[1]: note: this is probably also broken since it assumes that you only have one LocalStore with a given store dir per nix process, which I think could theoretically explode if you use nix copy with the same origin and destination?

I suspect that this might leak gc roots pretty hard if we modify the way that the daemon works to not fork for each connection, so I am tagging this as "protocol".

It's plausible that temp roots simply don't work over the ssh store type: b247ef72dc/src/libstore/gc-store.hh (L94-L98). However this does not matter for #490 since that issue is on ssh-ng.

@puck inspected the remote build code looking for race conditions and observed:

  • addToStore actually adds a temp GC root

    ..but it does so after trying to acquire the global GC lock. it should do so before

    and really it should try always connecting to the gcroot server

    like, write to the temproots file for that pid, etc first

    jade note: this is in reference to 370ac940dd/src/libstore/gc.cc (L118-L182). however, i am actually not sure this is a bug? either you don't get the lock (so server is def running, though it could vanish), or you did get the lock (in which case the server is definitely not going to be running). maybe puck can expand on this.

My own suspicion as to the cause of this is that it's a result of the following mode of operation of build-remote (370ac940dd/src/build-remote/build-remote.cc (L294-L347)):

  • Copy input paths (does keep these, but only if addToStore was called as above or if it was substituted. It skips copying paths that already exist on the destination (!), which means that those don't get added as temp roots)
  • Read derivation (uhhhhhhh why)
  • Call buildDerivation (if trusted. [2]) or if not trusted, copyClosure [3] then buildPathsWithResults.

This procedure is broken because it does not add roots for paths that exist on the remote and thus aren't copied (and because they are not copied, they are not added as roots). I think the fix here might be to make copyPaths add temp roots for all paths to be copied, since it seems ridiculous to copy paths then have them vanish, right??

[2]: this seems badly named because it fails if you are not trusted user; furthermore that code is fucking suspicious; there's some dead paths in it implying untrusted users could previously call it
[3]: why? we just copied the input paths there already, right? or is it that you need to copy the build closure if untrusted?

This is the root cause for https://git.lix.systems/lix-project/lix/issues/490 The way that local builds don't get gc'd is through `addTempRoot`, which adds a root that lives for the lifetime of the LocalStore (realistically, the lifetime of the Nix process) or of the active gc process, whichever is longer. It achieves this by putting the path to it in a file with the name of the pid of the Nix process [1], which is then read on gc startup. If there is an active gc, it will open a socket to it and transmit the path to be kept. What actually calls `addTempRoot`? - `LocalStore::addToStore` (NAR loading) - `LocalStore::addToStoreFromDump` (NAR loading, also; have not looked into the difference) - `LocalStore::addTextToStore` - `DerivationGoal::DerivationGoal` (for temp dirs, `.chroot` ostensibly, and probably also the drv file itself) - `DerivationGoal::loadDerivation` (suspicious correctness, but safe) - `DerivationGoal::haveDerivation` (output paths) - `PathSubstitutionGoal::init` (resulting path) [1]: note: this is probably also broken since it assumes that you only have one LocalStore with a given store dir per nix process, which I think could theoretically explode if you use `nix copy` with the same origin and destination? I suspect that this might leak gc roots pretty hard if we modify the way that the daemon works to not fork for each connection, so I am tagging this as "protocol". It's plausible that temp roots simply don't work over the `ssh` store type: https://git.lix.systems/lix-project/lix/src/b247ef72dc7bcc857288c0ddcceb3e42f76a78f1/src/libstore/gc-store.hh#L94-L98. However this does not matter for #490 since that issue is on ssh-ng. @puck inspected the remote build code looking for race conditions and observed: - addToStore actually adds a temp GC root ..but it does so _after_ trying to acquire the global GC lock. it should do so _before_ and really it should try always connecting to the gcroot server like, write to the temproots file for that pid, etc first jade note: this is in reference to https://git.lix.systems/lix-project/lix/src/370ac940dd7816ad4052fafa4e0f8d17784fa16b/src/libstore/gc.cc#L118-L182. however, i am actually not sure this is a bug? either you don't get the lock (so server is def running, though it could vanish), or you did get the lock (in which case the server is definitely not going to be running). maybe puck can expand on this. My own suspicion as to the cause of this is that it's a result of the following mode of operation of build-remote (https://git.lix.systems/lix-project/lix/src/370ac940dd7816ad4052fafa4e0f8d17784fa16b/src/build-remote/build-remote.cc#L294-L347): - Copy input paths (*does* keep these, but **only** if `addToStore` was called as above or if it was substituted. It skips copying paths that already exist on the destination (!), which means that those don't get added as temp roots) - Read derivation (uhhhhhhh why) - Call `buildDerivation` (if trusted. [2]) or if not trusted, `copyClosure` [3] then `buildPathsWithResults`. This procedure is **broken** because it does not add roots for paths that exist on the remote and thus aren't copied (and because they are not copied, they are not added as roots). I think the fix here might be to make `copyPaths` add temp roots for all paths to be copied, since it seems ridiculous to copy paths then have them vanish, right?? [2]: this seems badly named because it fails if you are not trusted user; furthermore that code is fucking suspicious; there's some dead paths in it implying untrusted users could previously call it [3]: why? we just copied the input paths there already, right? or is it that you need to copy the build closure if untrusted?
jade added the
bug
Area/store
Area/remote-builds
Area/protocol
labels 2024-09-09 03:16:15 +00:00
Owner

!!!! Excellent work jade!

!!!! Excellent work jade!
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lix-project/lix#505
No description provided.