UAF in nix copy #618
Describe the bug

nix copy crashes randomly on large copies (a couple of files copy just fine). Running with ASAN yields this trace, on 3413ab5629 (today's main); it originally crashed on nixos 24.11's lix, 2.91.1.

Steps To Reproduce

Expected behavior

Shouldn't crash.

nix --version output

Reproduced on 3413ab5629 (today's main) and nixos 24.11's lix, 2.91.1.

Additional context

None at this time.
So the problem seems to be with the addMultipleToStore thread pool in lix/libstore/store-api.cc...
Since it's a data race I tried building with TSan as well, and attached the log with its many, many warnings for anyone who'd like to try to decipher this; I found it slightly easier to read, but perhaps that's just because I had looked at the code in between.
The race would be between the Graph destructor (from leaving processGraph?) and graph->rrefs[ref].insert(node); in one of the threads. The latter is done under graph_.lock() and the destructor obviously doesn't take it, but that shouldn't matter anymore, because my understanding of the thread pool after a quick read is that it should no longer be running after pool.process()...
OTOH, TSan is clear that the main thread is running the destructor at this point, so what we need to look for is why a thread would still be running after pool.process() - that would obviously cause problems. Any ideas?
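For readers following along, here is a minimal sketch of the shape of the code being discussed, with paraphrased names rather than the actual lix sources: the graph bookkeeping is a local of processGraph, guarded by a mutex (standing in for lix's Sync wrapper), worker tasks mutate it under that lock, and it is destroyed when processGraph returns. The reported UAF is a worker performing that mutation after the destructor has already run.

```cpp
// Paraphrased sketch, not the actual lix sources: the graph state is a local
// of processGraph, guarded by a mutex (standing in for lix's Sync<T>), and
// worker tasks enqueued on the pool mutate it under that lock. Nothing in the
// destructor takes the lock, so a worker that is still alive when the scope
// ends races with ~Graph.
#include <map>
#include <mutex>
#include <set>
#include <string>

struct Graph {
    std::set<std::string> left;                      // nodes still to process
    std::map<std::string, std::set<std::string>> refs, rrefs;
};

struct GuardedGraph {                                // stand-in for Sync<Graph>
    std::mutex m;
    Graph g;
};

void processGraphSketch(/* ThreadPool & pool, ... */) {
    GuardedGraph graph_;                             // lives only for this call

    auto worker = [&](const std::string & node, const std::string & ref) {
        std::lock_guard<std::mutex> lock(graph_.m);  // workers take the lock...
        graph_.g.rrefs[ref].insert(node);            // ...the destructor does not
    };
    (void) worker;   // in the real code this body is enqueued on the thread pool

    // pool.process();  // expected to wait for every worker before returning
}   // graph_ destroyed here; any worker still running now is a use-after-free
```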
(FWIW this also seems to affect nix, so that is not a new lix bug)
It should be impossible for a thread to still be running after pool.process(), since I think the pool will have joined all its threads by then?
Nope. If we never reach process(), because an enqueued item throws quickly enough that a later enqueue() throws (since the pool is shutting down), then we skip process() altogether and destroy the state while the pool, which is an outside resource, lives on.
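To make that ordering concrete, here is a small self-contained demo with toy stand-ins; ToyPool and the other names below are hypothetical, not the lix ThreadPool API. The caller owns the pool, so its threads are only joined when the caller's scope unwinds; if the processGraph-like function throws before its wait step, its local state dies while a worker is still running.

```cpp
// Toy demo of the ordering described above (hypothetical names, not the lix
// ThreadPool): the pool is a caller-owned resource, so its threads are only
// joined when the caller's scope unwinds. If the processGraph-like function
// throws before its wait step, its local state is destroyed while a worker is
// still running; in lix that worker would be touching the dead graph state.
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <thread>
#include <vector>

struct ToyPool {
    std::vector<std::thread> workers;
    ~ToyPool() {                                  // threads joined only here,
        for (auto & t : workers) t.join();        // i.e. in the caller's scope
    }
};

void processGraphLike(ToyPool & pool) {
    struct LocalState {
        ~LocalState() { std::puts("local graph state destroyed"); }
    } graph_;

    pool.workers.emplace_back([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::puts("worker still running");        // in lix: mutates graph_, UAF
    });

    // An enqueue failing at this point makes us leave before the wait step:
    throw std::runtime_error("pool shutting down");
    // the pool.process() equivalent would go here; it is never reached
}

int main() {
    try {
        ToyPool pool;
        processGraphLike(pool);
    } catch (const std::exception & e) {
        std::printf("caught: %s\n", e.what());
    }
    // Typically prints "local graph state destroyed" first, then (after the
    // sleep) "worker still running", then "caught: pool shutting down".
}
```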
This issue was mentioned on Gerrit on the following CLs:
Indeed, there appears to be an error on one of the enqueues.
Running with this patch prints the caught error message and no longer aborts:
I've pushed (mostly) this to https://gerrit.lix.systems/c/lix/+/2355
One question though: I'm not sure whether an error can happen between the last enqueue and the final pool.process; it might be better to have a construct like "try/finally", but I don't know if C++ has one.
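Aside: C++ has no try/finally statement; the usual substitute is a destructor-based scope guard, which runs its cleanup on every exit path, including exceptions. The snippet below is a generic sketch of that idiom, not the definition of lix's Finally helper mentioned in the next paragraph.

```cpp
// Generic scope-guard sketch (not lix's actual Finally from libutil): the
// cleanup callable runs in the destructor, so it executes on normal return
// and during stack unwinding alike. One caveat that matters here: a cleanup
// that can itself throw (like pool.process()) is awkward to run from a
// destructor, since throwing during unwinding terminates the program.
#include <utility>

template<typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : fn(std::move(f)) {}
    ~ScopeGuard() { fn(); }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard & operator=(const ScopeGuard &) = delete;
private:
    F fn;
};

// Usage sketch:
//   ScopeGuard shutdownPool([&] { /* e.g. shut the pool down */ });
//   ... enqueue work items that may throw ...
```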
I saw the Finally type defined in libutil, but that doesn't appear to be appropriate; trying it yields this:
OTOH, given that processGraph is only called from two places, and both of them just create a thread pool in a local variable that is used only for processGraph, we could simply move the pool declaration inside processGraph and get its destructor to run first, since the pool's destructor runs shutdown() and waits for the threads to stop.
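A simplified sketch of the ordering this would give; the names are paraphrased and this is not the actual cl/2355 diff. With the pool declared as a local after the graph state, C++ destroys locals in reverse order of construction, so the pool is shut down and its threads joined before the state they touch goes away, even if something throws early.

```cpp
// Paraphrased sketch of the destruction ordering, not the actual cl/2355 diff.
// Locals are destroyed in reverse order of construction, so declaring the pool
// after the graph state means ~ThreadPoolLike (shutdown + join) always runs
// before ~GraphState, even when an exception leaves the function early.
#include <cstdio>

struct GraphState {
    ~GraphState() { std::puts("graph state destroyed (second)"); }
};

struct ThreadPoolLike {                 // stand-in for the real ThreadPool
    ~ThreadPoolLike() { std::puts("pool shut down, workers joined (first)"); }
};

void processGraphSketch() {
    GraphState graph_;                  // constructed first, destroyed last
    ThreadPoolLike pool;                // constructed second, destroyed first

    // ... enqueue work items capturing &graph_, then run pool.process() ...
}   // even on an early throw, the pool is torn down before graph_

int main() { processGraphSketch(); }
```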
Fixed by cl/2355, thanks @martinetd.