Lix is hitting fetcher-cache-v1.sqlite too hard under mass concurrency #1122

Open
opened 2026-02-06 17:01:08 +00:00 by raito · 3 comments
Owner

Describe the bug

When Lix is fetching hundreds of copies of the same Flake input, the SQLite cache (fetcher-cache-v1.sqlite) is put under massive pressure and can throw errors in various places of the fetching code. This leads to fatal failures in situations where the operation could instead be retried or paused gracefully.

Steps To Reproduce

  1. Go to the Lix repo.
  2. Empty or block out your ~/.cache/nix/fetcher-cache-v1.sqlite (I cleared mine)
  3. nix-shell -p lixPackageSets.latest.nix-eval-jobs
  4. nix-eval-jobs --gc-roots-dir /tmp/somewhere/gcroots --force-recurse --max-memory-size 4096 --workers 96 --flake "git+file://$(pwd)?rev=56988d860593a5fd8153d02a0ca5469508378626#hydraJobs"

The exact number of workers is not rocket science: you need enough concurrency, but just below the number that causes the daemon to reject your connections. 96 workers on my AMD Ryzen 9 7900X 12-Core Processor causes it to occur.

Expected behavior

Retries or self-pacing.

nix --version output

Reported to occur on 2.94.0 by @lheckemann.
Reproduced using nix-eval-jobs from 2.94.0; the code that runs the Flake fetching is independent of the daemon (I believe?), so 2.94.0.

Additional context

I believe that the error occurs exactly here:

        if (!shallow)
            infoAttrs.insert_or_assign(
                "revCount",
                std::stoull(TRY_AWAIT(runProgram(
                    "git",
                    true,
                    {"-C",
                     repoDir,
                     "--git-dir",
                     gitDir,
                     "rev-list",
                     "--count",
                     input.getRev()->gitRev()}
                )))
            );

        if (!_input.getRev())
            getCache()->add(
                store,
                unlockedAttrs,
                infoAttrs,
                storePath,
                false);

// here |
//      v
        getCache()->add(
            store,
            getLockedAttrs(),
            infoAttrs,
            storePath,
            true);
Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/5081 ("libfetchers/cache: retry inserting entries into the fetcher cache")
  • commit message in cl/5438 ("libfetchers: lock inputs before fetching them")
  • comment in cl/5438 ("libfetchers: lock inputs before fetching them")
  • commit message in cl/5468 ("tests/functional2: add test for concurrent git fetching")
raito added this to the 2.96 milestone 2026-03-01 11:22:27 +00:00
Member

This seems to happen occasionally under less extreme concurrency too.

Member

I found a simpler/faster/less noisy reproducer.

With the following flake.nix:

{
  inputs.dummy = {
    url = "git+file:///home/linus/projects/lix?rev=49a385096e08b42277b7105d5d8d1e0e62b6b7a4";
    flake = false;
  };
  outputs = { self, dummy }: {};
}

(adjust the path to your lix checkout accordingly)

Run:

rm ~/.cache/nix/fetcher-cache-v1.sqlite ; seq 200 | xargs -P0 -I{} nix flake lock --store /tmp --output-lock-file /tmp/{}.lock

On my machine, this triggered the bug 36 out of 100 times, taking ~0.5s each time.

EDIT: make sure the flake.nix is in a git repo, as there are separate code paths for fetching the root flake and the dummy input that can both run into this race condition.
