nix copy --to s3://... very slow, hangs #945

Closed
opened 2025-07-31 04:17:51 +00:00 by cobaltcause · 11 comments
Member

Description

I have a script that contains this:

if command -v nom &> /dev/null; then
    nom build "$@"
else
    nix build "$@"
fi

# Find all output paths of the installables and their build dependencies.
readarray -t derivations < <(nix path-info --derivation "$@")
readarray -t upload_paths < <(
    xargs \
        nix-store --query --requisites --include-outputs \
        <<< "${derivations[*]}"
)

echo "Found ${#upload_paths[@]} paths to upload"

if [ -z ${NIX_SIGNING_PRIVATE_KEY+x} ]; then
    echo "Signing key not found, skipping uploading to the binary cache"
    return
fi

# Sign the paths. We have to put the secret in an actual file because Lix
# doesn't like process substitution.
key_file="$(mktemp)"
echo "$NIX_SIGNING_PRIVATE_KEY" > "$key_file"
nix store sign --key-file "$key_file" --stdin --recursive \
    <<< "${upload_paths[*]}"
rm "$key_file"

dst="s3://grapevine-nix"
dst+="?scheme=https"
dst+="&endpoint=garage.computer.surgery"
dst+="&region=garage"
dst+="&multipart-upload=true"
dst+="&buffer-size=67108864"

# Upload the paths. Environment variable disables some AWS-specific stuff
# that causes a 10 second delay before uploading starts.
env AWS_EC2_METADATA_DISABLED=true \
    nix copy --stdin --to "$dst" <<< "${upload_paths[*]}"

For credentials, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set. No AWS configuration files exist. I'm copying to a self hosted Garage instance on the same network with a 2.5Gbps link.
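For reference, the environment amounts to just the two variables below before the script runs; the values here are placeholders, not the real credentials:

```bash
# Garage S3 credentials; placeholder values, the real ones come from the Garage key setup.
export AWS_ACCESS_KEY_ID="GK0123456789abcdef"
export AWS_SECRET_ACCESS_KEY="..."
# No ~/.aws/config or ~/.aws/credentials files exist on this machine.
```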

I was able to successfully run this with CppNix 2.28.4 which resulted in ~7900 objects getting stored in the bucket at a total of ~3GB, with peak transfer speeds of ~100Mbps according to btm.

Trying this with Lix, however, doesn't work very well. It uploads some paths successfully but gets stuck on others; if left running long enough, its own upload progress counter stops. The transfer rate is low (Kbps range) to nonexistent according to btm. Additionally, when Lix hangs like that, hitting control+c doesn't shut it down and exit, at least not within ~30 seconds; I always have to send a second control+c to kill it.

Uploading a single large file to a different bucket via awscli2 yields a throughput of ~850Mbps, so there seems to be a lot of room for improvement.
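The comparison was along these lines (a sketch; the test bucket name is a placeholder, and awscli2 is assumed to be configured with the same credentials and endpoint):

```bash
# Create a ~1 GiB test file and time a plain awscli2 upload to the same Garage instance.
dd if=/dev/urandom of=/tmp/testfile bs=1M count=1024
time aws s3 cp /tmp/testfile s3://some-test-bucket/testfile \
    --endpoint-url https://garage.computer.surgery
```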

Expected behavior

Lix should be able to upload to an S3 binary cache. It would be cool if it were faster than CppNix too.

nix --version output

$ nix --version
nix (Lix, like Nix) 2.94.0-dev
System type: x86_64-linux
Additional system types: i686-linux, x86_64-v1-linux, x86_64-v2-linux, x86_64-v3-linux, x86_64-v4-linux
Features: gc, signed-caches
System configuration file: /etc/nix/nix.conf
User configuration files: /home/charles/.config/nix/nix.conf:/etc/xdg/nix/nix.conf:/home/charles/.local/share/flatpak/exports/etc/xdg/nix/nix.conf:/var/lib/flatpak/exports/etc/xdg/nix/nix.conf:/home/charles/.nix-profile/etc/xdg/nix/nix.conf:/nix/profile/etc/xdg/nix/nix.conf:/home/charles/.local/state/nix/profile/etc/xdg/nix/nix.conf:/etc/profiles/per-user/charles/etc/xdg/nix/nix.conf:/nix/var/nix/profiles/default/etc/xdg/nix/nix.conf:/run/current-system/sw/etc/xdg/nix/nix.conf
Store directory: /nix/store
State directory: /nix/var/nix
Data directory: /nix/store/kcn0iyrzzb8hri3bx3nwgkidr8ym4mri-lix-2.94.0-dev/share

Additional context

My version of Lix is built from commit 2d0109898a.

Owner

#272 related

Member

Could you please try running with ?compression=none or ?compression=zstd?

I've seen nix copy --to completely grind to a halt before, especially on burstable VMs, due to the compression just being too heavy.

Could you also please try running this without multipart-upload=true? There might be a bug lurking there too.
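A sketch of what that might look like, reusing the bucket and endpoint from the script above (./result is a placeholder installable):

```bash
# Same destination, but with zstd (or none) instead of the default xz compression:
nix copy --to 's3://grapevine-nix?scheme=https&endpoint=garage.computer.surgery&region=garage&compression=zstd' ./result

# And, separately, the same URI without multipart-upload=true to see whether the hang goes away.
```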

Owner

So, I have some news on this topic: xz is obviously the bottleneck.

But there's a problem if you run with compression=none or compression=zstd: the progress bar only reports compression status, I think? It completes very quickly and then you are left with:

[1/0/1 copied (1024.0/1024.0 MiB)] copying path '/nix/store/m3rqzzc7kzsbqf4jdfrr4psyk4j47c51-testfile' to 's3://lix-perf-tests'

for a while.
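A rough way to reproduce this with a single large path (a sketch; the endpoint is a placeholder, the bucket name is the test bucket from the log above):

```bash
# Make a ~1 GiB file, add it to the local store, and copy it to the cache with compression disabled.
dd if=/dev/urandom of=testfile bs=1M count=1024
path=$(nix-store --add testfile)
nix copy --to 's3://lix-perf-tests?region=garage&endpoint=<garage-endpoint>&compression=none' "$path"
```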

Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/4503 ("libstore/binary-cache: default to zstd for compression")
  • commit message in cl/4514 ("libstore/s3: resolve completion status via the transfer status callback")
Author
Member

> Could you please try running with ?compression=none or ?compression=zstd?

Yeah, that speeds things up significantly (from ~200Mbps peaks with xz to ~1.6Gbps peaks with compression off).

> I've seen nix copy --to completely grind to a halt before, especially on burstable VMs, due to the compression just being too heavy.

I'm copying from my desktop to a Garage instance on my mini PC server over a 2.5Gbps LAN link, no VMs.

> Could you also please try running this without multipart-upload=true? There might be a bug lurking there too.

This problem seems to be unique to multipart uploads, AFAICT. I can't really just turn it off, because my garage/nginx setup doesn't like non-multipart uploads over a certain size: if I do turn it off, many paths upload successfully until it chokes and dies on an object that's too large.
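One way to confirm that size limit independently of Lix (a sketch; assumes awscli2 is pointed at the same endpoint and credentials):

```bash
# s3api put-object always performs a single PUT (no multipart), so an object over the
# garage/nginx limit should fail here the same way it does for Lix without multipart-upload.
dd if=/dev/urandom of=/tmp/bigobject bs=1M count=2048
aws s3api put-object \
    --bucket grapevine-nix \
    --key test/bigobject \
    --body /tmp/bigobject \
    --endpoint-url https://garage.computer.surgery
```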


Two new pieces of information from some testing I just did:

  1. compression=none&multipart-upload=true also starts hanging at some point, i.e. it sits there doing nothing, and according to btm no more data appears to be transferring.
  2. compression=none&multipart-upload=true eventually segfaults on its own after a single control+c, no need to send a second one (see the backtrace sketch below). Good and bad news lol.
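For the segfault in point 2, a backtrace would probably help; on a systemd system with coredumps enabled, something like this should capture it (a sketch, not specific to this setup):

```bash
# Reproduce the crash, then open the most recent core dump for the nix binary in gdb
# and print a full backtrace.
coredumpctl gdb nix
# inside gdb: bt full
```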
Owner

https://gerrit.lix.systems/c/lix/+/4504 fixes the crash. Now we need to examine/reproduce the hanging.

Owner

My new problem:

❯ ./outputs/out/bin/nix copy --to "s3://lix-perf-tests?region=garage&endpoint=s3.dc1.infra.lahfa.xyz&multipart-upload=true" /run/current-system
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
error (ignored): error: opening directory '/tmp/nix-shell.qkPnb0/nix-shell.5yJEZI/nix.w18cAo': Too many open files
error: creating temporary file '/tmp/nix-shell.qkPnb0/nix-shell.5yJEZI/nix.OQBr5u': Too many open files
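A sketch of one way to bump the limit for the current shell (the value is arbitrary; the hard limit may also need raising separately):

```bash
# Raise the per-process open file descriptor soft limit for this shell session.
ulimit -n 65536
```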
Owner

After bumping the open files limit:

❯ ./outputs/out/bin/nix copy --to "s3://lix-perf-tests?region=garage&endpoint=s3.dc1.infra.lahfa.xyz&multipart-upload=true" /run/current-system
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 200 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
AWS error '' (), will retry in 50 ms
AWS error '' (), will retry in 100 ms
AWS error '' (), will retry in 0 ms
error: AWS error: failed to upload 's3://lix-perf-tests/nar/0d1lypfsjgia2y179bynmmrl67nlwyca5jn53ppb3yxbr268zaby.nar.zst': Unable to parse ExceptionName: InvalidRequest Message: Bad request: signed header `transfer-encoding` is not present
raito self-assigned this 2025-10-31 13:21:45 +00:00
Owner

More work has been done. This error seems to be caused by the SDK itself, as Lix does not control the set of headers injected: transfer-encoding is on the SDK's skip list of headers but still ends up in the SignedHeaders list.

I will start debugging the SDK now.

Owner

Root cause for the hang: for multipart uploads, the completion status is sometimes never updated by the progress callback and is only reported via the transfer status update callback.

Author
Member

I tested https://gerrit.lix.systems/c/lix/+/4514 by cherry-picking it on top of 91867941fa (latest main at the time of writing) and my previously failing test case now works.
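For anyone who wants to test a pending CL the same way, the rough procedure is as follows (a sketch; the Gerrit remote URL and patchset number in the ref are assumptions, as is building via the flake's default package):

```bash
git clone https://git.lix.systems/lix-project/lix && cd lix
git checkout 91867941fa
# Gerrit change refs look like refs/changes/<last two digits>/<change number>/<patchset>.
git fetch https://gerrit.lix.systems/lix refs/changes/14/4514/1
git cherry-pick FETCH_HEAD
nix build
```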
