nix copy --to s3://... very slow, hangs #945
Reference
lix-project/lix#945
Description
I have a script that contains this:
For credentials, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set. No AWS configuration files exist. I'm copying to a self-hosted Garage instance on the same network with a 2.5Gbps link.
I was able to successfully run this with CppNix 2.28.4, which resulted in ~7900 objects getting stored in the bucket at a total of ~3GB, with peak transfer speeds of ~100Mbps according to btm.
Trying this with Lix, however, doesn't seem to work very well. It can successfully upload some stuff, but it seems to get stuck on other stuff. If left running long enough, its own upload progress counter will stop. Transfer rate is low (Kbps range) to nonexistent according to btm. Additionally, when Lix hangs like that, hitting control+c doesn't successfully shut down and exit Lix, at least not within ~30 seconds; I always have to send a second control+c to kill it.
Uploading a single large file to a different bucket via awscli2 yields a throughput of ~850Mbps, so there's a lot of room for improvement it seems.
Expected behavior
Lix should be able to upload to an S3 binary cache. It would be cool if it were faster than CppNix too.
nix --version output
Additional context
My version of Lix is built from commit 2d0109898a.
#272 related
Could you please try running with ?compression=none or compression=zstd? I've seen nix copy --to completely grind to a halt before, especially on burstable VMs, due to the compression just being too heavy.
Could you also please try running this without multipart-upload=true? There might be a bug lurking there too.
So, I have some news on this topic: xz is the bottleneck obviously.
But, there's a problem if you run with compression=none or compression=zstd. This problem is that the progress bar only reports compression status I think? It completes very quickly and then you are left with:
for a while.
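To keep the combinations being tested straight, here is a minimal sketch that only builds the store URL strings for the compression and multipart-upload parameters mentioned above. The bucket name example-cache and the helper s3_store_url are placeholders for illustration, not taken from the report, and the sketch does not invoke nix itself.

```python
from urllib.parse import urlencode


def s3_store_url(bucket: str, params: dict[str, str]) -> str:
    """Build an s3:// store URL, appending query parameters if any are given."""
    query = urlencode(params)
    return f"s3://{bucket}" + (f"?{query}" if query else "")


# "example-cache" is a hypothetical bucket name used only for this sketch.
variants = [
    s3_store_url("example-cache", {"compression": "none", "multipart-upload": "true"}),
    s3_store_url("example-cache", {"compression": "zstd", "multipart-upload": "true"}),
    s3_store_url("example-cache", {"compression": "none"}),
]

for url in variants:
    print(url)
```

Each printed URL would then be the argument to nix copy --to for one test run.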
This issue was mentioned on Gerrit on the following CLs:
Yeah that speeds things up significantly. (200Mbps peaks with xz to 1.6Gbps peaks with compression off.)
I'm copying stuff from my desktop to a garage instance on my mini PC server over a 2.5Gbps LAN link, no VMs.
Yeah I mean I think this problem is unique to multipart uploads AFAICT. I can't really just turn it off because my garage/nginx setup doesn't like non-multipart uploads over a certain size. If I do turn it off, I see many paths upload successfully until it chokes and dies on an object that's too large.
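For readers unfamiliar with why a large object needs multipart upload at all, here is a rough sketch of how an object gets split into parts. It assumes Amazon S3's documented limits (parts of at least 5 MiB except the last, at most 10,000 parts); Garage or the nginx proxy in front of it may enforce different limits, and plan_parts is a hypothetical helper, not how Lix or the AWS SDK plans its uploads.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(object_size: int, part_size: int = MIN_PART_SIZE) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples covering the whole object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the assumed 5 MiB minimum")
    count = -(-object_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return [
        (n + 1, n * part_size, min(part_size, object_size - n * part_size))
        for n in range(count)
    ]


# A 12 MiB object becomes two full 5 MiB parts plus a 2 MiB tail.
for part in plan_parts(12 * MIB):
    print(part)
```

Each part is a separate request, so no single request exceeds the part size, which is why a proxy with a body size limit can accept multipart uploads while rejecting one large PUT.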
Two new pieces of information from some testing I just did:
- compression=none&multipart-upload=true also starts hanging at some point, i.e. it sits there doing nothing, and according to btm it's likely that no more transfer is happening.
- compression=none&multipart-upload=true eventually segfaults on its own after a single control+c, no need to send a second one. Good and bad news lol.
https://gerrit.lix.systems/c/lix/+/4504 fixes the crash. We now need to examine/reproduce the hang.
My new problem:
after bumping open files limits
More work has been done. This error seems to be caused by the SDK itself as Lix does not control the set of headers injected.
transfer-encoding is on the skip list of headers but still ends up in the SignedHeaders list.
I will start debugging the SDK now.
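To illustrate the behaviour being expected here, below is a minimal sketch of how a SigV4-style SignedHeaders value is formed: lowercase header names, sorted, joined with semicolons, with skipped headers left out. This is not the SDK's code; the function signed_headers and the example header values are assumptions made for illustration only.

```python
def signed_headers(headers: dict[str, str], skip: set[str]) -> str:
    """Return the SignedHeaders string, omitting any header on the skip list."""
    names = {name.lower() for name in headers}
    skipped = {name.lower() for name in skip}
    return ";".join(sorted(names - skipped))


# Hypothetical request headers, not captured from a real upload.
request_headers = {
    "Host": "garage.example",
    "X-Amz-Date": "20250101T000000Z",
    "X-Amz-Content-Sha256": "UNSIGNED-PAYLOAD",
    "Transfer-Encoding": "chunked",
}

print(signed_headers(request_headers, skip={"transfer-encoding"}))
```

With a working skip list, transfer-encoding should be absent from the result. The report above is that it shows up in SignedHeaders anyway.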
Root cause for the hang: for multipart uploads, the completion status is sometimes not delivered through a progress callback, only through a transfer status update callback.
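The failure mode can be sketched with a toy model. It is neither Lix's code nor the AWS SDK's API: Transfer, Status, and both waiter classes are hypothetical names, and the actual fix is in the linked CL. It only shows why a waiter that checks for completion in the progress callback alone can wait forever.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Transfer:
    total: int
    sent: int = 0
    status: Status = Status.IN_PROGRESS


class ProgressOnlyWaiter:
    """Marks the upload done only from the progress callback."""

    def __init__(self) -> None:
        self.done = False

    def on_progress(self, transfer: Transfer) -> None:
        if transfer.status is Status.COMPLETED:
            self.done = True

    def on_status(self, transfer: Transfer) -> None:
        pass  # status updates are ignored, which is the bug being modelled


class BothCallbacksWaiter(ProgressOnlyWaiter):
    """Also checks for completion when the status callback fires."""

    def on_status(self, transfer: Transfer) -> None:
        if transfer.status is Status.COMPLETED:
            self.done = True


def simulate(waiter: ProgressOnlyWaiter) -> bool:
    transfer = Transfer(total=10)
    # All bytes are sent; the last progress callback fires before the status flips.
    transfer.sent = transfer.total
    waiter.on_progress(transfer)
    # Completion is reported afterwards, through the status callback only.
    transfer.status = Status.COMPLETED
    waiter.on_status(transfer)
    return waiter.done


print(simulate(ProgressOnlyWaiter()))   # False: this waiter would hang
print(simulate(BothCallbacksWaiter()))  # True
```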
I tested https://gerrit.lix.systems/c/lix/+/4514 by cherry-picking it on top of 91867941fa (latest main at the time of writing), and my previously failing test case now works.