Random curl error: "A value or data field grew larger than allowed" on substitution can cause substitution failure #662
I have seen these errors randomly and thought nothing of them, since they always resolved themselves for me, but it seems like in certain conditions they can consistently break and actually cause substitution failures.
The errors in question look like:
@lilyinstarlight managed to isolate it in a GitHub Actions run with daemon 2.92.0: https://github.com/lilyinstarlight/foosteros/actions/runs/13229359681/job/36924342389#step:6:11193
This is believed not to affect 2.91.x, only 2.92.x and newer, and was introduced in 4ae6fb5a8f.
can we have that reproducing case run with `CURL_DEBUG=all`? i have the feeling that this is http2 related and caused by curl transfer buffers filling up due to our gratuitous use of multi handles for stuff that shouldn't live in multi handles

This issue was mentioned on Gerrit on the following CLs:
i strongly suspect it to be this dynbuf code: 8289ac1be6/lib/cw-out.c (L332-L334)
https://github.com/curl/curl/pull/5966 suspicious
yep. that should only trigger on http2 transfers since http1 doesn't have multiple substreams in an ordered bytestream transport (which is a completely fucked up concept to begin with, but here we are)
Reproducer:

Thanks to horrors for telling me what would probably cause it. It happens when you have two transfers on the same FileTransfer to the same host (and thus on the same connection) and one of them is paused. It's made worse by compressing 128MB of zeros.
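This is not the original reproducer, but a minimal sketch of the triggering pattern described above, assuming libcurl's multi interface: two easy handles forced onto one HTTP/2 connection, with one stream pausing itself as soon as data arrives so received bytes pile up in curl's internal buffers. The URL is a placeholder.

```cpp
#include <curl/curl.h>
#include <initializer_list>

// Write callback for the "stuck" stream: pause on first data, so everything
// the server keeps sending for this stream lands in curl's cw-out buffer.
static size_t pause_once(char *, size_t size, size_t nmemb, void *) {
    static bool paused = false;
    if (!paused) { paused = true; return CURL_WRITEFUNC_PAUSE; }
    return size * nmemb;
}

// The other stream drains normally, keeping the shared connection busy.
static size_t sink(char *, size_t size, size_t nmemb, void *) {
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();
    curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);

    const char *url = "https://cache.example/big-compressible.nar"; // placeholder
    CURL *a = curl_easy_init(), *b = curl_easy_init();
    curl_easy_setopt(a, CURLOPT_WRITEFUNCTION, pause_once);
    curl_easy_setopt(b, CURLOPT_WRITEFUNCTION, sink);
    for (CURL *h : {a, b}) {
        curl_easy_setopt(h, CURLOPT_URL, url);
        curl_easy_setopt(h, CURLOPT_PIPEWAIT, 1L); // force both onto one connection
        curl_multi_add_handle(multi, h);
    }

    int running = 1;
    while (running) {
        curl_multi_perform(multi, &running);
        curl_multi_poll(multi, nullptr, 0, 100, nullptr);
        // With affected curl versions, the paused stream's buffered data can
        // exceed the dynbuf limit in cw-out.c and the transfer aborts with
        // "A value or data field grew larger than allowed".
    }
    return 0;
}
```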
Submitted as a curl bug; I am not sure how we can fix this properly in the interim since we do need flow control: https://github.com/curl/curl/issues/16280
One thing that will successfully work around this, if it is actively being a problem, is `http2 = false` in the configuration, but this will almost certainly make querying paths slower.
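For reference, that workaround in nix.conf (assuming the usual /etc/nix/nix.conf location) is just:

```
# Disable HTTP/2 so transfers no longer share one multiplexed connection.
# Works around the buffer overflow at the cost of slower path querying.
http2 = false
```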
I just had the silliest patch idea for this while in the process of nearly falling asleep: why don't we just completely disable transfer compression for downloads? This isn't such a ridiculous idea, because what does Lix actually download?
If we don't completely disable transfer compression for downloads we could at least reimplement lix-side download decompression, which would be a lot more thinking and effort for something that curl will plausibly fix within a year. But basically the fix here is that curl needs to not be doing compression for the time being since it has the broken interactions we see here.
we could indeed do that. there are still corner cases where it'll break, but on average it seems preferable to crashing completely :/
@lilyinstarlight try this overlay after your lix overlay which applies the fix from https://github.com/curl/curl/pull/16296:
If this works then we can temporarily ship it in the lix overlay or in nixpkgs or… something. idk how we can ship this exactly, because really we would like to wait for a merge in curl. But that's probably not going to take that long.
Okay, so this has been "fixed" upstream, but our testing shows the fix causes corrupt transfers, which, ummmmm, concerning. i think we want their compression code turned off for the time being.
oh fire and wind we're not putting back the decompression sources, especially not the libarchive decompression source >,<
yeah no, we can just have a very limited set of supported ones or just disable accepting transfer compression completely
FYI @raito I looked at https://everything.curl.dev/internals/content-encoding.html#the-libcurl-interface

From some examination I found that we can have `Content-Encoding: zstd` narinfos, so we do actually have to replace the curl decompressor with something. horrors says there is something they have lying around that might be helpful.

I can try to look into this on Monday.
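This is not Lix's actual implementation, but a minimal sketch of what replacing the curl decompressor could look like: disable curl's own content decoding via `CURLOPT_ACCEPT_ENCODING` and run the body through libzstd's streaming API in the write callback. URL and names are illustrative.

```cpp
#include <curl/curl.h>
#include <zstd.h>
#include <string>

struct ZstdSink {
    ZSTD_DStream *ds = ZSTD_createDStream();
    std::string out; // decompressed bytes
};

// curl write callback: feed raw (still-compressed) bytes through zstd.
// A real implementation would dispatch on the Content-Encoding response
// header instead of decompressing unconditionally.
static size_t on_body(char *data, size_t size, size_t nmemb, void *userp) {
    auto *sink = static_cast<ZstdSink *>(userp);
    ZSTD_inBuffer in{data, size * nmemb, 0};
    char buf[1 << 16];
    while (in.pos < in.size) {
        ZSTD_outBuffer outb{buf, sizeof(buf), 0};
        if (ZSTD_isError(ZSTD_decompressStream(sink->ds, &outb, &in)))
            return 0; // abort the transfer on corrupt input
        sink->out.append(buf, outb.pos);
    }
    return size * nmemb;
}

int main() {
    ZstdSink sink;
    CURL *h = curl_easy_init();
    curl_easy_setopt(h, CURLOPT_URL, "https://cache.example/foo.narinfo"); // placeholder
    // NULL disables both the Accept-Encoding request header and curl's own
    // response decoding, sidestepping the buggy buffering path entirely.
    curl_easy_setopt(h, CURLOPT_ACCEPT_ENCODING, (char *)nullptr);
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, on_body);
    curl_easy_setopt(h, CURLOPT_WRITEDATA, &sink);
    CURLcode rc = curl_easy_perform(h);
    curl_easy_cleanup(h);
    ZSTD_freeDStream(sink.ds);
    return rc == CURLE_OK ? 0 : 1;
}
```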
Fix: https://gerrit.lix.systems/c/lix/+/2780
we still see this on 423a343937
ummmmmmm. how?! does it also overrun the buffer if it's not compressed, just with lower likelihood? misbehaving server?
no clue. curl isn't decompressing anything (otherwise the test cases we added would fail), but it's still failing :(
we've confirmed that curl is indeed misbehaving by shoving a 1GiB fragment of /dev/urandom onto a webserver that delivers it as `content-encoding: zstd`. the file is larger than its contents and curl is explicitly instructed not to decode anything (by setting `CURLOPT_ACCEPT_ENCODING = NULL` as per the documentation), and the error still occurs. we haven't checked the patched curl yet

we've tested curl HEAD, and it's even worse. the specific errors you saw originally are gone with lix HEAD, but curl can hang itself if http2 streams have wildly different receive capacities. for example this test will crash with an internal error when pointed at an http2 server that serves 1GiB of junk:
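The test itself is not reproduced above; as a hypothetical stand-in with the same shape, two multiplexed streams whose consumers drain at very different rates could be set up like this, using `CURLOPT_MAX_RECV_SPEED_LARGE` to throttle one stream in place of the original's per-thread usleep pacing (URL and rate are placeholders):

```cpp
#include <curl/curl.h>
#include <initializer_list>

static size_t sink(char *, size_t size, size_t nmemb, void *) {
    return size * nmemb; // discard the body
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();
    curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);

    const char *url = "https://localhost:8443/1GiB-of-junk"; // placeholder
    CURL *fast = curl_easy_init(), *slow = curl_easy_init();
    // Throttle one stream to ~1MiB/s so the two streams have wildly
    // different receive capacities on the shared HTTP/2 connection.
    curl_easy_setopt(slow, CURLOPT_MAX_RECV_SPEED_LARGE, (curl_off_t)(1 << 20));
    for (CURL *h : {fast, slow}) {
        curl_easy_setopt(h, CURLOPT_URL, url);
        curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, sink);
        curl_easy_setopt(h, CURLOPT_PIPEWAIT, 1L); // share one connection
        curl_multi_add_handle(multi, h);
    }

    int running = 1;
    while (running) {
        curl_multi_perform(multi, &running);
        curl_multi_poll(multi, nullptr, 0, 100, nullptr);
    }
    return 0;
}
```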
Tried the following setup:
Caddyfile:
Web root:
Build curl with this in a curl checkout, revision 8a45c2851aeb6f3ec18ad5c39c4042ab516891dd:
Lix version: 5a7e9e1746
Lix setup:
Run with:
Also tried with curl 438dd08b546a5690126c7355ae67a967a8475eae (from the same day you posted) and seemingly the same behaviour was observed. I am not sure what I am doing wrong here and why I cannot repro it.
I tried changing the usleep of the other thread to 100us so it is going at 10MB/s instead of 1MB/s. I may however have observed the hang. Let me see what I can get out of that line of questioning; maybe I can actually observe that?
okay yes, I have this problem observable with `CURL_DEBUG=all`: the second connection seems to be stalled altogether:

It looks like it could be either a deadlock in Lix itself, or curl eating unpauses.
I think it's curl eating unpauses.
afawr it's not so much that curl is eating unpauses but that it's internally pausing the wrong thing. when we looked it seemed like it was pausing the entire http2 connection once its buffers are full, or something like that?
The observed symptom I have is that it's getting into a state where it's forgotten to send a window update for the second stream and then sits there like a catgirl waiting to be pet, expecting the server to send it more bytes, but will never receive any because it didn't request any. So it's a protocol level deadlock. The second stream is not paused at the api level but they didn't send any window updates when the internal pause buffer got emptied.
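As a toy model of that deadlock (not curl's actual internals, just the flow-control accounting that goes wrong): the receiver's advertised window hits zero while the stream is paused, and if draining the pause buffer never triggers a WINDOW_UPDATE, the sender stays blocked forever.

```cpp
#include <cstdio>

struct Stream {
    int window = 65535; // bytes the peer is still allowed to send us
    int buffered = 0;   // bytes parked in the pause buffer, not yet consumed
    bool paused = true;
};

int main() {
    Stream s;

    // Server sends until our advertised window is exhausted.
    while (s.window > 0) {
        int chunk = 16384 < s.window ? 16384 : s.window;
        s.window -= chunk;
        s.buffered += chunk; // paused: data piles up instead of being consumed
    }

    // Application unpauses and drains the buffer...
    s.paused = false;
    s.buffered = 0;
    // ...but no WINDOW_UPDATE is ever sent here, so s.window stays 0:
    // we wait for bytes the server is not allowed to send. Deadlock.
    std::printf("window=%d buffered=%d -> deadlock\n", s.window, s.buffered);
    return 0;
}
```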
oh, excellent. why does lix expose so many curl bugs in a part of the library that should be stable 🫠
FILED: https://github.com/curl/curl/issues/16955
okay a fix is submitted for review. now i need to figure out who to poke to make sure it gets into nixpkgs and doesn't cause a widespread regression.
@lilyinstarlight would you be able to test the fixed Curl on your setup and see if it still has the issue?
https://github.com/NixOS/nixpkgs/pull/396200#issuecomment-2795944006
Requested there that nixpkgs backport the patch. I have been running it on my machines for the last week and have not noticed issues.
I think this bug is gone.