[Nix#8758] Temporary network partitions cause extremely long substitution stalls, even when network is back quickly #127
Reference: lix-project/lix#127
Upstream-Issue: NixOS/nix#8758
Describe the bug
I just tried to rebuild my NixOS system and hit a short network partition (a few seconds, maybe). After a bit, substitution stalled completely, with no network activity on the system. I have observed this issue before, but this time I let it continue.
After a few minutes, the first wave of timeouts came in on 4 NARs (the system has 4 cores, so I assume those were all the NARs being downloaded at the time). There was still no network activity after the timeout. Only after three timeouts (which takes many minutes, mind you) did the NARs finally get retried, and that got the rebuild back on track.
Why is the timeout for individual NARs multiple minutes long when retries cost almost nothing?
Why isn't the substitution retried immediately after the first timeout? We have exponential back-off, so it's not like that'd overload anyone.
Steps To Reproduce
Failed sending data to the peer (55)
for 3/4 NARs currently being downloaded
Expected behavior
A NAR should time out after a few seconds at most and retry using the existing exponential back-off mechanism immediately after the timeout.
A network partition a couple seconds in length should not cause multiple minutes of network stall. I'd expect at most 30s of stall in such a situation.
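To make the expectation concrete, here is a rough Python sketch of the behaviour described above: a short per-attempt timeout followed by an immediate retry under exponential back-off. This is not how Nix implements substitution. Every name (`backoff_delays`, `worst_case_stall`, `fetch_with_retry`) and every number (3 s timeout, 5 attempts, 250 ms base delay) is a hypothetical chosen for illustration only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.25,
                   factor: float = 2.0, cap: float = 8.0) -> list[float]:
    """Delay before each retry; there is no delay before the first attempt."""
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


def worst_case_stall(attempts: int, per_attempt_timeout: float, **kw) -> float:
    """Upper bound on time spent if every attempt runs into its timeout."""
    return attempts * per_attempt_timeout + sum(backoff_delays(attempts, **kw))


def fetch_with_retry(fetch: Callable[[float], T], attempts: int = 5,
                     per_attempt_timeout: float = 3.0,
                     sleep: Callable[[float], None] = time.sleep, **kw) -> T:
    """Call fetch(timeout); on TimeoutError, back off and retry right away."""
    delays = backoff_delays(attempts, **kw)
    for i in range(attempts):
        try:
            return fetch(per_attempt_timeout)
        except TimeoutError:
            if i == attempts - 1:
                raise  # out of attempts, surface the timeout to the caller
            sleep(delays[i])
    raise ValueError("attempts must be at least 1")
```

With these illustrative values, `worst_case_stall(5, 3.0)` comes to 18.75 seconds, which stays under the 30 s bound mentioned above even if every single attempt times out.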
nix-env --version output
Additional context
Priorities
Add 👍 to issues you find important.