[Nix#8758] Temporary network partitions cause extremely long substitution stalls, even when network is back quickly #127

Open
opened 2024-03-16 06:45:03 +00:00 by lix-bot · 0 comments

Upstream-Issue: NixOS/nix#8758

Describe the bug

I just tried to rebuild my NixOS system and hit a short network partition (a few seconds, maybe). After a bit, substitution completely stalled with no network activity on the system. I have observed this issue before, but this time I let it continue.

After a few minutes, the first wave of timeouts came in on 4 NARs (the system has 4 cores, so I assume those were all the NARs being downloaded at that time). There was still no network activity after the timeout. Only after three timeouts (which takes many minutes, mind you) did the NARs finally get retried, which got the rebuild back on track.

Why is the timeout for individual NARs multiple minutes long when retries cost almost nothing?
Why isn't the substitution retried immediately after the first timeout? We have exponential back-off, so it's not like that'd overload anyone.
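
To put rough numbers on the two questions above, here is a minimal sketch of the arithmetic. The 300-second timeout, the 0.25-second base delay, and the doubling schedule are illustrative assumptions, not values taken from this report or confirmed against Nix's implementation; `worst_case_stall` is a hypothetical helper.

```python
def worst_case_stall(timeout_s: float, timed_out_attempts: int,
                     base_delay_s: float = 0.25) -> float:
    """Seconds spent before the first attempt that can succeed is started.

    Assumes every failed attempt waits out the full timeout, followed by an
    exponentially growing back-off delay (base_delay_s * 2**i). Both the
    schedule and the default values are assumptions for illustration.
    """
    waiting = timeout_s * timed_out_attempts
    backoff = sum(base_delay_s * 2 ** i for i in range(timed_out_attempts))
    return waiting + backoff


# Three timed-out attempts, as observed in the report.
slow = worst_case_stall(timeout_s=300, timed_out_attempts=3)  # assumed long timeout
fast = worst_case_stall(timeout_s=5, timed_out_attempts=3)    # "a few seconds"
print(f"long timeout: {slow:.2f}s, short timeout: {fast:.2f}s")
```

Under these assumptions the back-off delays add up to under two seconds in both cases, so the per-attempt timeout is what determines how long the stall lasts.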

Steps To Reproduce

  1. Initiate a nix-build which substitutes a bunch of paths
  2. Cause a short network partition (e.g. by turning the network connection off and on again a few seconds later)
  3. Observe no more network activity
  4. Observe a timeout after a few minutes
  5. Observe another set of timeouts on the same NARs after a few minutes
  6. Observe yet another set of timeouts of the same NARs after yet another few minutes
  7. Observe Failed sending data to the peer (55) for 3/4 NARs currently being downloaded
  8. Observe NARs actually being retried and network activity

Expected behavior

A NAR should time out after a few seconds at most and retry using the existing exponential back-off mechanism immediately after the timeout.

A network partition a couple seconds in length should not cause multiple minutes of network stall. I'd expect at most 30s of stall in such a situation.
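
The expected behavior could look roughly like the sketch below: a short per-attempt timeout, with the back-off delay starting as soon as an attempt times out. This is not Nix's code. `fetch_with_retry`, `flaky_fetch`, the 5-second timeout, the attempt limit, and the delay values are all hypothetical, and the download is simulated rather than performed over the network.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[float], bytes],
                     timeout_s: float = 5.0,
                     max_attempts: int = 5,
                     base_delay_s: float = 0.25,
                     sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(timeout_s), retrying on TimeoutError with exponential back-off."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch(timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last timeout
            # Back off right after the timeout: 0.25s, 0.5s, 1s, ...
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated download that stalls twice (the partition), then succeeds.
calls = {"n": 0}

def flaky_fetch(timeout_s: float) -> bytes:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TimeoutError("simulated stall")
    return b"nar-bytes"


delays = []  # record the back-off delays instead of actually sleeping
data = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(data, delays, calls["n"])
```

With a 5-second timeout and these delays, two stalled attempts would cost roughly 11 seconds before the third attempt starts, which is inside the 30-second bound suggested above.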

nix-env --version output

nix-env (Nix) 2.15.1


lix-bot added the bug and imported labels 2024-03-16 06:45:03 +00:00
Reference: lix-project/lix#127