lix in lix's container images fails to download things #920

cobaltcause · 2025-07-16T23:12:06Z

cobaltcause commented

2025-07-16 23:12:06 +00:00

I'm using lix's container images (git.lix.systems/lix-project/lix:...) for GitLab CI, and I get errors like this with newer versions:

fetching github input 'github:NixOS/nixpkgs/9807714d6944a957c2e036f84b0ff8caf9930bc0'
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5002 milliseconds (curl error code=28); retrying in 338 ms (attempt 1/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 558 ms (attempt 2/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 1183 ms (attempt 3/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 2493 ms (attempt 4/5)
error:
       … while fetching the input 'github:NixOS/nixpkgs/9807714d6944a957c2e036f84b0ff8caf9930bc0'
       error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5002 milliseconds (curl error code=28)

Older versions work, though. Here's an example: https://gitlab.computer.surgery/charles/derail/-/merge_requests/10#note_3645

Here are the versions I tried:

| Version | Affected |
|-|-|
| 2.93.2 | Yes |
| 2.93.1 | Yes |
| 2.93.0 | Yes |
| 2.92.3 | No, works fine |
| 2.92.2 | No, works fine |

It's entirely possible I'm doing something wrong, I guess, but it seems odd that changing the lix container version is what causes this error.

cobaltcause added the

bug

label

2025-07-16 23:12:06 +00:00

cobaltcause commented

2025-07-17 00:20:00 +00:00

@raito helped me troubleshoot this out of band. What ended up revealing the problem to us was running `while true; do foo="$(pgrep '^nix$')"; if [[ -n "$foo" ]]; then strace -yy -fp "$foo" -o log.strace; break; fi; done` and then running `docker run -e NIX_REMOTE=local -it git.lix.systems/lix-project/lix:2.93.2 nix shell nixpkgs#hello --extra-experimental-features 'nix-command flakes'` immediately after. Raito noticed the following bits of the strace:

288944 read(14</etc/resolv.conf>, "# Generated by Docker Engine.\n# "..., 4096) = 321
###### [etc]
288944 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 14<UDP:[546453]>
288944 setsockopt(14<UDP:[546453]>, SOL_IP, IP_RECVERR, [1], 4) = 0
288944 connect(14<UDP:[546453]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.128.0.1")}, 16) = 0
288944 poll([{fd=14<UDP:[546453]>, events=POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
288944 sendmmsg(14<UDP:[546453]>, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\250\273\1\0\0\1\0\0\0\0\0\0\3api\6github\3com\0\0\1\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\343\265\1\0\0\1\0\0\0\0\0\0\3api\6github\3com\0\0\34\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}], 2, MSG_NOSIGNAL) = 2

which led to curiosity as to what /etc/resolv.conf contained:

# docker run -it git.lix.systems/lix-project/lix:2.93.2 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 10.128.0.1
nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []

On my desktop, the nix shell command inside docker works fine, so here's that for comparison:

$ docker run -it git.lix.systems/lix-project/lix:2.93.2 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []

So, so far, it seems like something is adding that extra nameserver at the top, which doesn't respond for github.com and then the other nameservers are skipped and it just gives up.
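
To compare the two `/etc/resolv.conf` outputs above mechanically, here is a minimal C sketch that lists the `nameserver` entries in file order. `parse_nameservers` and `MAX_ADDR_LEN` are hypothetical names introduced for illustration only; they are not part of Lix, Docker, or glibc, and the parser handles nothing beyond plain `nameserver <address>` lines.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ADDR_LEN 64 /* hypothetical buffer size, chosen for illustration */

/* Collect up to `max` nameserver addresses, in file order, from the text of a
 * resolv.conf file. Returns the number of addresses stored in `out`. */
static size_t parse_nameservers(const char *text, char out[][MAX_ADDR_LEN], size_t max)
{
    size_t count = 0;
    const char *line = text;

    while (*line != '\0' && count < max) {
        size_t len = strcspn(line, "\n");
        char buf[256];
        char addr[MAX_ADDR_LEN];

        /* Copy one line (truncated if overly long) so parsing stays inside it. */
        size_t n = len < sizeof buf - 1 ? len : sizeof buf - 1;
        memcpy(buf, line, n);
        buf[n] = '\0';

        /* Comment lines ('#' or ';') and other keywords never match here. */
        if (strncmp(buf, "nameserver", 10) == 0 &&
            (buf[10] == ' ' || buf[10] == '\t') &&
            sscanf(buf + 10, "%63s", addr) == 1) {
            strcpy(out[count], addr);
            count++;
        }

        line += len;
        if (*line == '\n')
            line++;
    }
    return count;
}
```

Calling it with `max` set to 3 mirrors the fact that glibc only uses the first three `nameserver` entries (`MAXNS` in `<resolv.h>`). On the affected host's file that would leave `10.128.0.1` first in line.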

cobaltcause commented

2025-07-17 00:22:19 +00:00

Ah here's one other data point, the output of this command is from the affected system where resolution breaks:

# docker run -it git.lix.systems/lix-project/lix:2.92.3 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 10.128.0.1
nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []

Note that this version of the container succeeds in providing a shell, rather than timing out during name resolution.

cobaltcause commented

2025-07-17 00:29:16 +00:00

Possibly relevant: `git.lix.systems/lix-project/lix:2.92.3` has glibc 2.40-36, `git.lix.systems/lix-project/lix:2.93.2` has glibc 2.40-66. Neither contains an `/etc/nsswitch.conf`.

jade commented

2025-07-17 01:11:38 +00:00

I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf right?

If it is, I wonder why NSS is correctly skipping it on the host and not on the guest. Certainly there's likely a misconfiguration involved here.

Idea: what if you LD_PRELOAD the old libc? I also wonder if there's a way to extract debug data out of NSS.

pennae commented

2025-07-17 12:43:55 +00:00

this also happens in our vm tests for certain curl operations in the daemon, it just went unnoticed. forcibly disabling nscd does not change anything. joining the namespace of the daemon and running host cache.nixos.org or trying to download something also just works. turning up the curl tracing options to 11 yields absolutely no useful information. enabling networkd config and thus replacing the resolver does not help. running the test from the interactive driver does seem to help, at least we don't see any curl errors there.

non-interactive runs with resolved have curl errors and print resolved warnings to the log:

machine # [    6.901414] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    6.903126] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    6.904342] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    9.157136] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    9.159601] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    9.162637] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    9.164857] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.

the interactive run is definitely using resolved as well, but it's not getting any of these errors. even in the interactive runner we can reproduce curl errors with just

machine.succeed("""nix-build --expr 'derivation { name = "a"; system = __currentSystem; builder = "builtin:fetchurl"; url = "https://cache.nixos.org"; outputHashMode = "flat"; }' >&2""")

setting explicit dns servers in FileTransfer does not help, in fact all failing name resolution is sent to 127.0.0.1 for some reason even though that's configured absolutely nowhere as a dns server?

we never see these errors outside of the vm tests either.

cobaltcause commented

2025-07-17 14:40:34 +00:00

> I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf right?

The host is using systemd-resolved so technically it's not in its `/etc/resolv.conf`, but `resolvectl status` shows that it is configured as a DNS server for one of my wireguard interfaces on this machine. If I `resolvectl query github.com` on the host I get:

github.com: 140.82.116.3                       -- link: enp2s0

-- Information acquired via protocol DNS in 16.4ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network

which looks right (i.e. is not using the wireguard interface's DNS).

> this also happens in our vm tests for certain curl operations in the daemon

FWIW, it also seems to happen to CppNix users without any kind of virtualization:

* https://github.com/NixOS/nix/issues/13341
* https://github.com/NixOS/nix/issues/13466

pennae commented

2025-07-17 16:05:36 +00:00

our reproducer from above is invalid, calling fetchurl like this runs it as an IA derivation without network access. no wonder that it fails

👀 1

raito added the

Context

maintainers

release-blocker

labels

2025-07-17 21:04:09 +00:00

raito pinned this

2025-07-22 12:55:09 +00:00

raito commented

2025-07-26 01:11:10 +00:00

To make progress on the situation:

* I can reproduce this simply by adding a broken DNS server to my `/etc/resolv.conf`, e.g. 50.50.50.50, on latest Lix HEAD.
* `getent` does not suffer from this problem. Even if I run it as `getent hosts -s "hosts:dns" ...`, it gets stuck for a certain amount of time and then moves on to the next DNS entry.
* The following C program does not reproduce it either:

#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>

#include <gnu/lib-names.h>
#include <nss.h>
#include <dlfcn.h>

int main(int argc, char *argv[]) {
    if (!dlopen(LIBNSS_DNS_SO, RTLD_NOW))
        fprintf(stderr, "unable to load nss_dns backend\n");
    __nss_configure_lookup("hosts", "files dns");
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <hostname>\n", argv[0]);
        return 1;
    }

    const char *hostname = argv[1];

    CURL *curl = curl_easy_init();
    if (!curl) {
        fprintf(stderr, "Failed to initialize CURL\n");
        return 1;
    }

    char url[256];
    snprintf(url, sizeof(url), "http://%s", hostname);
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Enable DNS resolution only (no actual HTTP request)
    curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);  // Don't download body
    curl_easy_setopt(curl, CURLOPT_CONNECT_ONLY, 1L);  // Connect only (no HTTP)

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "DNS resolution failed: %s\n", curl_easy_strerror(res));
    } else {
        printf("DNS resolution succeeded for: %s\n", hostname);
    }

    curl_easy_cleanup(curl);
    return 0;
}

I still do not have a clear idea of where the issue is, but c-ares (as suggested by pennae) and glibc remain the prime suspects to me.

jade commented

2025-07-26 03:10:03 +00:00

I'm suspicious it's related to NSS and systemd-resolved causing the resolv.conf to be bypassed if it's busted, perhaps? nss dispatches to resolved and maybe the way that happens changed.

raito commented

2025-07-26 14:52:08 +00:00

There's no systemd-resolved in containers, so it cannot be systemd related.

The resolv.conf is definitely read, but after the first failure, there is no attempt to resolve using a second server, this was tried with a valid nsswitch.conf as well.

I debugged quite hard and what I see is that curl makes use of curl_getaddrinfo, so I guess whatever happens is on the side of getaddrinfo, I will obtain glibc debug symbols and see how far I can go from there.

At this point, I think this is glibc induced because I do not think I am even seeing any exchange with nsncd at all.
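
One way to check how long a blocking `getaddrinfo` call takes on its own, outside of curl, is to time it directly. The following is a minimal sketch under stated assumptions: `time_getaddrinfo_ms` is a hypothetical helper written for this illustration and is not code from Lix or curl. It uses only the POSIX `getaddrinfo`, `freeaddrinfo`, and `clock_gettime` calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Resolve `host` with one blocking getaddrinfo() call and return the elapsed
 * wall-clock time in milliseconds. The getaddrinfo() return code is stored in
 * *gai_rc (0 means success). */
static long time_getaddrinfo_ms(const char *host, int *gai_rc)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;
    struct timespec start, end;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* ask for both IPv4 and IPv6 results */
    hints.ai_socktype = SOCK_STREAM;

    clock_gettime(CLOCK_MONOTONIC, &start);
    *gai_rc = getaddrinfo(host, NULL, &hints, &result);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* The result list is only valid, and only needs freeing, on success. */
    if (*gai_rc == 0)
        freeaddrinfo(result);

    return (end.tv_sec - start.tv_sec) * 1000L +
           (end.tv_nsec - start.tv_nsec) / 1000000L;
}
```

Running this for a real hostname inside the affected container, and comparing the elapsed time with the 5000 ms seen in the curl errors, would show whether the lookup alone already uses up that budget. That is a suggestion for an experiment, not a result from this thread.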

raito commented

2025-07-26 19:01:54 +00:00

Alright, this took all my sanity, but I nailed it down.

Currently, curl is built with getaddrinfo, not c-ares, so when you perform a DNS query, you call getaddrinfo with the hostname.

In the meantime, Lix polls via `curl_multi` while waiting. The connect timeout is set to 5 seconds, and this connect timeout ALSO covers the time spent on name resolution.

Furthermore, curl has no way to know what `getaddrinfo` is up to (e.g. whether it has moved on to the 2nd nameserver), so it cannot reset the timeout accordingly.

As a result, as soon as your first N entries are bogus, queries fail systematically. Thankfully, @k900 reminded me there's a `MAXNS=3` hardcoded in glibc, which means the worst case is 2 broken nameservers followed by 1 valid nameserver.

Therefore, the simplest fix is to perform exponential backoff on the connect timeout. The alternative would be replacing the DNS resolver with something aware of what's going on, e.g. possibly c-ares and https://curl.se/libcurl/c/CURLOPT_RESOLVER_START_FUNCTION.html (thanks to bch on #curl@libera.chat for the tip), with custom logic on our side.

We should probably look into what is the default nameserver timeout on a normal system with 2 broken nameservers and 1 valid nameserver.
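
The arithmetic behind this can be sketched with a small model. It rests on several assumptions that are not confirmed by the thread: each dead nameserver costs one full per-server timeout (5 seconds is the documented glibc default for the `timeout` option in resolv.conf(5)), every retry restarts resolution from the first nameserver, and the backoff doubles the budget on each attempt. Both function names are hypothetical.

```c
#include <stdbool.h>

/* Simplified model: every dead nameserver ahead of the first working one
 * costs one full per-server timeout before the resolver moves on. */
static long modelled_resolve_ms(int dead_servers, long per_server_timeout_ms,
                                long answer_ms)
{
    return dead_servers * per_server_timeout_ms + answer_ms;
}

/* Return the 1-based attempt whose time budget covers `resolve_ms`, or 0 if
 * none of the `max_attempts` attempts does. With `backoff` false every
 * attempt gets the same budget; with `backoff` true the budget doubles after
 * each failed attempt (doubling is an assumption made for this sketch). */
static int first_successful_attempt(long resolve_ms, long base_timeout_ms,
                                    int max_attempts, bool backoff)
{
    long budget = base_timeout_ms;

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (resolve_ms < budget)
            return attempt;
        if (backoff)
            budget *= 2;
    }
    return 0;
}
```

With one dead nameserver and a 16 ms answer from the next one, the model gives 5016 ms. A fixed 5000 ms budget never covers that within 5 attempts, while a doubling budget covers it on attempt 2 (10000 ms). With two dead nameservers (10016 ms) the doubling budget first covers it on attempt 3 (20000 ms).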

raito referenced this issue

2025-07-26 19:32:07 +00:00

Connect attempts should progress in an exponential backoff manner #932

raito added the

Affects/Stable

Affects/Nightly

E/reproducible

labels

2025-07-26 19:33:33 +00:00

cobaltcause commented

2025-07-31 22:10:23 +00:00

So fun fact, this also affects fetching from s3 binary caches via aws-sdk-cpp even on 2.92.3.

jade referenced this issue

2025-08-03 12:06:03 +00:00

Tracking issue: http library testing #949

raito self-assigned this

2025-10-13 11:33:37 +00:00

raito commented

2025-10-29 21:38:48 +00:00

Expectations for removing from the release blocker:

- Build a new image of Lix for Docker, use it with a functional DNS resolution with sandbox and pasta.
- Repeat the test without sandbox with a functional DNS resolution AND then a dysfunctional DNS resolution with a second entry.

If these tests pass, this problem will be considered closed.

The S3 variant of this problem will be out of scope for this change and we should track in a new issue.

lix-bot commented

2025-10-29 21:47:27 +00:00

This issue was mentioned on Gerrit on the following CLs:

* commit message in cl/4501 (https://gerrit.lix.systems/c/lix/+/4501) ("doc/manual: provide more information about Pasta and its shortcomings")

lix-project referenced this issue from a commit

2025-10-29 23:33:12 +00:00

doc/manual: provide more information about Pasta and its shortcomings

raito commented

2025-11-07 01:19:53 +00:00

Under functional DNS resolution:

- builds & substitution without sandbox work
- builds & substitution with sandbox and pasta work (as long as you disable seccomp and user namespaces are available via remapping)

Under failed DNS resolution:

- builds & substitution without sandbox work

This is confirmed to fail on 2.93.3. On recent HEAD in Lix:

❯ docker run --security-opt seccomp=unconfined -it 5de50f32f1fb nix-build -E '(import <nixpkgs> {}).runCommand "test" { } "echo coucou > $out"' --sandbox
_alias_tips__preexec:37: command not found: python3
warning: error: unable to download 'https://cache.nixos.org/nix-cache-info': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 470ms ms (attempt 1/5)
this derivation will be built:
  /nix/store/k9mm1jwbl93nj4as8470r2rfv1snba7c-test.drv
these 12 paths will be fetched (1.84 MiB download, 15.29 MiB unpacked):
  /nix/store/w1pxx760yidi7n9vbi5bhpii9xxl5vdj-bzip2-1.0.8-bin
  /nix/store/xw0mf3shymq3k7zlncf09rm8917sdi4h-diffutils-3.12
  /nix/store/89wrfml7fd8g1vjy29p67vfgxiaj6b12-ed-1.21.1
  /nix/store/xlmpcglsq8l09qh03rf0virz0331pjdc-file-5.45
  /nix/store/c1z5j28ndxljf1ihqzag57bwpfpzms0g-gawk-5.3.2
  /nix/store/khmqxw6b9q7rgkv6hf3gcqf2igk03z1g-gnu-config-2024-01-01
  /nix/store/xk0d14zpm0njxzdm182dd722aqhav2cc-gnumake-4.4.1
  /nix/store/gj54zvf7vxll1mzzmqhqi1p4jiws3mfb-patch-2.7.6
  /nix/store/g7i75czfbw9sy5f8v7rjbama6lr3ya3s-patchelf-0.15.0
  /nix/store/7iirrwzdlzrhwh2b7dlkd6y65riyg4cc-stdenv-linux
  /nix/store/gi6g289i9ydm3z896x67q210y0qq29zg-update-autotools-gnu-config-scripts-hook
  /nix/store/22rpb6790f346c55iqi6s9drr5qgmyjf-xz-5.8.1-bin
copying path '/nix/store/22rpb6790f346c55iqi6s9drr5qgmyjf-xz-5.8.1-bin' from 'https://cache.nixos.org'...
copying path '/nix/store/c1z5j28ndxljf1ihqzag57bwpfpzms0g-gawk-5.3.2' from 'https://cache.nixos.org'...
copying path '/nix/store/89wrfml7fd8g1vjy29p67vfgxiaj6b12-ed-1.21.1' from 'https://cache.nixos.org'...
copying path '/nix/store/xlmpcglsq8l09qh03rf0virz0331pjdc-file-5.45' from 'https://cache.nixos.org'...
copying path '/nix/store/g7i75czfbw9sy5f8v7rjbama6lr3ya3s-patchelf-0.15.0' from 'https://cache.nixos.org'...
copying path '/nix/store/xk0d14zpm0njxzdm182dd722aqhav2cc-gnumake-4.4.1' from 'https://cache.nixos.org'...
copying path '/nix/store/xw0mf3shymq3k7zlncf09rm8917sdi4h-diffutils-3.12' from 'https://cache.nixos.org'...
copying path '/nix/store/khmqxw6b9q7rgkv6hf3gcqf2igk03z1g-gnu-config-2024-01-01' from 'https://cache.nixos.org'...
copying path '/nix/store/w1pxx760yidi7n9vbi5bhpii9xxl5vdj-bzip2-1.0.8-bin' from 'https://cache.nixos.org'...
copying path '/nix/store/gj54zvf7vxll1mzzmqhqi1p4jiws3mfb-patch-2.7.6' from 'https://cache.nixos.org'...
copying path '/nix/store/gi6g289i9ydm3z896x67q210y0qq29zg-update-autotools-gnu-config-scripts-hook' from 'https://cache.nixos.org'...
copying path '/nix/store/7iirrwzdlzrhwh2b7dlkd6y65riyg4cc-stdenv-linux' from 'https://cache.nixos.org'...
building '/nix/store/k9mm1jwbl93nj4as8470r2rfv1snba7c-test.drv'...
/nix/store/zq7sxf3f5yra8gp53fmz4yjnx4raiif2-test

This sounds like the problem is fixed to me.
Feel free to reopen if you disagree with the analysis.

raito closed this issue

2025-11-07 01:19:53 +00:00
