lix in lix's container images fails to download things #920

Open
opened 2025-07-16 23:12:06 +00:00 by cobaltcause · 12 comments
Member

I'm using lix's container images (`git.lix.systems/lix-project/lix:...`) for GitLab CI, and I get errors like this with newer versions:

```
fetching github input 'github:NixOS/nixpkgs/9807714d6944a957c2e036f84b0ff8caf9930bc0'
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5002 milliseconds (curl error code=28); retrying in 338 ms (attempt 1/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 558 ms (attempt 2/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 1183 ms (attempt 3/5)
warning: error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5000 milliseconds (curl error code=28); retrying in 2493 ms (attempt 4/5)
error:
       … while fetching the input 'github:NixOS/nixpkgs/9807714d6944a957c2e036f84b0ff8caf9930bc0'
       error: unable to download 'https://github.com/NixOS/nixpkgs/archive/9807714d6944a957c2e036f84b0ff8caf9930bc0.tar.gz': Resolving timed out after 5002 milliseconds (curl error code=28)
```

Older versions work, though. Here's an example: https://gitlab.computer.surgery/charles/derail/-/merge_requests/10#note_3645

Here are the versions I tried:

| Version | Affected |
|---------|----------|
| 2.93.2  | Yes |
| 2.93.1  | Yes |
| 2.93.0  | Yes |
| 2.92.3  | No, works fine |
| 2.92.2  | No, works fine |
It's entirely possible I'm doing something wrong, I guess, but it seems odd that changing the lix container version is what causes this error.

Author
Member

@raito helped me troubleshoot this out of band. What ended up revealing the problem to us was running `while true; do foo="$(pgrep '^nix$')"; if [[ -n "$foo" ]]; then strace -yy -fp "$foo" -o log.strace; break; fi; done` and then running `docker run -e NIX_REMOTE=local -it git.lix.systems/lix-project/lix:2.93.2 nix shell nixpkgs#hello --extra-experimental-features 'nix-command flakes'` immediately after. Raito noticed the following bits of the strace:

```
288944 read(14</etc/resolv.conf>, "# Generated by Docker Engine.\n# "..., 4096) = 321
###### [etc]
288944 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 14<UDP:[546453]>
288944 setsockopt(14<UDP:[546453]>, SOL_IP, IP_RECVERR, [1], 4) = 0
288944 connect(14<UDP:[546453]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.128.0.1")}, 16) = 0
288944 poll([{fd=14<UDP:[546453]>, events=POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
288944 sendmmsg(14<UDP:[546453]>, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\250\273\1\0\0\1\0\0\0\0\0\0\3api\6github\3com\0\0\1\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\343\265\1\0\0\1\0\0\0\0\0\0\3api\6github\3com\0\0\34\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}], 2, MSG_NOSIGNAL) = 2
```

which led to curiosity as to what `/etc/resolv.conf` contained:

```
# docker run -it git.lix.systems/lix-project/lix:2.93.2 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 10.128.0.1
nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
```

On my desktop, the `nix shell` command inside docker works fine, so here's that for comparison:

```
$ docker run -it git.lix.systems/lix-project/lix:2.93.2 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
```

So, so far, it seems like *something* is adding that extra nameserver at the top, which doesn't respond for github.com, and then the other nameservers are skipped and it just gives up.

Author
Member

Ah, here's one other data point: the output of the same command on the affected system, where resolution breaks:

```
# docker run -it git.lix.systems/lix-project/lix:2.92.3 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 10.128.0.1
nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
nameserver 8.8.4.4
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
```

Note that this version of the container succeeds in providing a shell, rather than timing out during name resolution.

Author
Member

Possibly relevant: `git.lix.systems/lix-project/lix:2.92.3` has glibc 2.40-36, while `git.lix.systems/lix-project/lix:2.93.2` has glibc 2.40-66. Neither contains an `/etc/nsswitch.conf`.

Owner

I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf, right?

If it is, I wonder why NSS correctly skips it on the host but not in the guest. There's likely a misconfiguration involved here.

Idea: what if you LD_PRELOAD the old libc? I also wonder if there's a way to extract debug data out of NSS.

Owner

this also happens in our vm tests for certain curl operations in the daemon; it just went unnoticed. forcibly disabling nscd does not change anything. joining the namespace of the daemon and running `host cache.nixos.org` or trying to download something also just works. turning the curl tracing options up to 11 yields absolutely no useful information. enabling networkd config, and thus replacing the resolver, does not help. running the test from the interactive driver *does* seem to help; at least we don't see any curl errors there.

non-interactive runs with resolved have curl errors and print resolved warnings to the log:

```
machine # [    6.901414] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    6.903126] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    6.904342] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    9.157136] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    9.159601] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
machine # [    9.162637] systemd-resolved[421]: Using degraded feature set TCP instead of UDP for DNS server 10.0.2.3.
machine # [    9.164857] systemd-resolved[421]: Using degraded feature set UDP instead of TCP for DNS server 10.0.2.3.
```

the interactive run is definitely using resolved as well, but it's not getting any of these errors. even in the interactive runner we can reproduce curl errors with just

```
machine.succeed("""nix-build --expr 'derivation { name = "a"; system = __currentSystem; builder = "builtin:fetchurl"; url = "https://cache.nixos.org"; outputHashMode = "flat"; }' >&2""")
```

setting explicit dns servers in FileTransfer does not help; in fact, all failing name resolution is sent to `127.0.0.1` for some reason, even though that's configured *absolutely nowhere* as a dns server?

we never see these errors outside of the vm tests either.

Author
Member

> I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf right?

The host is using systemd-resolved, so technically it's not in its `/etc/resolv.conf`, but `resolvectl status` shows that it is configured as a DNS server for one of my wireguard interfaces on this machine. If I `resolvectl query github.com` on the host I get:

```
github.com: 140.82.116.3                       -- link: enp2s0

-- Information acquired via protocol DNS in 16.4ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network
```

which looks right (i.e. is not using the wireguard interface's DNS).

> this also happens in our vm tests for certain curl operations in the daemon

FWIW, it also seems to happen to CppNix users without any kind of virtualization:

- https://github.com/NixOS/nix/issues/13341
- https://github.com/NixOS/nix/issues/13466
Owner

our reproducer from above is invalid: calling fetchurl like this runs it as an IA derivation without network access. no wonder that it fails

Owner

To make progress on the situation:

- I can reproduce this just by adding a broken DNS server, e.g. 50.50.50.50, to my /etc/resolv.conf on latest Lix HEAD.
- `getent` does not suffer from this problem; even with `getent hosts -s "hosts:dns" ...`, it gets stuck for a certain amount of time and then moves on to the next DNS entry.
- The following C program does not reproduce it either:
```c
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>

#include <gnu/lib-names.h>
#include <nss.h>
#include <dlfcn.h>

int main(int argc, char *argv[]) {
    if (!dlopen(LIBNSS_DNS_SO, RTLD_NOW))
        fprintf(stderr, "unable to load nss_dns backend\n");
    __nss_configure_lookup("hosts", "files dns");
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <hostname>\n", argv[0]);
        return 1;
    }

    const char *hostname = argv[1];

    CURL *curl = curl_easy_init();
    if (!curl) {
        fprintf(stderr, "Failed to initialize CURL\n");
        return 1;
    }

    char url[256];
    snprintf(url, sizeof(url), "http://%s", hostname);
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Enable DNS resolution only (no actual HTTP request)
    curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);  // Don't download body
    curl_easy_setopt(curl, CURLOPT_CONNECT_ONLY, 1L);  // Connect only (no HTTP)

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "DNS resolution failed: %s\n", curl_easy_strerror(res));
    } else {
        printf("DNS resolution succeeded for: %s\n", hostname);
    }

    curl_easy_cleanup(curl);
    return 0;
}
```

I still do not have a clear idea of where the issue is, but c-ares (as suggested by pennae) and glibc remain the prime suspects to me.

Owner

I'm suspicious it's related to NSS and systemd-resolved causing the resolv.conf to be bypassed if it's busted, perhaps? nss dispatches to resolved and maybe the way that happens changed.

Owner

There's no systemd-resolved in containers, so it cannot be systemd-related.

The resolv.conf is definitely read, but after the first failure there is no attempt to resolve using a second server; this was tried *with* a valid nsswitch.conf as well.

I debugged quite hard, and what I see is that curl makes use of `curl_getaddrinfo`, so I guess whatever happens is on the side of `getaddrinfo`. I will obtain glibc debug symbols and see how far I can go from there.

At this point, I think this is glibc-induced, because I do not think I am even seeing any exchange with nsncd *at all*.

Owner

Alright, this took all my sanity, but I nailed it down.

Currently, curl is built with `getaddrinfo`, not c-ares, so when you perform a DNS query, you call `getaddrinfo` with the hostname.

In the meantime, Lix, via `curl_multi`, polls waiting for things. The connect timeout is set to **5 seconds**, and this connect timeout ALSO factors the resolution time into it.

Furthermore, curl has no way to know what `getaddrinfo` is up to, e.g. whether it has moved on to the 2nd nameserver, and to reset the timeout accordingly.

As a result, as soon as your first N resolv.conf entries are bogus, the end result is systematic query failures. Thankfully, @k900 reminded me there's a `MAXNS=3` hardcoded in glibc; this means that at most 2 broken nameservers can exist ahead of 1 valid nameserver.

Therefore, the simplest fix that does not involve replacing the DNS resolver with something aware of what's going on (e.g. possibly c-ares plus https://curl.se/libcurl/c/CURLOPT_RESOLVER_START_FUNCTION.html, thanks to bch on `#curl@libera.chat` for the tip, with custom logic on our side) is to perform **exponential backoff** on the connect timeout.

We should probably look into what is the default nameserver timeout on a normal system with 2 broken nameservers and 1 valid nameserver.

Author
Member

So fun fact, this also affects fetching from s3 binary caches via aws-sdk-cpp even on 2.92.3.
