lix in lix's container images fails to download things #920
Reference: lix-project/lix#920
I'm using lix's container images (git.lix.systems/lix-project/lix:...) for GitLab CI, and I get errors like this with newer versions:

Older versions work, though. Here's an example: https://gitlab.computer.surgery/charles/derail/-/merge_requests/10#note_3645
Here are the versions I tried:
It's entirely possible I'm doing something wrong, I guess, but it seems odd that changing the lix container version is what causes this error.
@raito helped me troubleshoot this out of band. What ended up revealing the problem to us was running
while true; do foo="$(pgrep '^nix$')"; if [[ -n "$foo" ]]; then strace -yy -fp "$foo" -o log.strace; break; fi; done
and then running
docker run -e NIX_REMOTE=local -it git.lix.systems/lix-project/lix:2.93.2 nix shell nixpkgs#hello --extra-experimental-features 'nix-command flakes'
immediately after. Raito noticed the following bits of the strace:

which led to curiosity as to what /etc/resolv.conf contained:

On my desktop, the nix shell command inside docker works fine, so here's that for comparison:

So far, it seems like something is adding that extra nameserver at the top, which doesn't respond for github.com, and then the other nameservers are skipped and it just gives up.
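To make the "extra nameserver at the top" observation concrete, here is a minimal Python sketch that lists the nameserver entries of a resolv.conf in the order a resolver would try them. The file contents are hypothetical: 10.128.0.1 is the address discussed in this thread, and the 192.0.2.x addresses are documentation placeholders, not values from the affected system. The cap of 3 reflects glibc's MAXNS limit; the parser itself is an illustration and not how glibc reads the file.

```python
MAXNS = 3  # glibc only keeps this many nameserver entries


def nameservers(resolv_conf_text, maxns=MAXNS):
    """Return nameserver addresses in the order they appear, capped at maxns."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers[:maxns]


# Hypothetical contents: an unresponsive server listed first.
SAMPLE = """\
# hypothetical resolv.conf
nameserver 10.128.0.1
nameserver 192.0.2.53
nameserver 192.0.2.54
nameserver 192.0.2.55
"""

print(nameservers(SAMPLE))
# → ['10.128.0.1', '192.0.2.53', '192.0.2.54']
```

If the first entry never answers, every lookup has to wait for it to time out before the second entry is even tried, and a fourth entry would never be used at all.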
Ah here's one other data point, the output of this command is from the affected system where resolution breaks:
Note that this version of the container succeeds in providing a shell, rather than timing out during name resolution.
Possibly relevant: git.lix.systems/lix-project/lix:2.92.3 has glibc 2.40-36, git.lix.systems/lix-project/lix:2.93.2 has glibc 2.40-66. Neither contains an /etc/nsswitch.conf.

I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf right?
If it is, I wonder why NSS is correctly skipping it on the host and not on the guest. Certainly there's likely a misconfiguration involved here.
Idea: what if you LD_PRELOAD the old libc? I also wonder if there's a way to extract debug data out of NSS.
this also happens in our vm tests for certain curl operations in the daemon, it just went unnoticed. forcibly disabling nscd does not change anything. joining the namespace of the daemon and running host cache.nixos.org or trying to download something also just works. turning up the curl tracing options to 11 yields absolutely no useful information. enabling networkd config and thus replacing the resolver does not help. running the test from the interactive driver does seem to help, at least we don't see any curl errors there.

non-interactive runs with resolved have curl errors and print resolved warnings to the log:

the interactive run is definitely using resolved as well, but it's not getting any of these errors. even in the interactive runner we can reproduce curl errors with just

setting explicit dns servers in FileTransfer does not help, in fact all failing name resolution is sent to 127.0.0.1 for some reason even though that's configured absolutely nowhere as a dns server?

we never see these errors outside of the vm tests either.
The host is using systemd-resolved so technically it's not in its /etc/resolv.conf, but resolvectl status shows that it is configured as a DNS server for one of my wireguard interfaces on this machine. If I resolvectl query github.com on the host I get:

which looks right (i.e. is not using the wireguard interface's DNS).
FWIW, it also seems to happen to CppNix users without any kind of virtualization:
our reproducer from above is invalid, calling fetchurl like this runs it as an IA derivation without network access. no wonder that it fails
To make progress on the situation: getent does not suffer from this problem. Even if I pass it getent hosts -s "hosts:dns" ..., it will get stuck for a certain amount of time and then go to the next DNS entry.

I still do not have a clear idea of where the issue is, but c-ares (as suggested by pennae) or glibc remain the prime suspects to me.
I'm suspicious it's related to NSS and systemd-resolved causing the resolv.conf to be bypassed if it's busted, perhaps? nss dispatches to resolved and maybe the way that happens changed.
There's no systemd-resolved in containers, so it cannot be systemd related.
The resolv.conf is definitely read, but after the first failure, there is no attempt to resolve using a second server, this was tried with a valid nsswitch.conf as well.
I debugged quite hard and what I see is that curl makes use of curl_getaddrinfo, so I guess whatever happens is on the side of getaddrinfo. I will obtain glibc debug symbols and see how far I can go from there.

At this point, I think this is glibc induced because I do not think I am even seeing any exchange with nsncd at all.
Alright, this took all my sanity, but I nailed it down.
Currently, curl is built with getaddrinfo, not c-ares, so when you perform a DNS query, you call getaddrinfo with the hostname.

In the meantime, Lix via curl_multi will poll waiting for things. The connect timeout is set to 5 seconds, and this connect timeout ALSO includes the resolution time.

Furthermore, curl has no way to know what getaddrinfo is up to (e.g. whether it has moved on to the 2nd nameserver), so it cannot reset the timeout accordingly.

As a result, as soon as your first N entries are bogus, the end result will be systematic query failures. Thankfully, @k900 reminded me there's a MAXNS=3 hardcoded in glibc, which means there can be at most 2 broken nameservers ahead of 1 valid nameserver.

Therefore, the simplest fix which does not involve replacing the DNS resolver by something aware of what's going on (e.g. c-ares possibly, and https://curl.se/libcurl/c/CURLOPT_RESOLVER_START_FUNCTION.html, thanks to bch on #curl@libera.chat for the tip, with custom logic on our side) is to perform exponential backoff on the connect timeout.

We should probably look into what the default nameserver timeout is on a normal system with 2 broken nameservers and 1 valid nameserver.
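As a rough illustration of why a fixed 5 second connect timeout fails here and why backing off would help, here is a small Python model. It is a sketch under stated assumptions, not Lix's implementation: it assumes each dead nameserver costs one full resolver timeout of 5 seconds (the default for the `timeout` option documented in resolv.conf(5)), that every retry starts resolution again from the first nameserver, and that a live nameserver answers in a hypothetical 0.1 seconds. The backoff factor of 2 and the limit of 4 attempts are made up for the example.

```python
def resolution_time(dead_nameservers, per_ns_timeout=5.0, answer_time=0.1):
    """Seconds until the first live nameserver answers (simplified model)."""
    return dead_nameservers * per_ns_timeout + answer_time


def backoff_timeouts(base=5.0, factor=2.0, attempts=4):
    """Connect timeouts for successive attempts: base, base*factor, ..."""
    return [base * factor ** i for i in range(attempts)]


def first_successful_attempt(dead_nameservers, timeouts):
    """Return (attempt number, timeout) of the first attempt whose timeout
    is long enough to cover name resolution, or None if none is."""
    needed = resolution_time(dead_nameservers)
    for attempt, timeout in enumerate(timeouts, start=1):
        if needed < timeout:
            return attempt, timeout
    return None


# MAXNS=3 means at most 2 dead nameservers can precede a live one.
for dead in range(3):
    fixed = first_successful_attempt(dead, [5.0])  # single 5 s attempt
    backoff = first_successful_attempt(dead, backoff_timeouts())
    print(dead, fixed, backoff)
# → 0 (1, 5.0) (1, 5.0)
# → 1 None (2, 10.0)
# → 2 None (3, 20.0)
```

Under these assumptions a single dead nameserver is already enough to exhaust a fixed 5 second budget, while a doubling timeout gets through the worst case (2 dead, 1 valid) on the third attempt.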
So fun fact, this also affects fetching from s3 binary caches via aws-sdk-cpp even on 2.92.3.