lix in lix's container images fails to download things #920
	
I'm using lix's container images (`git.lix.systems/lix-project/lix:...`) for GitLab CI, and I get errors like this with newer versions:

Older versions work, though. Here's an example: https://gitlab.computer.surgery/charles/derail/-/merge_requests/10#note_3645

Here are the versions I tried:

It's entirely possible I'm doing something wrong, I guess, but it seems odd that changing the lix container version is what causes this error.
@raito helped me troubleshoot this out of band. What ended up revealing the problem to us was running

```
while true; do foo="$(pgrep '^nix$')"; if [[ -n "$foo" ]]; then strace -yy -fp "$foo" -o log.strace; break; fi; done
```

and then running

```
docker run -e NIX_REMOTE=local -it git.lix.systems/lix-project/lix:2.93.2 nix shell nixpkgs#hello --extra-experimental-features 'nix-command flakes'
```

immediately after. Raito noticed the following bits of the strace:

which led to curiosity as to what `/etc/resolv.conf` contained:

On my desktop, the `nix shell` command inside docker works fine, so here's that for comparison:

So far, it seems like something is adding that extra nameserver at the top, which doesn't respond for github.com; the other nameservers are then skipped and resolution just gives up.
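For anyone else poking at this, one quick way to compare what the container resolves with against what the host uses is sketched below. This is only a sketch: it assumes a stock docker setup (which, on the default bridge network, copies a filtered version of the host's `/etc/resolv.conf` into the container) and that the image ships `cat`.

```
# what the container resolves with
docker run --rm git.lix.systems/lix-project/lix:2.93.2 cat /etc/resolv.conf

# what the host resolves with
cat /etc/resolv.conf
resolvectl status    # only meaningful if the host runs systemd-resolved
```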
Ah, here's one other data point: the output of this command is from the affected system, where resolution breaks:
Note that this version of the container succeeds in providing a shell, rather than timing out during name resolution.
Possibly relevant: `git.lix.systems/lix-project/lix:2.92.3` has glibc 2.40-36, while `git.lix.systems/lix-project/lix:2.93.2` has glibc 2.40-66. Neither contains an `/etc/nsswitch.conf`.

I'm guessing 10.128.0.1 is a local recursive nameserver? Or is it in the docker network? Probably it's in the host resolv.conf, right?
If it is, I wonder why NSS is correctly skipping it on the host and not on the guest. Certainly there's likely a misconfiguration involved here.
Idea: what if you LD_PRELOAD the old libc? I also wonder if there's a way to extract debug data out of NSS.
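On the glibc point above, one way to confirm what each tag actually ships is a sketch like the following (assumes docker is available and that the images include `ldd` on the PATH):

```
# print the glibc version bundled in each container tag
for tag in 2.92.3 2.93.2; do
  echo "=== $tag ==="
  docker run --rm "git.lix.systems/lix-project/lix:$tag" ldd --version | head -n1
done
```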
this also happens in our vm tests for certain curl operations in the daemon, it just went unnoticed. forcibly disabling nscd does not change anything. joining the namespace of the daemon and running `host cache.nixos.org` or trying to download something also just works. turning up the curl tracing options to 11 yields absolutely no useful information. enabling networkd config and thus replacing the resolver does not help. running the test from the interactive driver does seem to help, at least we don't see any curl errors there.

non-interactive runs with resolved have curl errors and print resolved warnings to the log:

the interactive run is definitely using resolved as well, but it's not getting any of these errors. even in the interactive runner we can reproduce curl errors with just

setting explicit dns servers in FileTransfer does not help, in fact all failing name resolution is sent to `127.0.0.1` for some reason even though that's configured absolutely nowhere as a dns server? we never see these errors outside of the vm tests either.
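The "joining the namespace of the daemon" step above might look roughly like this; it's only a sketch and assumes `nsenter` is available in the test VM and that the daemon process is named `nix-daemon`:

```
# run the same resolver check inside the daemon's network namespace
nsenter -t "$(pgrep -ox nix-daemon)" -n host cache.nixos.org
```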
The host is using systemd-resolved, so technically it's not in its `/etc/resolv.conf`, but `resolvectl status` shows that it is configured as a DNS server for one of my wireguard interfaces on this machine. If I `resolvectl query github.com` on the host I get:

which looks right (i.e. it is not using the wireguard interface's DNS).
FWIW, it also seems to happen to CppNix users without any kind of virtualization:
our reproducer from above is invalid: calling `fetchurl` like this runs it as an IA derivation without network access. no wonder that it fails
To make progress on the situation: `getent` does not suffer from this problem. Even if I pass it `getent hosts -s "hosts:dns" ...`, it will get stuck for a certain amount of time and then go on to the next DNS entry.

I still do not have a clear idea of where the issue is, but c-ares (as suggested by pennae) or glibc remain the prime suspects to me.
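For reference, the fall-over described above can be watched from a shell on the affected system; this is only a sketch, and `github.com` stands in for whatever name fails to resolve:

```
# force the lookup through the "dns" NSS service only, and time it; a long
# stall followed by a successful answer means glibc gave up on one
# nameserver and moved on to the next
time getent hosts -s "hosts:dns" github.com
```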
I'm suspicious it's related to NSS and systemd-resolved causing the resolv.conf to be bypassed if it's busted, perhaps? NSS dispatches to resolved, and maybe the way that happens changed.
There's no systemd-resolved in containers, so it cannot be systemd related.
The resolv.conf is definitely read, but after the first failure there is no attempt to resolve using a second server; this was tried with a valid nsswitch.conf as well.
I debugged quite hard, and what I see is that curl makes use of `curl_getaddrinfo`, so I guess whatever happens is on the side of `getaddrinfo`. I will obtain glibc debug symbols and see how far I can go from there.

At this point, I think this is glibc-induced, because I do not think I am even seeing any exchange with nsncd at all.
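One way to watch that handoff is to attach a debugger to the downloading process; a sketch, assuming gdb and glibc debug symbols are available and that the process of interest is the `nix` client:

```
# break whenever curl hands a hostname to the libc resolver
gdb -p "$(pgrep -nx nix)" \
    -ex 'break getaddrinfo' \
    -ex 'continue'
```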
Alright, this took all my sanity, but I nailed it down.

Currently, curl is built with `getaddrinfo`, not c-ares, so when you perform a DNS query, you call `getaddrinfo` with the hostname. In the meantime, Lix, via `curl_multi`, polls waiting for things; the connect timeout is set to 5 seconds, and this connect timeout ALSO factors in the resolution time. Furthermore, curl has no way to know what `getaddrinfo` is up to (e.g. whether it is currently trying the 2nd nameserver) and so cannot reset the timeout accordingly.

As a result, as soon as your first N nameserver entries are bogus, the end result is systematic query failures. Thankfully, @k900 reminded me there's a `MAXNS=3` hardcoded in glibc; this means that at most 2 broken nameservers and 1 valid nameserver can exist.

Therefore, the simplest fix, one which does not involve replacing the DNS resolver with something aware of what's going on (e.g. c-ares, possibly combined with https://curl.se/libcurl/c/CURLOPT_RESOLVER_START_FUNCTION.html, thanks to bch on #curl@libera.chat for the tip, plus custom logic on our side), is to perform exponential backoff on the connect timeout.

We should probably look into what the default nameserver timeout is on a normal system with 2 broken nameservers and 1 valid nameserver.
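On that last question: glibc's stub resolver defaults (per resolv.conf(5)) are a 5-second timeout per nameserver and 2 attempts, on top of `MAXNS=3`, so with two dead nameservers listed ahead of a working one the first successful answer only arrives after roughly 10 seconds, well past the 5-second connect timeout. A rough way to measure it, as a sketch: the 192.0.2.x addresses are unroutable TEST-NET placeholders, 9.9.9.9 stands in for any working resolver, and `debian:stable` is just a convenient glibc-based image.

```
# two dead nameservers ahead of a working one
cat > /tmp/resolv.broken <<'EOF'
nameserver 192.0.2.1
nameserver 192.0.2.2
nameserver 9.9.9.9
EOF

# run a lookup with that resolv.conf bind-mounted in and time it;
# expect on the order of 10 seconds before an answer comes back
time docker run --rm -v /tmp/resolv.broken:/etc/resolv.conf:ro \
  debian:stable getent hosts github.com
```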
So fun fact, this also affects fetching from s3 binary caches via aws-sdk-cpp even on 2.92.3.
Expectations for removing from the release blocker:
If this test passes, this problem will be considered closed.
The S3 variant of this problem is out of scope for this change and should be tracked in a new issue.
This issue was mentioned on Gerrit on the following CLs: