NOTE: I'm well aware that we have to be careful with this to avoid new
regressions on hydra.nixos.org, so this should only be merged after
extensive testing by more people.
Motivation: I updated Nix in my deployment to 2.9.1 and decided to also
update Hydra in one go (and compile it against the newer Nix). Given
that this also updates the C++ code in `hydra-{queue-runner,eval-jobs}`,
this patch might still become useful in the future.
I recently started to wonder why Hydra no longer sends me email
notifications. I found the following error in the log of
`hydra-notify.service`:
May 22 11:57:29 hydra 9bik0bxyxbrklhx6lqwifd6af8kj84va-hydra-notify[1887289]: fatal: unsafe repository ('/var/lib/hydra/scm/git/3e70c16c266ef70dc4198705a688acccf71e932878f178277c9ac47d133cc663' is owned by someone else)
May 22 11:57:29 hydra 9bik0bxyxbrklhx6lqwifd6af8kj84va-hydra-notify[1887289]: To add an exception for this directory, call:
May 22 11:57:29 hydra 9bik0bxyxbrklhx6lqwifd6af8kj84va-hydra-notify[1887289]: git config --global --add safe.directory /var/lib/hydra/scm/git/3e70c16c266ef70dc4198705a688acccf71e932878f178277c9ac47d133cc663
May 22 11:57:29 hydra 9bik0bxyxbrklhx6lqwifd6af8kj84va-hydra-notify[1886654]: error running build_finished hooks: command `git log --pretty=format:%H%x09%an%x09%ae%x09%at b0c30a7557685d25a8ab3f34fdb775e66db0bc4c..eaf28389fcebc2beca13a802f79b2cca6e9ca309 --git-dir=.git' failed with e>
This is a consequence of Git's fix for CVE-2022-24765[1], so I applied
the same fix as in Nix[2]: using `--git-dir`, which skips the code path
for the ownership check[3].
[1] https://lore.kernel.org/git/xmqqv8veb5i6.fsf@gitster.g/
[2] https://github.com/NixOS/nix/pull/6440
[3] To quote `git(1)`:
> Specifying the location of the ".git" directory using this option
> (or GIT_DIR environment variable) turns off the repository
> discovery that tries to find a directory with ".git" subdirectory
On hydra.nixos.org the queue runner had child processes that were
stuck handling an exception:
Thread 1 (Thread 0x7f501f7fe640 (LWP 1413473) "bld~v54h5zkhmb3"):
#0 futex_wait (private=0, expected=2, futex_word=0x7f50c27969b0 <_rtld_local+2480>) at ../sysdeps/nptl/futex-internal.h:146
#1 __lll_lock_wait (futex=0x7f50c27969b0 <_rtld_local+2480>, private=0) at lowlevellock.c:52
#2 0x00007f50c21eaee4 in __GI___pthread_mutex_lock (mutex=0x7f50c27969b0 <_rtld_local+2480>) at ../nptl/pthread_mutex_lock.c:115
#3 0x00007f50c1854bef in __GI___dl_iterate_phdr (callback=0x7f50c190c020 <_Unwind_IteratePhdrCallback>, data=0x7f501f7fb040) at dl-iteratephdr.c:40
#4 0x00007f50c190d2d1 in _Unwind_Find_FDE () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#5 0x00007f50c19099b3 in uw_frame_state_for () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#6 0x00007f50c190ab90 in uw_init_context_1 () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#7 0x00007f50c190b08e in _Unwind_RaiseException () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#8 0x00007f50c1b02ab7 in __cxa_throw () from /nix/store/dd8swlwhpdhn6bv219562vyxhi8278hs-gcc-10.3.0-lib/lib/libstdc++.so.6
#9 0x00007f50c1d01abe in nix::parseURL (url="root@cb893012.packethost.net") at src/libutil/url.cc:53
#10 0x0000000000484f55 in extraStoreArgs (machine="root@cb893012.packethost.net") at build-remote.cc:35
#11 operator() (__closure=0x7f4fe9fe0420) at build-remote.cc:79
...
Maybe the fork happened while another thread was holding some global
stack unwinding lock
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744). Anyway, since
the hanging child inherits all file descriptors to SSH clients,
shutting down remote builds (via 'child.to = -1' in
State::buildRemote()) doesn't work and 'child.pid.wait()' hangs
forever.
So let's not do any significant work between fork and exec.
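
To illustrate the pattern (a minimal C++ sketch, not the actual
hydra-queue-runner code; the function and argument names are made up):
do everything that can allocate, throw or take locks before fork(), and
keep the child down to async-signal-safe calls such as exec and _exit.

    #include <unistd.h>
    #include <sys/wait.h>
    #include <string>
    #include <vector>

    // Prepare argv up front, in the parent, where throwing an exception
    // or taking a lock is still safe.
    static pid_t spawn(const std::vector<std::string> & args)
    {
        std::vector<char *> argv;
        for (auto & a : args) argv.push_back(const_cast<char *>(a.c_str()));
        argv.push_back(nullptr);

        pid_t pid = fork();
        if (pid == 0) {
            // Child: nothing but exec; no parsing, no exceptions,
            // no stack unwinding, no logging.
            execvp(argv[0], argv.data());
            _exit(127); // exec failed
        }
        return pid;
    }

    int main()
    {
        pid_t pid = spawn({"ssh", "-V"});
        int status = 0;
        waitpid(pid, &status, 0);
    }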
Re-executing this search_related on every access turned out to be a
serious performance problem. If a jobset had a lot of error output
stored and there were many hundreds or thousands of active jobs, this
could easily cause >1Gbps of network traffic.
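
The actual change is in Hydra's Perl/DBIx::Class layer; purely as an
illustration of the idea (hypothetical C++ names, not Hydra code), the
point is to avoid re-running the expensive fetch on every access, e.g.
by caching the result after the first lookup.

    #include <optional>
    #include <string>

    struct Jobset {
        std::optional<std::string> cachedErrorMsg;

        // Stand-in for the expensive database round-trip that used to
        // happen on every access.
        std::string fetchErrorMsgFromDb() { return "..."; }

        const std::string & errorMsg()
        {
            // Fetch once, then reuse the cached value.
            if (!cachedErrorMsg) cachedErrorMsg = fetchErrorMsgFromDb();
            return *cachedErrorMsg;
        }
    };

    int main()
    {
        Jobset j;
        // The second call reuses the cached value; no extra round-trip.
        return j.errorMsg() == j.errorMsg() ? 0 : 1;
    }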
Otherwise, when the port is chosen randomly (e.g. by specifying no port,
or a port of 0), it will just report that the port is 0, not the port
that is actually serving the metrics.
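
For illustration only (a minimal sketch with plain BSD sockets; the
metrics endpoint itself is served by whatever HTTP library Hydra
embeds): after binding to port 0, the kernel-assigned port can be
recovered with getsockname() and reported instead of the requested 0.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <cstdio>

    int main()
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(0); // 0 = let the kernel pick a free port
        bind(fd, (sockaddr *) &addr, sizeof(addr));
        listen(fd, 1);

        // Ask the kernel which port it actually assigned, instead of
        // logging the 0 that was passed in.
        socklen_t len = sizeof(addr);
        getsockname(fd, (sockaddr *) &addr, &len);
        std::printf("serving metrics on 127.0.0.1:%d\n", ntohs(addr.sin_port));
    }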
This is syntactically lighter weight, and demonstrates that there are no
weird dynamic lifetimes involved, just regular passing of a reference to
a callee, which only borrows it for the duration of the call.
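
A contrived illustration (hypothetical types, not the actual Hydra
code): taking a plain reference instead of a smart pointer makes the
borrow obvious to both the reader and the compiler.

    #include <memory>

    struct Config { int maxJobs = 4; };

    // Taking a shared_ptr suggests the callee might keep the object
    // alive beyond the call, even if it never does.
    static int runOld(std::shared_ptr<Config> cfg) { return cfg->maxJobs; }

    // A reference says exactly what happens: the callee borrows cfg for
    // the duration of the call and stores nothing.
    static int run(const Config & cfg) { return cfg.maxJobs; }

    int main()
    {
        Config cfg;
        // The caller keeps ownership; no reference-count traffic.
        return run(cfg) == runOld(std::make_shared<Config>(cfg)) ? 0 : 1;
    }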