On hydra.nixos.org the queue runner had child processes that were
stuck handling an exception:
Thread 1 (Thread 0x7f501f7fe640 (LWP 1413473) "bld~v54h5zkhmb3"):
#0 futex_wait (private=0, expected=2, futex_word=0x7f50c27969b0 <_rtld_local+2480>) at ../sysdeps/nptl/futex-internal.h:146
#1 __lll_lock_wait (futex=0x7f50c27969b0 <_rtld_local+2480>, private=0) at lowlevellock.c:52
#2 0x00007f50c21eaee4 in __GI___pthread_mutex_lock (mutex=0x7f50c27969b0 <_rtld_local+2480>) at ../nptl/pthread_mutex_lock.c:115
#3 0x00007f50c1854bef in __GI___dl_iterate_phdr (callback=0x7f50c190c020 <_Unwind_IteratePhdrCallback>, data=0x7f501f7fb040) at dl-iteratephdr.c:40
#4 0x00007f50c190d2d1 in _Unwind_Find_FDE () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#5 0x00007f50c19099b3 in uw_frame_state_for () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#6 0x00007f50c190ab90 in uw_init_context_1 () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#7 0x00007f50c190b08e in _Unwind_RaiseException () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#8 0x00007f50c1b02ab7 in __cxa_throw () from /nix/store/dd8swlwhpdhn6bv219562vyxhi8278hs-gcc-10.3.0-lib/lib/libstdc++.so.6
#9 0x00007f50c1d01abe in nix::parseURL (url="root@cb893012.packethost.net") at src/libutil/url.cc:53
#10 0x0000000000484f55 in extraStoreArgs (machine="root@cb893012.packethost.net") at build-remote.cc:35
#11 operator() (__closure=0x7f4fe9fe0420) at build-remote.cc:79
...
Maybe the fork happened while another thread was holding the global
stack-unwinding lock
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744); since only the
forking thread survives in the child, that lock is then held forever,
and the child deadlocks as soon as it tries to raise an exception.
In any case, since the hanging child inherits all the file descriptors
to the SSH clients, shutting down remote builds (via 'child.to = -1' in
State::buildRemote()) doesn't work, and 'child.pid.wait()' hangs
forever.
So let's not do any significant work between fork and exec.
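For illustration, a minimal sketch of that pattern in Perl (the queue
runner itself is C++, and build_ssh_command is a hypothetical helper):
do everything fallible, such as parsing the machine address, before
fork(), so the child never needs to throw.

    use POSIX ();

    # All fallible work -- parsing the machine address, building the
    # command line -- happens *before* fork(), in the parent.
    my @ssh_cmd = build_ssh_command($machine);   # hypothetical helper

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        # Child: nothing between fork and the exec itself -- no
        # allocation, no exceptions, no locks another thread might hold.
        exec(@ssh_cmd) or POSIX::_exit(1);   # _exit skips END blocks/destructors
    }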
Re-executing this search_related query on every access turned out to
cause serious performance problems. If a jobset had a lot of
error output stored in its row, and there were many hundreds
or thousands of active jobs, this could easily generate >1 Gbps of
network traffic.
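A sketch of the shape of the fix, with hypothetical accessor and
relationship names (assuming a DBIx::Class row whose jobset accessor
previously re-ran search_related on each call): fetch the related row
once and memoize it.

    # Sketch (names hypothetical): memoize the related row instead of
    # re-running the search_related query -- and re-transferring the
    # jobset's stored error output -- on every access.
    sub jobset {
        my ($self) = @_;
        $self->{_cached_jobset} //= $self->search_related('jobset')->single;
        return $self->{_cached_jobset};
    }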
This is syntactically lighter weight, and demonstrates that there are
no weird dynamic lifetimes involved, just a regular reference passed to
the callee, which borrows it only for the duration of the call.
Periodically, I have seen tests fail because of out-of-order queue runner behavior:
checking the queue for builds > 0...
loading build 1 (tests:basic:empty_dir)
aborting unsupported build step '...-empty-dir.drv' (type 'x86_64-linux')
marking build 1 as failed
adding new machine ‘localhost’
This patch should prevent the dispatcher from running before any machines are
made available.
This shouldn't be possible under normal operation, but it is possible to run:
$db->resultset('RunCommandLogs')->new({ uuid => "../etc/passwd" });
if you have access to `$db`.
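One way to close that hole (a sketch; the variable names are
assumptions, not the actual patch): reject anything that isn't a
canonical UUID before the value gets anywhere near a filename.

    # Sketch: validate the uuid before it is used to build a path, so a
    # value like "../etc/passwd" can never escape the log directory.
    my $uuid = $runcommandlog->uuid;
    die "invalid RunCommandLog uuid: $uuid"
        unless $uuid =~ /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;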
Also split it out into a new div -- there are now three lines per
RunCommandLog: the first says when it started, the second how long it
ran for (or has been running so far), and the third holds the buttons
for the pretty, raw, and tail versions of the log.
This also adds the `runcommandlog` object to the stash so that we can
access its uuid, as well as the command that was run, in order to
display more useful and specific information on the webpage.
Using a sha1 of the command combined with the build ID is not a
particularly good or unique identifier (a safer scheme is sketched
after this list):
* A build could fail, be restarted, and then succeed -- assuming no
configuration changes, the sha1 hash of the command and the build ID
would both be the same, so the new log file would overwrite the old one.
* Allowing user input to influence filenames is not the best of ideas.
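A sketch of the safer scheme (assuming the UUID::Tiny module; the
actual module and path layout may differ): generate a fresh v4 UUID per
RunCommand invocation, so restarted builds get distinct files and no
user-controlled bytes reach the filename.

    use UUID::Tiny ':std';

    # A fresh v4 UUID is unique per invocation and contains no
    # user-controlled input, unlike sha1($command) . $build_id.
    my $uuid = create_uuid_as_string(UUID_V4);
    my $log_path = "$log_dir/$uuid";   # $log_dir is hypothetical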
Since the filename construction was broken out into a helper function,
Hydra::Model::DB is no longer used. Importing Hydra::Helper::Nix,
however, has the potential to break tests, so just call the functions
we need without importing the entire module.
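For example (helper name hypothetical), a runtime require plus fully
qualified calls avoids pulling the module's whole export list into the
test's namespace:

    # Sketch: no `use Hydra::Helper::Nix;`, which would import every
    # exported symbol; just require it and call what we need.
    require Hydra::Helper::Nix;
    my $config = Hydra::Helper::Nix::getHydraConfig();   # hypothetical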