While sorting machines by load, the load of a machine
(machine->currentJobs) can be changed by other threads. If that
happens, the comparator is no longer a proper ordering, in which case
std::sort() can segfault. So we now make a copy of currentJobs before
sorting.
There is a slight possibility that the queue monitor and a builder
thread simultaneously decide to mark a build as finished. That's fine,
as long as we ensure the DB update is idempotent (as ensured by doing
"update Builds set finished = 1 ... where finished = 0").
If a build A depends on a derivation that is the top-level derivation
of some build B, then we should process B before A (meaning we
shouldn't make the derivation runnable before B has been
added). Otherwise, the derivation will be "accounted" to A rather than
B (so the build step will show up in the wrong build).
Aborted builds are now put back on the runnable queue and retried
after a certain time interval (currently 60 seconds for the first
retry, then tripled on each subsequent retry).
Hydra-queue-runner now no longer polls the queue periodically, but
instead sleeps until it receives a notification from PostgreSQL about
a change to the queue (build added, build cancelled or build
restarted).
Also, for the "build added" case, we now only check for builds with an
ID greater than the previous greatest ID. This is much more efficient
if the queue is large.
It just makes things unnecessarily complicated. We can just exit
without cleaning anything up, since the only thing to do is unmark
builds and build steps as busy. But we can do that by having systemd
call "hydra-queue-runner --unlock" from ExecStopPost.
If multiple threads create a step for the same build, they could get
the same "max(stepnr)" and allocate conflicting new step numbers. So
lock the BuildSteps table while doing this. We could use a different
isolation level, but this is easier.