We've seen many fails on ofborg, at lot of them ultimately appear to come down to
a timeout being hit, resulting in something like this:
Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif.
Hopefully this resolves it for most cases.
I've done some endurance testing and this helps a lot.
some other commands also regularly time-out with high load:
- hydra-init
- hydra-create-user
- nix-store --delete
This should address most issues with tests randomly failing.
Used the following script for endurance testing:
```
import os
import subprocess
run_counter = 0
fail_counter = 0
while True:
try:
run_counter += 1
print(f"Starting run {run_counter}")
env = os.environ
env["YATH_JOB_COUNT"] = "20"
result = subprocess.run(["perl", "t/test.pl"], env=env)
if (result.returncode != 0):
fail_counter += 1
print(f"Finish run {run_counter}, total fail count: {fail_counter}")
except KeyboardInterrupt:
print(f"Finished {run_counter} runs with {fail_counter} fails")
break
```
In case someone else wants to do it on their system :).
Note that YATH_JOB_COUNT may need to be changed loosely based on your
cores.
I only have 4 cores (8 threads), so for others higher numbers might
yield better results in hashing out unstable tests.
The feature cannot easily be ported to nix-eval-jobs since it requires
deep integration into the evaluator, and h.n.o doesn't use it. Later
more of this will be ripped out.
This is implement in an extremely hacky way due to poor DBIx feature
support. Ideally, what we'd need is a way to tell DBIx to ignore the
errormsg column unless explicitly requested, and to automatically add a
computed 'errormsg IS NULL' column in others. Since it does not support
that, this commit instead hacks some support via method overrides while
taking care to not break anything obvious.
This verison has a worse UI, but also chnages the schema less: One
non-null constraint is removed, but no new columns are added.
Co-Authored-By: Andrea Ciceri <andrea.ciceri@autistici.org>
Co-Authored-By: regnat <rg@regnat.ovh>
At the moment, aggregate jobs can easily break and cause the entire
evaluation to fail, which is not ideal. For Nixpkgs, we do have some
important aggregate jobs (like `tested`), but for debugging and building
purposes it's still useful to get a partial result even if the channel
won't actually advance.
This commit changes the behaviour of hydra-eval-jobs such that it
aggregates any errors found during the construction of an aggregate, and
will instead annotate the job with the evaluation failure such that it
shows up in a "cleaner" way.
There are really two types of failure that we care about: one is where
the attribute just ends up missing altogether in the final output, and
also where the attribute is in the output but fails to evaluate. Both
are handled here.
Note that this does mean that the same error message may be output
multiple times, but this aids debuggability because it'll be much
clearer what's blocking the job from being created.