Post run diagnostics #39
Description
Sometimes Nix releases cause widespread breakage that is hard to identify, for example the recent Nix 2.18.0 release. To provide the best experience we can, we would like to know if a Nix bump is suddenly breaking users' CI so we can roll it back.
This extends our action to send a post-workflow-run diagnostics report containing a single new field with one of three values: failure, cancelled, or success. It also sets the "attribution" property of the installer's diagnostic report to a random UUID, allowing us to correlate the install-time diagnostic with the subsequent post-run diagnostic. This correlation is necessary to connect the status to the version of the installer and Nix that was installed, and to the other diagnostic data in that original capture. It doesn't give us any new insight into who our users are or their behavior, nor does it offer anything identifiable.
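A minimal sketch of how such a report might be assembled. This assumes a TypeScript action; the names (`buildPayload`, `DiagnosticsPayload`) and the endpoint comment are illustrative, not the action's real API:

```typescript
// Illustrative sketch only: type and function names are assumptions,
// not the actual implementation in this action.
import { randomUUID } from "node:crypto";

type Conclusion = "success" | "failure" | "cancelled";

interface DiagnosticsPayload {
  // Random UUID shared with the install-time diagnostic report,
  // so the two reports can be correlated.
  attribution: string;
  // The only new datum the post-run report adds.
  conclusion: Conclusion;
}

// Generated once at install time and reused by the post-run hook.
const attribution = randomUUID();

function buildPayload(conclusion: Conclusion): DiagnosticsPayload {
  return { attribution, conclusion };
}

// The post-run step would then POST this to the diagnostics endpoint,
// e.g.: await fetch(url, { method: "POST", body: JSON.stringify(buildPayload("success")) });
```

Because `attribution` is captured once when the install step runs, every payload built afterward carries the same UUID.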
Checklist
@@ -4,3 +4,3 @@
Copyright (c) 2016 - 2020 Node Fetch Team
Copyright (c) 2016 David Frank
I don't know how the post-run whatever stuff works, but it feels to me like it would just... re-run this and thus get a new random UUID...? Am I wrong?
For readers: we've had a bit of discussion internally about whether this PR is a good idea. We don't all agree, but I've decided that we're going to try it. We discussed the privacy implications and whether the data will even be useful. In general, I agree it feels a little bit weird to be collecting overall workflow / job statuses. However, since we don't collect any data connecting the reports to a given organization or repository, it only provides data in aggregate. That's the point, though. I'm open to ideas about how this data could somehow be identifying; I just haven't figured any out.
Whether the data will be useful or not: I'm not sure. I think it will be, and I'll explain. The goal of the Determinate Nix Installer is to provide a working Nix. Not just the most recent release, or a version of Nix, but an installation that works the way the user expects. We already do some work here, like running a self-test after the installation completes. That's a great start, but it doesn't check much: it shows whether the bare minimum of the Nix installation succeeded, and it can't exercise enough of Nix's behavior to identify larger problems like the invalid store paths in Nix 2.18.0.
So, why record aggregate job conclusions? The way we roll out updates to the Nix installer is by gradually ramping new releases out, starting with something like 10-20% of GitHub Actions installs. (That's the purpose of the `?ci=github` argument in the download URL.) We prioritize initial rollout for GHA because the environment is highly likely to be ephemeral, and the user "cost" of a failure is smaller: they're one re-run away from a clean environment where they're unlikely to get the new version again. Compare that to starting with desktop users, who end up in a bad state and may have to uninstall and reinstall.

One reason the data may not be useful is if a particularly large user of our action has a bad day and racks up many, many failures. Or perhaps some related infrastructure breaks, causing job failures to rise. This might be true, and if the data isn't useful we will stop collecting it and delete it. However, we perform many thousands of installs every day on GitHub Actions, so we get quite a lot of signal. And, importantly, with the percentage-based rollout strategy, I believe heavy users will be roughly balanced between the release cohorts. The data is intended to be examined using comparative analysis: the % of runs per outcome on version A vs. the % of runs per outcome on version B. I think with the install frequency and the randomized distribution of A vs. B, we'll find reasonable signal in the results. However, again: if we don't, we'll get rid of it.
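The comparative analysis described above can be sketched as follows. The outcome counts here are made-up numbers for illustration, not real data:

```typescript
// Illustrative sketch of the A/B comparison; counts below are invented.
type Conclusion = "success" | "failure" | "cancelled";
type Counts = Record<Conclusion, number>;

// Convert raw outcome counts into per-outcome rates.
function outcomeRates(counts: Counts): Record<Conclusion, number> {
  const total = counts.success + counts.failure + counts.cancelled;
  return {
    success: counts.success / total,
    failure: counts.failure / total,
    cancelled: counts.cancelled / total,
  };
}

// Hypothetical daily aggregates for two installer releases:
const versionA: Counts = { success: 9200, failure: 500, cancelled: 300 };
const versionB: Counts = { success: 8400, failure: 1300, cancelled: 300 };

const a = outcomeRates(versionA);
const b = outcomeRates(versionB);

// A markedly higher failure rate on B would suggest rolling that release back.
console.log(`failure rate A: ${(a.failure * 100).toFixed(1)}%`);
console.log(`failure rate B: ${(b.failure * 100).toFixed(1)}%`);
```

Because heavy users are randomly distributed between the cohorts, a large difference in failure rate is more plausibly attributable to the release itself than to any single user's bad day.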
@@ -4,3 +4,3 @@
Copyright (c) 2016 - 2020 Node Fetch Team
Copyright (c) 2016 David Frank
This is automatically maintained by TypeScript, I think.
ohp lol