Store's haunted! Zeroed regions in mesa apple_dri.so
in the Nix store on M1 machines #248
Reference: lix-project/lix#248
@yu-re-ka and @xal-0 have reported finding regions of zero bytes in files, specifically mesa on asahi nixos.
I am unsure if this is just because it is something substantive that's getting built and asahi has haunted local builds, or if mesa is special in some way.
This was a local build of the mesa drivers, not from cache, and it happened on two machines which significantly reduces the likelihood of it being the machine needing an exorcism.
xal is using zfs, unsure about yureka.
This may have happened before xal switched to lix, and I think yureka reported similar?
I've attached the xz'd binaries of the bad apple_dri files. nvm, forgejo is blocking it, i need to fix that, but I am honestly suspecting a kernel bug or asahi bug of some sort? No idea what the heck is going on.
I have a fairly high degree of confidence that this is not our fault at the Lix project, because it is aligned to a linker section.
bad:
good:
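A minimal sketch for spotting suspicious zeroed regions in a built file, assuming the corruption shows up as long runs of zero bytes. The function name `find_zero_runs` and the 4096-byte threshold are illustrative assumptions, not anything used in the thread; healthy ELF files also contain some zero padding, so the output is only a starting point for comparing a bad file against a good one.

```python
import re
import sys


def find_zero_runs(data: bytes, min_len: int = 4096) -> list[tuple[int, int]]:
    """Return (offset, length) for every run of at least min_len zero bytes."""
    # The threshold is an assumption: short zero runs are normal padding.
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    # Usage: python find_zero_runs.py path/to/apple_dri.so [more files...]
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            for offset, length in find_zero_runs(f.read()):
                print(f"{path}: {length} zero bytes at offset {offset:#x}")
```

Running this over both the bad and the good `apple_dri.so` and comparing the reported offsets against the section table would show whether the extra zero runs line up with section boundaries.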
Title changed from "Store's haunted! Zeroed regions in files in the Nix store on M1 machines" to "Store's haunted! Zeroed regions in mesa `apple_dri.so` in the Nix store on M1 machines"
for reference I'm using xfs
copy pasting things yuyu wrote on matrix:
running version:
11cc6b0b26
building version:
05c3a4fe02
we believe that this bug happens only under load. doing something like
stress -c 5 --vm 5 --io 5
in the background causes it to repro much more consistently
jade did something like the following, which did find an unrelated build determinism problem after a while on a non-asahi aarch64-linux box:
jade's nondeterminism bug: https://jade.fyi/nix-haunting/mesa-nonrepro-ampere.tar.gz. both built on the same machine, and the nondeterministic one seems to really piss off objdump on my machine such that it takes minutes, unlike the normal one that merely takes 17 seconds:
/nix/store/n1c7659yp1mic81k6zm2qk0nach9xynr-binutils-2.41/bin/objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text nix/store/h6cgvk41w6my2lqpib5vlcznbmv56k7h-mesa-24.1.0-drivers.check/lib/dri/apple_dri.so > /dev/null
jade's nondeterminism bug appears unrelated to the missing sections issue.
the remaining task before we can take this to someone upstream and be like "hi look at this nonsense" is to catch a build directory of the bad build in the act and compare it to a good build, then figure out if it was the linker. if it is the linker, we need to get the link command line so that we can spam it in a loop and make reproing it faster. if it is the compiler, we need to find an object file where the problem happened and get the compile cmdline.
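A rough sketch of the "spam it in a loop" step, assuming the suspect command (link or compile) has already been captured and writes a single output file. The helper name `hash_repeated_runs` and its parameters are hypothetical; the real command line is not known from this thread and would have to be pulled out of a kept build directory.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path


def hash_repeated_runs(cmd: list[str], output: Path, runs: int = 20) -> Counter:
    """Run cmd repeatedly and count the distinct SHA-256 digests of its output."""
    digests: Counter = Counter()
    for _ in range(runs):
        # Remove the previous result so a stale file is never hashed by accident.
        output.unlink(missing_ok=True)
        subprocess.run(cmd, check=True)
        digests[hashlib.sha256(output.read_bytes()).hexdigest()] += 1
    return digests
```

More than one key in the returned counter means the command produced different bytes across runs. Running it with `stress` active in the background, as described above, would be the natural way to try to provoke the bad output.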
This has been confirmed to happen on CppNix as well.
Opened an issue in nixos-apple-silicon: https://github.com/tpwrules/nixos-apple-silicon/issues/199
I ran into this as well, without running Lix anywhere (yet) -
nix (Nix) 2.18.2
I cannot confirm 100% that it did indeed get built on this machine rather than on another aarch64 machine, as I have a bunch of remote builders configured, but it definitely doesn't seem Lix-specific.
As posted on the `nixos-apple-silicon` repo, I'm quite confident this is not a Nix/Lix issue, but rather a Nixpkgs issue:
so, my current working theory is that this is not patchelf, but a repeat of a previous issue; tho i'm not entirely sure why it doesn't affect x86_64, as it should have the same bug.
Since NixOS/nixpkgs#207101, `strip` is parallelised. This already turned out to be a problem, as `strip` running multiple processes simultaneously on the same file has caused issues on aarch64 before; see NixOS/nixpkgs#246147. As I was trying to debug this, I built the same store path on a Hetzner aarch64 VPS, as well as locally using qemu-user. What I noticed was that the `strtab` of the natively-built mesa's `dri` was a size `8`, rather than a size `0x165199` when built under qemu-user. The rest of the ELF was identical. After a bit more digging, it turns out that mesa's `dri` driver is hard-linked to every single `{foo}_dri.so` path. This is eventually deduplicated, but this is done after running strip. Which means that, in effect, it was running strip over the same file, once again, repeating issue 246147. In my case, this showed up as the `strtab` being truncated, but I could imagine this showing up differently for other people (e.g. a part of a section missing, but always section-aligned, because that's how it's written by binutils).
This issue is exacerbated by the fact that `strip` errors aren't printed as long as at least one file has been successfully stripped, hiding the myriad of `{foo}_dri.so[.eh_frame]: invalid operation` and similar errors.
I believe that moving the symlink deduping logic from `postFixup` to `preFixup` is likely to solve this issue; but as I don't have a real aarch64 device to test with, I leave the implementation of this suggestion to others :)
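To check the hard-link theory on an output directory, a small sketch along these lines could list which paths share an inode, i.e. which `{foo}_dri.so` names would be hit by concurrent `strip` processes as the same underlying file. The function name `group_hardlinks` is made up for illustration and this is not how nixpkgs itself handles it.

```python
import os
from collections import defaultdict


def group_hardlinks(paths: list[str]) -> list[list[str]]:
    """Group paths that refer to the same inode; only groups of 2+ are returned."""
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path in paths:
        st = os.lstat(path)  # lstat so symlinks are not followed
        groups[(st.st_dev, st.st_ino)].append(path)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Any group this returns is a set of names for one file, so stripping each group only once (or deduplicating before the strip phase, as suggested above) would avoid several processes rewriting the same inode at the same time.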