Store's haunted! Zeroed regions in mesa apple_dri.so in the Nix store on M1 machines #248

Closed
opened 2024-04-23 05:30:53 +00:00 by jade · 6 comments
Owner

@yu-re-ka and @xal-0 have reported finding regions of zero bytes in files, specifically mesa on asahi nixos.

I am unsure if this is just because it is something substantive that's getting built and asahi has haunted local builds, or if mesa is special in some way.

This was a local build of the mesa drivers, not from cache, and it happened on two machines, which significantly reduces the likelihood of it being the machine needing an exorcism.

[yuka@m1:~]$ for i in $(echo /nix/store/*-mesa*-drivers/lib/dri/apple_dri.so); do echo $i; cat $i | xz -9 | wc -c; done
/nix/store/02j1xmf4bxd9mbm9ib0p77564alsr9hi-mesa-24.1.0-drivers/lib/dri/apple_dri.so
1070228
/nix/store/3q9ii210553r7y2jkmjgwglr0pzn2186-mesa-24.1.0-drivers/lib/dri/apple_dri.so
3177052
/nix/store/a89cgxahkgk5i1vkj72h999mcw2v080p-mesa-24.1.0-drivers/lib/dri/apple_dri.so
3178056
/nix/store/h6cgvk41w6my2lqpib5vlcznbmv56k7h-mesa-24.1.0-drivers/lib/dri/apple_dri.so
3176360
/nix/store/iwhmigsxz5x0zzzfrhpm35dljipm6l4w-mesa-24.1.0-drivers/lib/dri/apple_dri.so
730908
/nix/store/m8g9bfzvyvqkymf8wd9y8rzlqvdv4av5-mesa-24.1.0-drivers/lib/dri/apple_dri.so
3179944

xal is using zfs, unsure about yureka.

This may have happened before xal switched to lix, and I think yureka reported similar?

I tried to attach the xz'd binaries of the bad apple_dri files, but nvm, forgejo is blocking it; I need to fix that. I am honestly suspecting a kernel bug or asahi bug of some sort? No idea what the heck is going on.
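
The loop above uses the xz-compressed size as a proxy: a file with large zeroed regions compresses much better than its intact siblings. A more direct check is to count NUL bytes per file. The following is only a sketch, not something anyone in this thread ran; the helper name `count_zero_bytes` is made up for illustration, and it relies on nothing beyond `wc` and `tr`.

```shell
# Hypothetical helper (not from the thread): print how many NUL bytes a file contains.
count_zero_bytes() {
    total=$(wc -c < "$1")
    nonzero=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - nonzero))
}

# Print "NUL-count path" for every candidate driver found in the store.
for f in /nix/store/*-mesa*-drivers/lib/dri/apple_dri.so; do
    [ -e "$f" ] || continue   # skip the unexpanded glob if nothing matches
    printf '%s %s\n' "$(count_zero_bytes "$f")" "$f"
done
```

Every ELF file contains plenty of legitimate zero bytes, so the absolute number means little; what would be suspicious is one copy whose count is far higher than the others.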

jade added the bug label 2024-04-23 05:30:53 +00:00
Author
Owner

I have a fairly high degree of confidence that this is not our fault at the Lix project, because it is aligned to a linker section.

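bad: ![image](/attachments/1bc9a0e1-9ddb-40b0-ae75-bdad8ae8b914)

good: ![image](/attachments/b18a8ada-5c85-4365-9ffd-5665a9465a3a)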

jade changed title from Store's haunted! Zeroed regions in files in the Nix store on M1 machines to Store's haunted! Zeroed regions in mesa `apple_dri.so` in the Nix store on M1 machines 2024-04-23 06:25:41 +00:00
Member

for reference I'm using xfs

Author
Owner

copy pasting things yuyu wrote on matrix:

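running version: https://github.com/yu-re-ka/nixos-m1/commit/11cc6b0b261c28b93b28a06da72f1ecaadce3705
building version: https://github.com/yu-re-ka/nixos-m1/commit/05c3a4fe02f5b0f281313f943e1b6efc8e24299a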

we believe that this bug happens only under load. doing something like stress -c 5 --vm 5 --io 5 in the background causes it to repro much more consistently

jade did something like the following, which did find an unrelated build determinism problem after a while on a non-asahi aarch64-linux box:

while nix build -L .#mesa-asahi-edge --rebuild --keep-failed ; do echo 'built another'; done
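
A slightly more reusable form of that loop, as a sketch: the wrapper below reruns any command until it fails and reports how many runs succeeded before that. The name `repeat_until_fail` is hypothetical; the `nix build` and `stress` invocations in the comments are the ones quoted in this comment, nothing more.

```shell
# Hypothetical wrapper: rerun a command until it fails,
# then print how many runs succeeded before the failure.
repeat_until_fail() {
    n=0
    while "$@"; do
        n=$((n + 1))
    done
    echo "$n"
}

# Possible use, with load generated in the background as described above:
#   stress -c 5 --vm 5 --io 5 &
#   repeat_until_fail nix build -L .#mesa-asahi-edge --rebuild --keep-failed
#   kill %1
```

With `--rebuild --keep-failed`, the run that stops the loop is the one where the rebuilt output differed, so the kept `.check` output is what to inspect afterwards.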

jade's nondeterminism bug: https://jade.fyi/nix-haunting/mesa-nonrepro-ampere.tar.gz. both built on the same machine, and the nondeterministic one seems to really piss off objdump on my machine such that it takes minutes, unlike the normal one that merely takes 17 seconds: /nix/store/n1c7659yp1mic81k6zm2qk0nach9xynr-binutils-2.41/bin/objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text nix/store/h6cgvk41w6my2lqpib5vlcznbmv56k7h-mesa-24.1.0-drivers.check/lib/dri/apple_dri.so > /dev/null

jade's nondeterminism bug appears unrelated to the missing sections issue.


remaining tasks before we can take this to someone upstream and be like "hi look at this nonsense": catch a build directory of the bad build in the act and compare it to a good build, then figure out if it was the linker. if it is the linker, we need to get the link command line such that we can spam it in a loop and make reproing it faster. if it is the compiler, we need to find an object file where the problem happened and get the compile cmdline.
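
For the "compare it to a good build" step, a byte-level summary of where two files differ can narrow things down quickly. This is a sketch under the assumption that a good and a bad copy of the same file are both at hand; `diff_range` is a made-up name, and it only uses `cmp -l`, which lists differing bytes with 1-based offsets.

```shell
# Hypothetical helper: summarise where two files differ.
# Prints "count first last" (1-based byte offsets), or "0" if no bytes differ.
# Note: if one file is shorter, only the common prefix is compared.
diff_range() {
    { cmp -l "$1" "$2" 2>/dev/null || true; } |
        awk 'NR == 1 { first = $1 }
             { last = $1 }
             END { if (NR) print NR, first, last; else print 0 }'
}

# Possible use (paths are placeholders):
#   diff_range good/lib/dri/apple_dri.so bad/lib/dri/apple_dri.so
```

If the reported range lines up with a section boundary from something like `readelf -S`, that would be consistent with the section-aligned damage described earlier in the thread.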

qyriad added the Area/store label 2024-05-06 00:54:00 +00:00
Member

This has been confirmed to happen on CppNix as well.

Opened an issue in nixos-apple-silicon: https://github.com/tpwrules/nixos-apple-silicon/issues/199


I ran into this as well, without running Lix anywhere (yet) - nix (Nix) 2.18.2

13:14 <flokli> tpw_rules: yuka: hmh, did a nixos update to latest master, and the graphical session doesn't come up. gdm crashes X, eglinfo shows SIGILL in loader_bind_extensions.
13:14 <flokli> running sway as root works
[…]
13:15 <yuka> quick check: "for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done"
13:15 <yuka> if any of the paths has a significantly lower entropy than the others, your store is haunted
13:15 <yuka> if this is the case it would be tremendously useful because it means this issue is not lix specific
13:16 <flokli> for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done 
13:16 <flokli> 3175560
13:16 <flokli> 3039308
13:16 <flokli> 3176884
13:16 <flokli> 2513936
13:16 <yuka> yeah that looks suspicious
13:16 <yuka> let me guess, the one with 2.5M is the one referenced by your current system?
13:17 <yuka> (add a "echo $i" in the loop to find out which one it is)
13:17 <flokli> yes

I cannot confirm 100% whether it got built on this machine or on another aarch64 machine, as I have a bunch of remote builders configured, but it definitely doesn't seem Lix-specific.

Owner

As posted on the nixos-apple-silicon repo (https://github.com/tpwrules/nixos-apple-silicon/issues/199#issuecomment-2111036503), I'm quite confident this is not a Nix/Lix issue, but rather a Nixpkgs issue:

so, my current working theory is that this is not patchelf, but a repeat of a previous issue; tho i'm not entirely sure why it doesn't affect x86_64, as it should have the same bug.

Since NixOS/nixpkgs#207101, strip is parallelised. This already turned out to be a problem, as strip running multiple processes simultaneously on the same file has caused issues on aarch64 before; see NixOS/nixpkgs#246147. As I was trying to debug this, I built the same store path on a Hetzner aarch64 VPS, as well as locally using qemu-user. What I noticed was that the strtab of the natively-built mesa's dri was of size 8, rather than the size 0x165199 it had when built under qemu-user. The rest of the ELF was identical.

After a bit more digging, it turns out that mesa's dri driver is hard-linked to every single {foo}_dri.so path. This is eventually deduplicated, but this is done after running strip. Which means that, in effect, strip was being run by several processes over the same file, once again repeating issue 246147. In my case, this showed up as the strtab being truncated, but I could imagine this showing up differently for other people (e.g. a part of a section missing, but always section-aligned, because that's how it's written by binutils).

This issue is exacerbated by the fact that strip errors aren't printed as long as at least one file has been successfully stripped, hiding the myriad of {foo}_dri.so[.eh_frame]: invalid operation and similar errors.

I believe that moving the symlink deduping logic from postFixup to preFixup is likely to solve this issue; but as I don't have a real aarch64 device to test with, I leave the implementation of this suggestion to others :)
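
To check whether a given output is exposed to this, one can look for files that share an inode, since every extra path to the same inode is another chance for a parallel strip to touch the same data. This is only a diagnostic sketch, not the Nixpkgs fix proposed above; `hardlinked_files` is a hypothetical name, and it relies on nothing beyond POSIX `find -links`.

```shell
# Hypothetical helper: list regular files under a directory
# that have more than one hard link (i.e. share an inode with another path).
hardlinked_files() {
    find "$1" -type f -links +1 | sort
}

# Possible use against an unpacked or kept build output (path is a placeholder):
#   hardlinked_files ./result/lib/dri
```

If this prints many `{foo}_dri.so` paths for a build tree inspected before fixup, that matches the situation described above, where strip is handed the same underlying file under several names.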

jade added the Status/invalid label 2024-05-14 20:17:58 +00:00
jade closed this issue 2024-05-14 20:18:00 +00:00
Reference: lix-project/lix#248