RFD: What to do about NUL bytes in the short term? #963
Reference: lix-project/lix#963
This follows on from cl/3921 and cl/3968.

So, first, things I think everyone agrees on:
The behaviour of Nix‐language strings is currently bad for both text and binary data processing. String functions corrupt UTF‐8 data, and most binary data can’t be represented at all.
The current behaviour when NUL bytes leak through is just a bug and essentially “UB”; there is little point to preserving it, and we should fix it one way or another ASAP. See https://gerrit.lix.systems/c/lix/+/3921?tab=comments for some horrible examples.
“Arbitrary binary data except NUL bytes” C strings aren’t a type that makes sense, and they’re a bad runtime storage format too.
Nix’s use of a string type without a defined encoding runs headlong into the “Makefile problem”, where there is no general solution to representing binary strings that are usually but not always text in a known encoding inside textual settings like user interface input/output and file formats.
For instance, formats like JSON, TOML, and XML are explicitly Unicode text. In the case of XML, NUL bytes are forbidden even when escaped. There is no way to represent arbitrary binary data in these formats without an additional encoding layer like Base64, and it is easy to produce crashes or corrupt output currently.
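The mismatch between byte strings and Unicode-only formats can be sketched outside of Nix. The following is a minimal Python illustration, not Lix code; the helper names are hypothetical. It shows that slicing UTF‐8 by byte offsets can produce bytes that a strict JSON encoder has to reject, and that arbitrary bytes only survive a JSON round trip with an extra layer such as Base64.

```python
import base64
import json


def byte_substring(text: str, start: int, length: int) -> bytes:
    """Slice the UTF-8 encoding of `text` by byte offsets, not by code points."""
    return text.encode("utf-8")[start:start + length]


def to_json_strict(raw: bytes) -> str:
    """Emit raw bytes as a JSON string; raises UnicodeDecodeError on invalid UTF-8."""
    return json.dumps(raw.decode("utf-8"))


def to_json_base64(raw: bytes) -> str:
    """Emit arbitrary bytes as JSON by adding a Base64 layer."""
    return json.dumps(base64.b64encode(raw).decode("ascii"))


def from_json_base64(document: str) -> bytes:
    """Inverse of to_json_base64."""
    return base64.b64decode(json.loads(document))
```

Here `byte_substring("🫠", 0, 1)` yields only the first byte of a four-byte sequence, so `to_json_strict` raises on it, while the Base64 pair round-trips any input, including NUL bytes.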
For instance, `builtins.toJSON (builtins.substring 0 1 "🫠")` crashes the interpreter with an `nlohmann_json` error currently, and `nix eval --expr (printf 'builtins.toXML "\xc3\x28"')` thinks that the contents of the string is `?(`. At some point we are going to have to decide between being in wilful violation of standards forever, corrupting data, or forbidding some calls that currently “succeed”.

There are two potential immediate next steps:
Forbid NUL bytes comprehensively in Nix‐language strings. The current version of my CL does not do an ideal job of this, but I have a WIP that forbids direct construction from C strings and requires either string literals or explicit C++ string views, which would allow robust enforcement of this at the boundary. This is the direction I favour, and I can finish off that WIP if it’s agreed on.
Allow NUL bytes across the board and fix their behaviour; declare that Nix‐language strings can work with arbitrary binary data. This is the direction @pennae favours.
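As a rough model of the two options, here is a Python sketch of an explicit-length string value that is checked once, at construction. `NixString`, `NulByteError`, and the `allow_nul` switch are hypothetical names for illustration only and do not correspond to Lix's actual C++ types or to the WIP mentioned above.

```python
class NulByteError(ValueError):
    """Raised when a string value would contain a NUL byte."""


class NixString:
    """Hypothetical explicit-length string value (not Lix's actual type)."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes, *, allow_nul: bool = False):
        # First option (allow_nul=False): reject at the construction boundary,
        # so later consumers can rely on the invariant without re-checking.
        # Second option (allow_nul=True): accept any byte sequence.
        if not allow_nul:
            offset = data.find(b"\x00")
            if offset != -1:
                raise NulByteError(f"NUL byte at offset {offset}")
        self._data = bytes(data)

    def __len__(self) -> int:
        # Length is stored explicitly; it never depends on a terminator.
        return len(self._data)

    def __bytes__(self) -> bytes:
        return self._data
```

Either way the length is carried with the value, so the choice between the options reduces to whether the constructor check exists.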
Neither of these directions produces a good long‐term end state. For instance, downsides of my favoured approach:
Binary data processing is legitimately useful and incredibly annoying to do currently in Nix. My change does nothing to address this.
There are already no guarantees around string encoding in Nix and there are expressions in the wild that construct strings that are invalid UTF‐8.
Downsides of @pennae’s approach:
Path names on Unix are C strings, and cannot contain NUL bytes. If we forbid NUL bytes in strings, then there are no issues here. Otherwise we have to think about rejecting and converting these at various boundaries. We’ll have to think about this kind of boundary issue regardless in future, because of the encoding issues and serialization formats (and of course, a proper UTF‐8 clean string type would allow representing the U+0000 codepoint too), but in the immediate term, NUL bytes have their own unique interoperability issues that the other 255 bytes don’t, due to C’s legacy.
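The C string legacy described here can be observed from Python, used below purely as an illustration. `as_c_string` and `stat_rejects_nul` are hypothetical helper names; `ctypes.create_string_buffer` and `os.stat` are standard library calls. The sketch shows a C-style reader silently truncating at the first NUL, and a path API refusing an embedded NUL outright.

```python
import ctypes
import os


def as_c_string(data: bytes) -> bytes:
    """Return what code reading a NUL-terminated C string would see."""
    # .value stops at the first NUL; .raw would return the whole buffer.
    return ctypes.create_string_buffer(data).value


def stat_rejects_nul(name: str) -> bool:
    """Report whether os.stat refuses `name` because of an embedded NUL."""
    try:
        os.stat(name)
    except ValueError:
        # CPython raises ValueError before making any system call.
        return True
    except OSError:
        # The name was passed to the OS and failed for another reason.
        return False
    return False
```

The first function models silent data loss, the second models rejection at the boundary; the thread is about which of these behaviours, if any, Nix strings should be exposed to.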
In my opinion, comprehensively forbidding NUL bytes in the Nix‐language string type makes potential language evolution in this space much easier. If `open("/dev/urandom").read(1024)` already failed in Python 2, then use of the string type to process binary data or non‐NUL‐safe text encodings would have been much rarer, and I think the Python 3 string transition would have been a lot easier. The “arbitrary binary bytes except for NUL” C string legacy isn’t a good or useful type, but it’s a type that strongly discourages actually doing binary data processing. Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard.

It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. `builtins.substring` would need to handle binary data forever, so I guess we’d need to add new Unicode‐aware built‐ins and accept that interoperating with code that hasn’t been updated to use those built‐ins will corrupt UTF‐8. This seems suboptimal to me, because basically every string that actually gets used in practice with Nix today is UTF‐8.

For the purpose of determining the available scope for future language evolution, I think it would be good to do an evaluation of the Nixpkgs release job set with a mode that enforces that Nix‐language strings are UTF‐8 (e.g., turns `builtins.substring 0 1 "🫠"` into an error). I would be interested in trying this out if people think it would be compelling data.

If that works, then I believe it’s possible that judicious use of `deprecated-features` and language versioning can get us to a state where we have a UTF‐8 string type with correct Unicode behaviour from string functions, and a new, separate binary type for processing of arbitrary data. This may not be an easy transition, and adding the binary type would take a lot of language design work, but I think that would be a better final state than Nix‐language strings playing double duty, and if Nixpkgs would Just Work then I think we could close off non‐Unicode strings by way of `deprecated-features` without too much pain.

On the other hand, if it doesn’t work, then that constrains our future evolution, and it’s more likely the case that we’ll have to accept that Nix‐language strings are binary data with ambiguous encoding forever.
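The UTF‐8 enforcing evaluation mode could behave roughly like the Python sketch below. This is an assumption about what such a mode might check, not a description of any existing Lix option; `is_utf8`, `checked_substring`, and the `enforce_utf8` switch are hypothetical.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def checked_substring(data: bytes, start: int, length: int, *, enforce_utf8: bool) -> bytes:
    """Byte-offset substring that can refuse to return invalid UTF-8."""
    result = data[start:start + length]
    if enforce_utf8 and not is_utf8(result):
        raise ValueError("substring result is not valid UTF-8")
    return result
```

Running an existing job set with the check switched on would then surface every expression that currently relies on producing non‐UTF‐8 strings.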
However, I don’t think we should block resolving this immediate decision on a definite long‐term design for how Nix‐language strings should work, and I think it would be bad if bikeshedding on that blocked resolving some of the super‐broken behaviour that Nix’s change has already ruled out, one way or the other. In the end, the biggest reason that forbidding NUL bytes across the board is my preferred solution is that it only forbids things that act in “UB” ways currently, and doesn’t stabilize any behaviour that didn’t already work. That means that even if the ideal path forward isn’t totally obvious, it’s perfectly safe to go with my approach and then later decide to allow arbitrary binary data including NUL bytes across the board, but the other way around has significant compatibility hazards.
I’ve tried to do my best to summarize both sides here but of course I’m unavoidably biased and welcome other comments.
cc @pennae @jade @piegames @alois31 @puck
(Also, to be clear, I definitely favour switching to an explicit‐length string representation even if we ban NUL bytes.)
a few points:
totally true! however, forbidding nul bytes in paths is just as easy as forbidding them in strings and makes much more sense because paths actually do have no-nul semantics baked into them. we already have awful validation at the interfaces and have to fix that anyway, because for example this can happen:
forbidding nul bytes now but adding a true binary type later means we have an eval type in the new version with by definition no corresponding type in the legacy version, unless we change the semantics again to allow nul bytes after all (in which case, why delay it to begin with). interop from old to new must treat old strings as new blobs without conversion either explicit through the user or implicit at builtins like a version-aware substring. forbidding nul specifically does absolutely nothing here because we already have all the evolution hazards you want to exclude
allowing nuls is a semantic change, but it doesn't break anything now or going forward that isn't already completely fucked and needs fixing anyway. if anything it gives us a head start on that process rather than delaying it until much later.
From my PoV (I just came back from holidays), I'm in favor of an approach that preserves the ability to write binary data, which is done by some folks in the ecosystem (even if it may start as satire and end up being a load-bearing component for kicking off bootstrapping or similar). So applying extra hidden constraints seems like a hidden contract breakage to me; culturally, for me, Lix is about absolutely avoiding these situations even if it costs the developer team more. Obviously, this has to be a measured risk.
To me, we are already in between Python 2 and Python 3, so I don't register this as a valid counterargument; things like Tarnix already exist, and they do not even rely on NUL-byte handling at all.
Playing Whac-A-Mole with primitives that can end up writing binary data seems a fruitless fight for the time being; conversely, allowing something natural like `\0` doesn't seem like it will add MANY risks to the current ecosystem of writing binary data using Nix.

That's what language versioning and nix2 intend to solve. I don't believe it's reasonable to fix everything in the current iteration of the Nix language. Some things will remain deliberately under-specified or knowingly broken because there's only so much we can afford to take care of with the policy of 'quasi'-semantic stability we pursue in Lix.
In general, I wish we had more time to flesh out https://github.com/piegamesde/flaker and conduct large scale analysis across the ecosystem with various features. This work is still very much in my mind and I would like to construct the infrastructure to run flaker (aka the Nix crater) runs.
Language versioning being a dependency of something means that this will take a non-trivial amount of time as we do not have enough folks to lead this compared to e.g. RPC. (and the way I see it is that proper RPC synergizes to enable easier langver.)
I get the idea behind the recommendation. Again, from my PoV:
Either way, it feels like this case requires more people to chime in, and we should probably set a deadline for the core team to tie-break it if we cannot converge on a consensus in a reasonable timeline.
Mainly because of the data processing use case that others have already argued better above, I am also moderately in favour of allowing nulls, at least in the long term. Other than the possibility of more exposure to bugs (more below), the main concern I can see is interoperability hazards: it is not really possible for user code to feature-detect proper null handling, so either it can't rely on it, or it will break on older Lix or on the other side of the fork (which does seem to move in the direction of more thoroughly forbidding nulls).
Now on the bugs, the type system of the language being not so rich with only one string type (which I do not see changing any time soon), there are conflicting requirements on it:
`__structuredAttrs` is enabled. If it's off they are basically environment variables (theoretically platform-dependent, but probably arbitrary bytes except null everywhere), if it's on they are JSON (so valid Unicode only) but often passed around as environment variables too.

Due to these expectations I think the best way forward is as follows:
`stringLength` and `substring`) should continue working on bytes. This is mostly orthogonal to the null discussion and included for completeness.
`match` and `split`) should gain proper Unicode support, and ideally throw on non-UTF-8 input. Optionally binary can be supported in addition (maybe with a `(?-u)` flag like the Rust crate). This is mostly orthogonal to the null discussion and only included for completeness.

If we do end up forbidding nulls on the other hand, I think they should be forbidden near where the actual string creation happens, instead of playing whack-a-mole.
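The split between byte-oriented length functions and Unicode-aware regex functions can be illustrated with Python's `re` module, which selects text or byte semantics from the pattern's type rather than from a `(?-u)` flag. This is only an analogy for the proposal above; the helper names are hypothetical and none of this is Lix behaviour.

```python
import re
from typing import Optional


def string_length(raw: bytes) -> int:
    """Byte-oriented length: counts bytes, not code points."""
    return len(raw)


def match_unicode(pattern: str, raw: bytes) -> Optional[str]:
    """Unicode-aware match; raises UnicodeDecodeError on non-UTF-8 input."""
    found = re.match(pattern, raw.decode("utf-8"))
    return found.group(0) if found else None


def match_bytes(pattern: bytes, raw: bytes) -> Optional[bytes]:
    """Opt-in binary match: `.` consumes a single byte."""
    found = re.match(pattern, raw)
    return found.group(0) if found else None
```

With input `"🫠x"` encoded as UTF-8, the pattern `.` matches the whole emoji in the Unicode-aware function but only its first byte in the binary one, which is the difference the regex functions would have to make explicit.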
Note that Flaker currently is a parser-only framework; while it would be nice to be able to do eval diffing, that capability is currently nonexistent.
Thinking long-term (i.e. with langver), there are only two realistic options for handling strings
`to_ascii_lowercase` functions in Rust)

Both will require significant refactoring of builtins. Without having thought things through in more detail, my gut feeling tends towards the second option.
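For readers unfamiliar with the Rust reference, the idea of ASCII-only operations on byte strings can be sketched in Python, assuming that is what the comparison is pointing at. `bytes.lower()` only touches ASCII letters, whereas `str.lower()` applies Unicode case mapping; the wrapper names below are hypothetical.

```python
def ascii_lowercase(raw: bytes) -> bytes:
    """Lowercase ASCII letters only; every other byte passes through unchanged."""
    return raw.lower()


def unicode_lowercase(text: str) -> str:
    """Lowercase using Unicode case mapping; requires decoded text."""
    return text.lower()
```

The byte version is safe on arbitrary binary data and never breaks a multi-byte UTF-8 sequence, but it leaves a character like `É` untouched, which the text version converts.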
As for the current decision, I see a consensus forming around allowing NUL bytes. In terms of interop worries with origNix, now that they've implemented checks to forbid NUL bytes we are safe to allow them; at no point in time are we at risk of one Nix implementation evaluating to one value while the other gives a different value anymore (modulo divergence, which is explicitly fine in my eyes). (We still do have that risk w.r.t. older origNix and Lix versions, but I'm not too worried about that.)