RFD: What to do about NUL bytes in the short term? #963

New issue

emilazy · 2025-08-17T15:33:05Z

emilazy commented

2025-08-17 15:33:05 +00:00

This follows on from cl/3921 and cl/3968.

So, first, things I think everyone agrees on:

The behaviour of Nix‐language strings is currently bad for both text and binary data processing. String functions corrupt UTF‐8 data, and most binary data can’t be represented at all.
The current behaviour when NUL bytes leak through is just a bug and essentially “UB”; there is little point to preserving it, and we should fix it one way or another ASAP. See https://gerrit.lix.systems/c/lix/+/3921?tab=comments for some horrible examples.
“Arbitrary binary data except NUL bytes” C strings aren’t a type that makes sense, and they’re a bad runtime storage format too.
Nix’s use of a string type without a defined encoding runs headlong into the “Makefile problem”, where there is no general solution to representing binary strings that are usually but not always text in a known encoding inside textual settings like user interface input/output and file formats.

For instance, formats like JSON, TOML, and XML are explicitly Unicode text. In the case of XML, NUL bytes are forbidden even when escaped. There is no way to represent arbitrary binary data in these formats without an additional encoding layer like Base64, and it is easy to produce crashes or corrupt output currently.

For instance, builtins.toJSON (builtins.substring 0 1 "🫠") crashes the interpreter with an nlohmann_json error currently, and nix eval --expr (printf 'builtins.toXML "\xc3\x28"') thinks that the contents of the string is ?(. At some point we are going to have to decide between being in wilful violation of standards forever, corrupting data, or forbidding some calls that currently “succeed”.

There are two potential immediate next steps:

Forbid NUL bytes comprehensively in Nix‐language strings. The current version of my CL does not do an ideal job of this, but I have a WIP that forbids direct construction from C strings and requires either string literals or explicit C++ string views, which would allow robust enforcement of this at the boundary. This is the direction I favour, and I can finish off that WIP if it’s agreed on.
Allow NUL bytes across the board and fix their behaviour; declare that Nix‐language strings can work with arbitrary binary data. This is the direction @pennae favours.

Neither of these directions produces a good long‐term end state. For instance, downsides of my favoured approach:

Binary data processing is legitimately useful and incredibly annoying to do currently in Nix. My change does nothing to address this.
There are already no guarantees around string encoding in Nix and there are expressions in the wild that construct strings that are invalid UTF‐8.

Downsides of @pennae’s approach:

Path names on Unix are C strings, and cannot contain NUL bytes. If we forbid NUL bytes in strings, then there are no issues here. Otherwise we have to think about rejecting and converting these at various boundaries. We’ll have to think about this kind of boundary issue regardless in future, because of the encoding issues and serialization formats (and of course, a proper UTF‐8 clean string type would allow representing the U+0000 codepoint too), but in the immediate term, NUL bytes have their own unique interoperability issues that the other 255 bytes don’t, due to C’s legacy.
In my opinion, comprehensively forbidding NUL bytes in the Nix‐language string type makes potential language evolution in this space much easier. If open("/dev/urandom").read(1024) already failed in Python 2, then use of the string type to process binary data or non‐NUL‐safe text encodings would have been much rarer, and I think the Python 3 string transition would have been a lot easier. The “arbitrary binary bytes except for NUL” C string legacy isn’t a good or useful type, but it’s a type that strongly discourages actually doing binary data processing. Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard.
It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. builtins.substring would need to handle binary data forever, so I guess we’d need to add and accept that interoperating with code that hasn’t been updated to use those built‐ins will corrupt UTF‐8. This seems suboptimal to me, because basically every string that actually gets used in practice with Nix today is UTF‐8.

For the purpose of determining the available scope for future language evolution, I think it would be good to do an evaluation of the Nixpkgs release job set with a mode that enforces that Nix‐language strings are UTF‐8 (e.g., turns builtins.substring 0 1 "🫠" into an error). I would be interested in trying this out if people think it would be compelling data.

If that works, then I believe it’s possible that judicious use of deprecated-features and language versioning can get us to a state where we have a UTF‐8 string type with correct Unicode behaviour from string functions, and a new, separate binary type for processing of arbitrary data. This may not be an easy transition, and adding the binary type would take a lot of language design work, but I think that would be a better final state than Nix‐language strings playing double duty, and if Nixpkgs would Just Work then I think we could close off non‐Unicode strings by way of deprecated-features without too much pain.

On the other hand, if it doesn’t work, then that constrains our future evolution, and it’s more likely the case that we’ll have to accept that Nix‐language strings are binary data with ambiguous encoding forever.

However, I don’t think we should block resolving this immediate decision on a definite long‐term design for how Nix‐language strings should work, and I think it would be bad if bikeshedding on that blocked resolving some of the super‐broken behaviour that Nix’s change has already ruled out, one way or the other. In the end, the biggest reason that forbidding NUL bytes across the board is my preferred solution is that it only forbids things that act in “UB” ways currently, and doesn’t stabilize any behaviour that didn’t already work. That means that even if the ideal path forward isn’t totally obvious, it’s perfectly safe to go with my approach and then later decide to allow arbitrary binary data including NUL bytes across the board, but the other way around has significant compatibility hazards.

I’ve tried to do my best to summarize both sides here but of course I’m unavoidably biased and welcome other comments.

cc @pennae @jade @piegames @alois31 @puck

This follows on from cl/3921 and cl/3968. So, first, things I think everyone agrees on: * The behaviour of Nix‐language strings is currently bad for both text and binary data processing. String functions corrupt UTF‐8 data, and most binary data can’t be represented at all. * The current behaviour when NUL bytes leak through is just a bug and essentially “UB”; there is little point to preserving it, and we should fix it one way or another ASAP. See https://gerrit.lix.systems/c/lix/+/3921?tab=comments for some horrible examples. * “Arbitrary binary data except NUL bytes” C strings aren’t a type that makes sense, and they’re a bad runtime storage format too. * Nix’s use of a string type without a defined encoding runs headlong into the [“Makefile problem”](https://wiki.mercurial-scm.org/EncodingStrategy#The_.22makefile_problem.22), where there is no general solution to representing binary strings that are *usually* but not always text in a known encoding inside textual settings like user interface input/output and file formats. For instance, formats like JSON, TOML, and XML are explicitly Unicode text. In the case of XML, NUL bytes are forbidden even when escaped. There is no way to represent arbitrary binary data in these formats without an additional encoding layer like Base64, and it is easy to produce crashes or corrupt output currently. For instance, `builtins.toJSON (builtins.substring 0 1 "🫠")` crashes the interpreter with an `nlohmann_json` error currently, and `nix eval --expr (printf 'builtins.toXML "\xc3\x28"')` thinks that the contents of the string is `?(`. At some point we are going to have to decide between being in wilful violation of standards forever, corrupting data, or forbidding some calls that currently “succeed”. There are two potential immediate next steps: * Forbid NUL bytes comprehensively in Nix‐language strings. The current version of my CL does not do an ideal job of this, but I have a WIP that forbids direct construction from C strings and requires either string literals or explicit C++ string views, which would allow robust enforcement of this at the boundary. This is the direction I favour, and I can finish off that WIP if it’s agreed on. * Allow NUL bytes across the board and fix their behaviour; declare that Nix‐language strings can work with arbitrary binary data. This is the direction @pennae favours. Neither of these directions produces a good long‐term end state. For instance, downsides of my favoured approach: * Binary data processing is legitimately useful and incredibly annoying to do currently in Nix. My change does nothing to address this. * There are already no guarantees around string encoding in Nix and there are expressions in the wild that construct strings that are invalid UTF‐8. Downsides of @pennae’s approach: * Path names on Unix are C strings, and cannot contain NUL bytes. If we forbid NUL bytes in strings, then there are no issues here. Otherwise we have to think about rejecting and converting these at various boundaries. We’ll have to think about this kind of boundary issue regardless in future, because of the encoding issues and serialization formats (and of course, a proper UTF‐8 clean string type would allow representing the U+0000 codepoint too), but in the immediate term, NUL bytes have their own unique interoperability issues that the other 255 bytes don’t, due to C’s legacy. * In my opinion, comprehensively forbidding NUL bytes in the Nix‐language string type makes potential language evolution in this space much easier. If `open("/dev/urandom").read(1024)` already failed in Python 2, then use of the string type to process binary data or non‐NUL‐safe text encodings would have been much rarer, and I think the Python 3 string transition would have been a lot easier. The “arbitrary binary bytes except for NUL” C string legacy isn’t a good or useful type, but it’s a type that strongly discourages actually doing binary data processing. Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard. * It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. `builtins.substring` would need to handle binary data forever, so I guess we’d need to add and accept that interoperating with code that hasn’t been updated to use those built‐ins will corrupt UTF‐8. This seems suboptimal to me, because basically every string that actually gets used in practice with Nix today is UTF‐8. For the purpose of determining the available scope for future language evolution, I think it would be good to do an evaluation of the Nixpkgs release job set with a mode that enforces that Nix‐language strings are UTF‐8 (e.g., turns `builtins.substring 0 1 "🫠"` into an error). I would be interested in trying this out if people think it would be compelling data. If that works, then I believe it’s possible that judicious use of `deprecated-features` and language versioning can get us to a state where we have a UTF‐8 string type with correct Unicode behaviour from string functions, and a new, separate binary type for processing of arbitrary data. This may not be an easy transition, and adding the binary type would take a lot of language design work, but I think that would be a better final state than Nix‐language strings playing double duty, and if Nixpkgs would Just Work then I think we could close off non‐Unicode strings by way of `deprecated-features` without too much pain. On the other hand, if it doesn’t work, then that constrains our future evolution, and it’s more likely the case that we’ll have to accept that Nix‐language strings are binary data with ambiguous encoding forever. However, I don’t think we should block resolving this immediate decision on a definite long‐term design for how Nix‐language strings should work, and I think it would be bad if bikeshedding on that blocked resolving some of the super‐broken behaviour that Nix’s change has already ruled out, one way or the other. In the end, the biggest reason that forbidding NUL bytes across the board is my preferred solution is that it only forbids things that act in “UB” ways currently, and doesn’t stabilize any behaviour that didn’t already work. That means that even if the ideal path forward isn’t totally obvious, it’s perfectly safe to go with my approach and then later decide to allow arbitrary binary data including NUL bytes across the board, but the other way around has significant compatibility hazards. I’ve tried to do my best to summarize both sides here but of course I’m unavoidably biased and welcome other comments. cc @pennae @jade @piegames @alois31 @puck

emilazy added the

labels

2025-08-17 15:33:05 +00:00

lix-bot commented

2025-08-17 15:34:42 +00:00

This issue was mentioned on Gerrit on the following CLs:

comment in cl/3968 ("libexpr: use pascal strings for eval")

This issue was mentioned on Gerrit on the following CLs: * comment in [cl/3968](https://gerrit.lix.systems/c/lix/+/3968) ("libexpr: use pascal strings for eval")

emilazy commented

2025-08-17 15:40:18 +00:00

(Also, to be clear, I definitely favour switching to an explicit‐length string representation even if we ban NUL bytes.)

pennae commented

2025-08-17 16:11:42 +00:00

a few points:

Path names on Unix are C strings, and cannot contain NUL bytes. If we forbid NUL bytes in strings, then there are no issues here.

totally true! however, forbidding nul bytes in paths is just as easy as forbidding them in strings and makes much more sense because paths actually do have no-nul semantics baked into them. we already have awful validation at the interfaces and have to fix that anyway, because for example this can happen:

nix-repl> :b derivation ({ system = builtins.currentSystem; name = "foo"; builder = "/bin/sh"; args = ["-c" "echo $a"]; } // builtins.fromJSON ''{"a=2\u0000b":1}'') 
foo> 2

Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard.
[…]
It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data.

forbidding nul bytes now but adding a true binary type later means we have an eval type in the new version with by definition no corresponding type in the legacy version, unless we change the semantics again to allow nul bytes after all (in which case, why delay it to begin with). interop from old to new must treat old strings as new blobs without conversion either explicit through the user or implicit at builtins like a version-aware substring. forbidding nul specifically does absolutely nothing here because we already have all the evolution hazards you want to exclude

allowing nuls is a semantic change, but it doesn't break anything now or going forward that isn't already completely fucked and needs fixed anyway. if anything it gives us a head start that process rather than delaying it until much later.

a few points: > Path names on Unix are C strings, and cannot contain NUL bytes. If we forbid NUL bytes in strings, then there are no issues here. totally true! however, forbidding nul bytes in paths is just as easy as forbidding them in strings and makes much more sense because paths actually do have no-nul semantics baked into them. we already have awful validation at the interfaces and have to fix that anyway, because for example this can happen: ``` nix-repl> :b derivation ({ system = builtins.currentSystem; name = "foo"; builder = "/bin/sh"; args = ["-c" "echo $a"]; } // builtins.fromJSON ''{"a=2\u0000b":1}'') foo> 2 ``` > Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard. > […] > It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. forbidding nul bytes now but adding a true binary type later means we have an eval type in the new version with *by definition* no corresponding type in the legacy version, unless we change the semantics again to allow nul bytes after all (in which case, why delay it to begin with). interop from old to new *must* treat old strings as new blobs without conversion either explicit through the user or implicit at builtins like a version-aware substring. forbidding nul specifically does absolutely nothing here because we *already* have all the evolution hazards you want to exclude allowing nuls is a semantic change, but it doesn't break anything now *or* going forward that isn't already completely fucked and needs fixed anyway. if anything it gives us a head start that process rather than delaying it until much later.

raito commented

2025-08-18 09:19:05 +00:00

From my PoV (I just came back from holidays), I'm in favor of an approach that preserve the ability to write binary data which is done by some folks in the ecosystem (even if it may start as satire and ends up being a load-bearing component for kicking bootstrapping or similar). So applying extra hidden constraints seems a hidden contract breakage to me; culturally, for me, Lix is about absolutely avoiding these situations even if it cost more to the developer team. Obviously, this has to be a measured risk.

In my opinion, comprehensively forbidding NUL bytes in the Nix‐language string type makes potential language evolution in this space much easier. If open("/dev/urandom").read(1024) already failed in Python 2, then use of the string type to process binary data or non‐NUL‐safe text encodings would have been much rarer, and I think the Python 3 string transition would have been a lot easier. The “arbitrary binary bytes except for NUL” C string legacy isn’t a good or useful type, but it’s a type that strongly discourages actually doing binary data processing. Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard.

To me, we are already in-between Python 2 and Python 3, so I don't register this as a valid counterargument, things like Tarnix already exist and this does not even rely on the NUL-byte handling at all.

Playing a Whac-A-Mole with primitives that can end up writing binary data seems a fruitless fight for the time being, conversely, allowing something natural like \0 doesn't seem it will add up MANY risks to the current ecosystem of writing binary data using Nix.

It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. builtins.substring would need to handle binary data forever, so I guess we’d need to add and accept that interoperating with code that hasn’t been updated to use those built‐ins will corrupt UTF‐8. This seems suboptimal to me, because basically every string that actually gets used in practice with Nix today is UTF‐8.

That's what language versioning and nix2 intends to solve. I don't believe it's reasonable to fix everything in the current iteration of the Nix language. Some things will remain deliberately under-specified or knowingly broken because there's only so much we can afford to take care of with the policy of 'quasi'-semantic stability we pursue in Lix.

For the purpose of determining the available scope for future language evolution, I think it would be good to do an evaluation of the Nixpkgs release job set with a mode that enforces that Nix‐language strings are UTF‐8 (e.g., turns builtins.substring 0 1 "🫠" into an error). I would be interested in trying this out if people think it would be compelling data.

In general, I wish we had more time to flesh out https://github.com/piegamesde/flaker and conduct large scale analysis across the ecosystem with various features. This work is still very much in my mind and I would like to construct the infrastructure to run flaker (aka the Nix crater) runs.

If that works, then I believe it’s possible that judicious use of deprecated-features and language versioning can get us to a state where we have a UTF‐8 string type with correct Unicode behaviour from string functions, and a new, separate binary type for processing of arbitrary data. This may not be an easy transition, and adding the binary type would take a lot of language design work, but I think that would be a better final state than Nix‐language strings playing double duty, and if Nixpkgs would Just Work then I think we could close off non‐Unicode strings by way of deprecated-features without too much pain.

Language versioning being a dependency of something means that this will take a non-trivial amount of time as we do not have enough folks to lead this compared to e.g. RPC. (and the way I see it is that proper RPC synergizes to enable easier langver.)

However, I don’t think we should block resolving this immediate decision on a definite long‐term design for how Nix‐language strings should work, and I think it would be bad if bikeshedding on that blocked resolving some of the super‐broken behaviour that Nix’s change has already ruled out, one way or the other. In the end, the biggest reason that forbidding NUL bytes across the board is my preferred solution is that it only forbids things that act in “UB” ways currently, and doesn’t stabilize any behaviour that didn’t already work. That means that even if the ideal path forward isn’t totally obvious, it’s perfectly safe to go with my approach and then later decide to allow arbitrary binary data including NUL bytes across the board, but the other way around has significant compatibility hazards.

I get the idea behind the recommendation. Again, from my PoV:

having arbitrary data in the string feels like a marginal heightened risk for end user reproducibility (either you do something broken, either you do something right and Lix doesn't handle it well and this is a bug)
forbidding NUL bytes now means that we are going to prevent this for a long time because I do not foresee langver happening this year, neither Q1, Q2, Q3 2026.
we will always have significant compatibility hazards with Nixlang "unnumbered" version, that's the whole issue with the lack of langver. this issue is only one of the bazillions of known compatibility hazards (laziness semantics changing between C++ Nix versions), therefore, I would rather go into the other direction of safety here and let binary data be unleashed to inform how we should design this in the next iteration of the language (which is how I read pennae's: "head start" remark).

Either way, it feels like this case requires more chime-in from other people, and we should put probably a deadline for core team to tie-break it if we cannot converge to consensus on a reasonable timeline.

From my PoV (I just came back from holidays), I'm in favor of an approach that preserve the ability to write binary data which is done by some folks in the ecosystem (even if it may start as satire and ends up being a load-bearing component for kicking bootstrapping or similar). So applying extra hidden constraints seems a hidden contract breakage to me; culturally, for me, Lix is about absolutely avoiding these situations even if it cost more to the developer team. Obviously, this has to be a measured risk. > In my opinion, comprehensively forbidding NUL bytes in the Nix‐language string type makes potential language evolution in this space much easier. If open("/dev/urandom").read(1024) already failed in Python 2, then use of the string type to process binary data or non‐NUL‐safe text encodings would have been much rarer, and I think the Python 3 string transition would have been a lot easier. The “arbitrary binary bytes except for NUL” C string legacy isn’t a good or useful type, but it’s a type that strongly discourages actually doing binary data processing. Removing the exception and stabilizing NUL byte handling overnight opens the floodgates to arbitrary binary processing in Nix and makes future evolution of the type more of a compatibility hazard. To me, we are already in-between Python 2 and Python 3, so I don't register this as a valid counterargument, things like Tarnix already exist and this does not even rely on the NUL-byte handling *at all*. Playing a Whac-A-Mole with primitives that can end up writing binary data seems a fruitless fight for the time being, conversely, allowing something natural like `\0` doesn't seem it will add up *MANY* risks to the current ecosystem of writing binary data using Nix. > It’s not clear how to get workable semantics for text processing when evolving the language in future if we stabilize that the string type is arbitrary binary data. builtins.substring would need to handle binary data forever, so I guess we’d need to add and accept that interoperating with code that hasn’t been updated to use those built‐ins will corrupt UTF‐8. This seems suboptimal to me, because basically every string that actually gets used in practice with Nix today is UTF‐8. That's what language versioning and nix2 intends to solve. I don't believe it's reasonable to fix everything in the current iteration of the Nix language. Some things will remain deliberately under-specified or knowingly broken because there's only so much we can afford to take care of with the policy of 'quasi'-semantic stability we pursue in Lix. --- > For the purpose of determining the available scope for future language evolution, I think it would be good to do an evaluation of the Nixpkgs release job set with a mode that enforces that Nix‐language strings are UTF‐8 (e.g., turns builtins.substring 0 1 "🫠" into an error). I would be interested in trying this out if people think it would be compelling data. In general, I wish we had more time to flesh out https://github.com/piegamesde/flaker and conduct large scale analysis across the ecosystem with various features. This work is still very much in my mind and I would like to construct the infrastructure to run flaker (aka the Nix crater) runs. > If that works, then I believe it’s possible that judicious use of deprecated-features and language versioning can get us to a state where we have a UTF‐8 string type with correct Unicode behaviour from string functions, and a new, separate binary type for processing of arbitrary data. This may not be an easy transition, and adding the binary type would take a lot of language design work, but I think that would be a better final state than Nix‐language strings playing double duty, and if Nixpkgs would Just Work then I think we could close off non‐Unicode strings by way of deprecated-features without too much pain. Language versioning being a dependency of something means that this will take a non-trivial amount of time as we do not have enough folks to lead this compared to e.g. RPC. (and the way I see it is that _proper_ RPC synergizes to enable easier langver.) > However, I don’t think we should block resolving this immediate decision on a definite long‐term design for how Nix‐language strings should work, and I think it would be bad if bikeshedding on that blocked resolving some of the super‐broken behaviour that Nix’s change has already ruled out, one way or the other. In the end, the biggest reason that forbidding NUL bytes across the board is my preferred solution is that it only forbids things that act in “UB” ways currently, and doesn’t stabilize any behaviour that didn’t already work. That means that even if the ideal path forward isn’t totally obvious, it’s perfectly safe to go with my approach and then later decide to allow arbitrary binary data including NUL bytes across the board, but the other way around has significant compatibility hazards. I get the idea behind the recommendation. Again, from my PoV: - having arbitrary data in the string feels like a marginal heightened risk for end user reproducibility (either you do something broken, either you do something right and Lix doesn't handle it well and this is a bug) - forbidding NUL bytes now means that we are going to prevent this for a long time because I do not foresee langver happening this year, neither Q1, Q2, Q3 2026. - we will always have significant compatibility hazards with Nixlang "unnumbered" version, that's the whole issue with the lack of langver. this issue is only one of the bazillions of known compatibility hazards (laziness semantics changing between C++ Nix versions), therefore, I would rather go into the other direction of safety here and let binary data be unleashed to inform how we should design this in the next iteration of the language (which is how I read pennae's: "head start" remark). Either way, it feels like this case requires more chime-in from other people, and we should put probably a deadline for core team to tie-break it if we cannot converge to consensus on a reasonable timeline.

alois31 commented

2025-08-18 17:48:21 +00:00

Mainly due to the data processing usecase others have already argued earlier better, I am also moderately in favour of allowing nulls at least in the long term. Other than the possibility of more exposure to bugs (more below), the main concern I can see are interoperability hazards: it is not really possible for user code to feature-detect proper null handling, so either it can't rely on it or will break on older Lix or on the other side of the fork (which does seem to move in the direction of more thoroughly forbidding nulls).

Now on the bugs, the type system of the language being not so rich with only one string type (which I do not see changing any time soon), there are conflicting requirements on it:

Processing of textual data mostly uses ASCII printable characters, but decent Unicode support should still be available. More advanced control characters (like null) are probably not so important though.
Processing of binary data by its nature needs to support arbitrary bytes and combinations thereof, which precludes forbidding nulls or invalid UTF-8.
Serializing or converting data at interfaces:
- Paths are platform-dependent (arbitrary bytes except null on Linux, IIRC UTF-8 of some specific normal form and no nulls on Darwin). Mostly the same applies to arguments passed to external programs.
- JSON accepts Unicode.
- XML accepts Unicode except 0000, fffe and ffff (https://www.w3.org/TR/xml11/#charsets).
- Derivation attributes depend on whether __structuredAttrs is enabled. If it's off they are basically environment variables (theoretically platform-dependent, but probably arbitrary bytes except null everywhere), if it's on they are JSON (so valid Unicode only) but often passed around as environment variables too.
- Some special-purpose strings, like URLs or the fetcher arguments, have more extensive restrictions I will not try to exhaustively describe here because the comment box is too small for it.

Due to these expectations I think the best way forward is as follows:

Strings will be able to contain arbitrary bytes, including nulls. (This applies to the type, I do not have a strong opinion on whether the parser should start allowing nulls, but clearly it should not truncate except for the compatibility deprecated feature that's currently in place.)
Widely used string processing functions that are also useful on binary data and pose compatibility hazards if they started proceeding codepoint wise (stringLength and substring) should continue working on bytes. This is mostly orthogonal to the null discussion and included for completeness.
Clearly text-focused string processing functions working with regexes (match and split) should gain proper Unicode support, and ideally throw on non-UTF-8 input. Optionally binary can be supported in addition (maybe with (?-u) flag like the Rust crate). This is mostly orthogonal to the null discussion and only included for completeness.
String serialized or converted at interfaces should be validated if the target has more restrictions than the Nix language string type, and an error should be thrown if the string is not representable. If we end up supporting nulls this includes rejecting nulls if not representable instead of causing silent truncation.

If we do end up forbidding nulls on the other hand, I think they should be forbidden near where the actual string creation happens, instead of playing whack-a-mole.

Mainly due to the data processing usecase others have already argued earlier better, I am also moderately in favour of allowing nulls at least in the long term. Other than the possibility of more exposure to bugs (more below), the main concern I can see are interoperability hazards: it is not really possible for user code to feature-detect proper null handling, so either it can't rely on it or will break on older Lix or on the other side of the fork (which does seem to move in the direction of more thoroughly forbidding nulls). Now on the bugs, the type system of the language being not so rich with only one string type (which I do not see changing any time soon), there are conflicting requirements on it: * Processing of textual data mostly uses ASCII printable characters, but decent Unicode support should still be available. More advanced control characters (like null) are probably not so important though. * Processing of binary data by its nature needs to support arbitrary bytes and combinations thereof, which precludes forbidding nulls or invalid UTF-8. * Serializing or converting data at interfaces: * Paths are platform-dependent (arbitrary bytes except null on Linux, IIRC UTF-8 of some specific normal form and no nulls on Darwin). Mostly the same applies to arguments passed to external programs. * JSON accepts Unicode. * XML accepts Unicode except 0000, fffe and ffff (https://www.w3.org/TR/xml11/#charsets). * Derivation attributes depend on whether `__structuredAttrs` is enabled. If it's off they are basically environment variables (theoretically platform-dependent, but probably arbitrary bytes except null everywhere), if it's on they are JSON (so valid Unicode only) but often passed around as environment variables too. * Some special-purpose strings, like URLs or the fetcher arguments, have more extensive restrictions I will not try to exhaustively describe here because the comment box is too small for it. Due to these expectations I think the best way forward is as follows: * Strings will be able to contain arbitrary bytes, including nulls. (This applies to the type, I do not have a strong opinion on whether the parser should start allowing nulls, but clearly it should not truncate except for the compatibility deprecated feature that's currently in place.) * Widely used string processing functions that are also useful on binary data and pose compatibility hazards if they started proceeding codepoint wise (`stringLength` and `substring`) should continue working on bytes. This is mostly orthogonal to the null discussion and included for completeness. * Clearly text-focused string processing functions working with regexes (`match` and `split`) should gain proper Unicode support, and ideally throw on non-UTF-8 input. Optionally binary can be supported in addition (maybe with `(?-u)` flag like the Rust crate). This is mostly orthogonal to the null discussion and only included for completeness. * String serialized or converted at interfaces should be validated if the target has more restrictions than the Nix language string type, and an error should be thrown if the string is not representable. If we end up supporting nulls this includes rejecting nulls if not representable instead of causing silent truncation. If we do end up forbidding nulls on the other hand, I think they should be forbidden near where the actual string creation happens, instead of playing whack-a-mole.

👍 1

piegames commented

2025-08-24 16:00:01 +00:00

In general, I wish we had more time to flesh out https://github.com/piegamesde/flaker and conduct large scale analysis across the ecosystem with various features.

Note that Flaker currently is a parser-only framework, while it would be nice to be able to do eval diffing that capability is currently nonexistent

Thinking long-term (i.e. with langver), there are only two realistic options for handling strings

Strings and bytes are distinct types

For legacy interop, this means treating all current strings as "bytes"
This leaves the question open of what to do with other potentially useful encodings, like ASCII strings
A slight variation would be to have the string type annotated with an "encoding" subtype

Strings don't have any encoding and the encoding is up to the processing function (similar to to_ascii_lowercase functions in Rust)

This may require additional sanity checks at API boundaries

Both will require significant refactoring of builtins. Without having thought things through in more detail, my gut feeling tends towards the second option.

As for the current decision, I see a consensus forming around allowing NUL bytes. In terms of interop worries with origNix, now that they've implemented checks to forbid NUL bytes we are safe to allow them; at no point in time are we at risk of one Nix implementation evaluating to one value while the other gives a different value anymore (modulo divergence, which is explicitly fine in my eyes). (We still do have that risk w.r.t. older origNix and Lix versions, but I'm not too worried about that.)

> In general, I wish we had more time to flesh out https://github.com/piegamesde/flaker and conduct large scale analysis across the ecosystem with various features. Note that Flaker currently is a parser-only framework, while it would be nice to be able to do eval diffing that capability is currently nonexistent -------- Thinking long-term (i.e. with langver), there are only two realistic options for handling strings 1. Strings and bytes are distinct types - For legacy interop, this means treating all current strings as "bytes" - This leaves the question open of what to do with other potentially useful encodings, like ASCII strings - A slight variation would be to have the string type annotated with an "encoding" subtype 2. Strings don't have any encoding and the encoding is up to the processing function (similar to `to_ascii_lowercase` functions in Rust) - This may require additional sanity checks at API boundaries Both will require significant refactoring of builtins. Without having thought things through in more detail, my gut feeling tends towards the second option. -------- As for the current decision, I see a consensus forming around allowing NUL bytes. In terms of interop worries with origNix, now that they've implemented checks to forbid NUL bytes we are safe to allow them; at no point in time are we at risk of one Nix implementation evaluating to one value while the other gives a different value anymore (modulo divergence, which is explicitly fine in my eyes). (We still *do* have that risk w.r.t. older origNix and Lix versions, but I'm not too worried about that.)

piegames commented

2025-10-05 11:14:47 +00:00

To prevent this from going stale and making progress, I call a core team vote on the following actions:

Treat all string values in the evaluator as arbitrary and un-encoded byte sequences, and work towards adapting all APIs to match that notion
Proceed with merging https://gerrit.lix.systems/c/lix/+/3968, which will fix all current correctness issues with strings containing nul-bytes in the evaluator, and thus removing the immediate need for forbidding them
Close https://gerrit.lix.systems/c/lix/+/3921

To prevent this from going stale and making progress, I call a core team vote on the following actions: - Treat all *string values* in the evaluator as arbitrary and un-encoded byte sequences, and work towards adapting all APIs to match that notion - Proceed with merging https://gerrit.lix.systems/c/lix/+/3968, which will fix all current correctness issues with strings containing nul-bytes in the evaluator, and thus removing the immediate need for forbidding them - Close https://gerrit.lix.systems/c/lix/+/3921

👍 4

pennae commented

2025-10-10 17:15:27 +00:00

now that there seems to be broad approval from the core team that allowing nul bytes is the way we want to take there's another issue that popped up due to the recent pointer-tagging related rework of string handling: if strings can contain nul bytes then paths can too, and what should we do about that?

disallow path values containing nuls being created at all, or
rely on our syscall wrappers to throw errors when such paths are used?

we'd opt for allowing nuls in paths as well and throwing at the os interface only. c++ std::filesystem::path allows nuls in path values but truncates at the os interface (like we previously did, only in a different place). this is pretty dangerous behavior and we're not sure whether it's a bug in libstdc++ or something else, but since paths are just strings with slightly modified interpolation semantics it kind of makes sense. allowing nuls also doesn't break eval for expressions that would add nuls to paths but never would have passed them to an OS boundary.

we definitely need to treat nuls making their way into path arguments of syscalls as a reproducibility failure, but not all values containing nuls immediately are such failures. we already treat paths as "unevaluated" to an extent, with "add to store and return store path" forcing semantics. feels like it makes sense to allow creation if thusly invalid paths in the same way it's legal to create throwing thunks in lists: as long as they're never forced everything is fine. (this does make debugging harder, but debugging is already a real problem and for many things we really need a kind of provenance analysis rather than the mostly useless stack traces we have today)

now that there seems to be broad approval from the core team that allowing nul bytes is the way we want to take there's another issue that popped up due to the recent pointer-tagging related rework of string handling: if strings can contain nul bytes then paths can too, and what should we do about that? 1. disallow path values containing nuls being created at all, or 2. rely on our syscall wrappers to throw errors when such paths are used? we'd opt for allowing nuls in paths as well and throwing at the os interface only. c++ `std::filesystem::path` allows nuls in path values but *truncates* at the os interface (like we previously did, only in a different place). this is pretty dangerous behavior and we're not sure whether it's a bug in libstdc++ or something else, but since paths are just strings with slightly modified interpolation semantics it kind of makes sense. allowing nuls also doesn't break eval for expressions that would add nuls to paths but never would have passed them to an OS boundary. we definitely need to treat nuls making their way into path arguments of syscalls as a reproducibility failure, but not all values containing nuls immediately are such failures. we already treat paths as "unevaluated" to an extent, with "add to store and return store path" forcing semantics. feels like it makes sense to allow creation if thusly invalid paths in the same way it's legal to create throwing thunks in lists: as long as they're never forced everything is fine. (this does make debugging harder, but debugging is already a real problem and for many things we really need a kind of provenance analysis rather than the mostly useless stack traces we have today)

jade commented

2025-10-10 17:44:19 +00:00

I think that banning creating invalid paths with nulls in them in the first place is probably a safer and simpler option as it avoids us having to maintain invariants throughout the codebase that are harder to maintain than "these just aren't null safe".

It's IMO a bit scary to expose all our path handling code to null bytes where they will be always mishandled because the OS is written in C. And the ways to create paths are themselves usually behind thunks so it's not like we're checking path values that eagerly.

I think that banning creating invalid paths with nulls in them in the first place is probably a safer and simpler option as it avoids us having to maintain invariants throughout the codebase that are harder to maintain than "these just aren't null safe". It's IMO a bit scary to expose all our path handling code to null bytes where they will be *always* mishandled because the OS is written in C. And the ways to create paths are themselves usually behind thunks so it's not like we're checking path values *that* eagerly.

pennae commented

2025-10-10 17:55:10 +00:00

cl/4290 already secures the OS boundaries with a clang lint. forbidding nul-bearing paths does not replace this at all because many (probably all) locations in libexpr that accept paths also accept strings, and checking strings everywhere they are used as paths is a lot harder than checking the OS boundary.

And the ways to create paths are themselves usually behind thunks so it's not like we're checking path values that eagerly.

yeah, but substr 0 6 (/tmp/x${fromJSON ''"\u0000whatever"''}) creates a valid path value. doing this in the string domain and converting the result to a path would eval, doing in the path domain would not. that just feels weird, especially since we already support creation of invalid paths (especially on macos where unicode limits apply). a single arbitrary restriction that doesn't even solve the entire problem just feels like a non-solution?

cl/4290 already secures the OS boundaries with a clang lint. forbidding nul-bearing paths does not replace this at all because many (probably *all*) locations in libexpr that accept paths also accept strings, and checking strings everywhere they are used as paths is a lot harder than checking the OS boundary. > And the ways to create paths are themselves usually behind thunks so it's not like we're checking path values that eagerly. yeah, but `substr 0 6 (/tmp/x${fromJSON ''"\u0000whatever"''})` creates a valid path value. doing this in the string domain and converting the result to a path would eval, doing in the path domain would not. that just feels *weird*, especially since we *already* support creation of invalid paths (especially on macos where unicode limits apply). a single arbitrary restriction that doesn't even solve the entire problem just feels like a non-solution?

jade commented

2025-10-10 22:06:26 +00:00

oh, great, we fixed it with encapsulation and a lint. thank you for doing that! then I'm not bothered with your runtime error approach at all as it seems likely we just won't have many bugs.

👍 1

Sign in to join this conversation.