RFD: Structured interpolation via quasi-quoters and AST representation structures #835

Open
opened 2025-05-19 15:05:02 +00:00 by raito · 4 comments
Owner

Firstly, this idea is not mine; all credit goes to @pennae ~~and @delroth, who came up with these ideas (independently, in addition, so it must be good :P)~~ (oopsie, miscommunications). This issue is a way to keep track of this milestone goal.

Problem

The legacy ${ … } interpolation conflates coercion and concatenation, which

  • makes it impossible to perform precise float / numeric formatting (3.141592653589793…);
  • actually shares the coercion infrastructure with function argument passing;
  • prevents users from extending interpolation with their own behavior.

For example, users would like the expression let p = 2516; in "${p}" to work out of the box without having to write toString p all the time; see https://gerrit.lix.systems/c/lix/+/3191 for an attempt to solve this.

Design

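Introduce a format-AST with a quasi-quoter (https://en.wikipedia.org/wiki/Quasi-quotation), as seen in Lisp (https://3e8.org/pub/scheme/doc/Quasiquotation%20in%20Lisp%20(Bawden).pdf) or Lean 4 (https://github.com/leanprover-community/quote4):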

# list report
format `(
  "items: " (length xs) ": " (concatStringsSep ", " xs)
)`

# float with printf directives
format `("π ≈ " `("%.3g" 3.1415926))`
  • Everything inside format ⟨…⟩ is parsed into the AST; no coercion yet.
  • A leading back-tick on a string literal activates a printf mini-language (%d, %.3g, %x, …).
  • Rendering to string happens only when a true string is demanded.
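
To make the "parse first, render on demand" idea concrete, here is a minimal sketch in today's Nix that models a format-AST as plain data. Everything in it is an assumption for illustration: the constructors `lit` and `int`, the `type`/`value` node shape, and the `render` function are hypothetical and are not part of the proposal or of any existing builtin.

```nix
let
  # Hypothetical node constructors; the node shape is made up for this sketch.
  lit = s: { type = "lit"; value = s; };
  int = n: { type = "int"; value = n; };

  # Each node kind decides explicitly how it becomes a string.
  renderNode = node:
    if node.type == "lit" then node.value
    else if node.type == "int" then builtins.toString node.value
    else throw "unknown format node: ${node.type}";

  # Rendering happens only here, when a real string is demanded.
  render = nodes: builtins.concatStringsSep "" (map renderNode nodes);

  xs = [ "a" "b" ];

  # The "list report" example, expressed as data instead of new syntax.
  report = [
    (lit "items: ")
    (int (builtins.length xs))
    (lit ": ")
    (lit (builtins.concatStringsSep ", " xs))
  ];
in
  render report  # evaluates to "items: 2: a, b"
```

The point of the sketch is only that `report` stays a structured value until `render` is called, so formatting rules live in one place instead of in the implicit coercion path.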

Migration

  • ${ … } remains supported as-is.
  • A lint pass may suggest automatic rewrites to the new syntax.

CppNix divergence

This feature is a clear departure from CppNix's syntax.

Action Items

  1. Ratify format ⟨…⟩ surface syntax and the supported printf subset.
  2. Prototype a parser which extends the grammar and provides this new syntax (language version may need to be ready by then, unclear?)
raito changed title from Structured interpolation via quasi-quoters and AST representation structures to RFD: Structured interpolation via quasi-quoters and AST representation structures 2025-05-19 15:05:21 +00:00
Member

I'm unconvinced that this is the right direction (and I think I might have miscommunicated at some point because this isn't really something I've considered).

To go over your problem statement:

  • "actually shares the coercion infrastructure with function argument passing" is an implementation detail, and given that right now ${} is in fact fairly restrictive about what it accepts (only supporting string/path/external/attrs that have __toString) there's no reason this couldn't be split off into a separate code path while keeping 100% compatibility (or, I don't know, change the already bad bool coerceMore to an enum class CoercionMode).
  • "prevents users from extending interpolation with their own behavior": not really, if you consider that the contents of a ${} can be any Nixlang expression returning a string. You can extend the behavior by just calling a function if you need to.
  • "makes it impossible to perform precise float / numeric formatting (3.141592653589793…)": same as previous point imo.

I fail to see what makes for example:

format `(
  "items: " (length xs) ": " (concatStringsSep ", " xs)
)`

format `("π ≈ " `("%.3g" 3.1415926))`

Better than:

"items: ${length xs}: ${concatStringsSep ", " xs}"
"π ≈ ${format "%.3g" 3.1415926}"

Which already works now, is arguably more readable, and does not introduce more syntax. (Assuming a format function which supports properly formatting floats. That's an orthogonal problem imo.)


When I originally ranted about what ended up being https://gerrit.lix.systems/c/lix/+/3191 I was mostly annoyed at the case of ${} with integers. This is a case that has a clear, unambiguous coercion to string, with no data loss. Coercing integers to strings in the context of a ${} templating is also imo not something that makes Nixlang "more weakly typed", not any more than having a toString function which accepts different types of arguments, and that's 1. the case; 2. the current main alternative to ${anIntVariable} anyway.

Member

> Which already works now

Oops, no, it doesn't already work now, because coercing ints with ${} doesn't work. For the 3rd time in a week I would have had an eval failure because I didn't write ${builtins.toString (length xs)} if this was real code.
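
For reference, a small sketch of the spelling that does evaluate today, with the explicit `toString` around the integer. The binding `xs` is just a sample value chosen for illustration.

```nix
let
  xs = [ "a" "b" ];
in
  # `${builtins.length xs}` alone fails at evaluation, because integers are
  # not coerced to strings inside ${}; the explicit toString is required.
  "items: ${builtins.toString (builtins.length xs)}: ${builtins.concatStringsSep ", " xs}"
  # evaluates to "items: 2: a, b"
```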

Owner

this isn't just about formatting strings, it's about the general ability to represent things more cleanly. formatting is one example of this. another is a more principled way to specify build instructions (not that "lol everything gets chucked into a bash script by string replacement" was easy to make any worse).

the root problem we want to solve is that string coercions are a complete trap, especially when combined with toString (due to it formatting null and false as "", true as "1", and lists as space-separated concatenations of their elements). if we're going to change interpolation behavior at all it should be principled and thought out from the beginning, not just tack on more hacks to an already broken system. a formatting system with sensible (i.e., not terminally bash-brained) behavior cannot begin with the current interpolation system if you already refuse to write ${toString x} instead of ${x} because the stringification rules of the language are complete garbage. building a new system on quasiquotes is not necessary, but it sure is convenient because it allows us to implement the actual formatting behavior in a place that isn't another broken builtin
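
the toString behaviors listed above can be checked directly; a small sketch using only existing builtins (the final string is just a marker value picked for this example):

```nix
# Each assertion holds in current Nix, which is exactly the trap:
# distinct values silently collapse to the same or surprising strings.
assert builtins.toString null == "";
assert builtins.toString false == "";
assert builtins.toString true == "1";
assert builtins.toString [ 1 2 3 ] == "1 2 3";
"all assertions hold"
```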


Generally I am all for the proposal!
Formatting was always a source of massive frustration in the language for me (whenever I ran into it, which isn't too often, but the times I did it was just… urgh).
And the idea to introduce AST features, paving the way for future improvements in that direction, while not necessary, is convenient as stated before.

Some unqualified thoughts hidden behind a spoiler and disclaimer:

Big disclaimer: I am not a language person, don't take my word for anything I say here, feel free to ignore

I wanted to split this into a semantics and syntax section, but somehow I failed, so here goes a mix of both, in a horrible back and forth mix of several different stances on different parts of the proposal.
However, starting off: is the syntax presented in the Design section of the OP (https://git.lix.systems/lix-project/lix/issues/835#user-content-design) an example of what it could look like, or a solid syntax proposal (or something otherwise agreed upon)?
If it is more than a suggestion already then please read the following as an opinion only.
I will readily admit that the syntax is unfamiliar to me so my concerns are probably amplified by that, meaning I'm not a good baseline for any serious criticism.

Anyway, taking this example from the OP:

format `(
  "items: " (length xs) ": " (concatStringsSep ", " xs)
)`

(Note that I am largely ignoring the second format in the OP because I can't wrap my head around what that would do or how its syntax works; if someone had a cohesive explanation of that specific bit, I think it would also answer most of my uncertainties below.)

What I'm reading here (as someone unfamiliar with language development) is:

  • backticks are the delimiters isolating quasi-quotation from other code
  • the outer parentheses are either
    • part of the backtick delimiters (meaning the delimiters are two characters wide)
    • it could be that (having had a glance over at LISP quasi-quotation) backticks themselves are already sufficient for quasi-quoting the next token and the parens extend this to a list of tokens
  • the content of the parens has regular Nix list syntax and thereby represents a list of tokens
    • I don't think so from reading the example, but I could imagine that the parentheses are actually special syntax constructs that evaluate what's inside instead of treating it as a token (basically evaluating the construct on the spot to a quasi-quotation of the values "items: ", 2, ": ", and "a, b" for an xs of [ "a" "b" ])
      • since the language does not have any other way to distinguish between [ length xs ] and [ (length xs) ] in regular code except for parens (or using a binding of sorts) I would assume that the inner parens are not special, but more on that below
  • each token is just regular Nix syntax
  • and then one of these applies
    • since the feature is quasi-quotation I assume that everything enclosed in backticks turns into a single "value" at eval time
      • it could potentially be assigned in a let binding or passed around the code
      • is basically a value of type AST
      • format is either a keyword or a builtin which takes a value of type AST (and could potentially even be written as builtins.format)
    • this is a syntax construct that requires the use of a format keyword followed by quasi-quotation and any use other than this is a syntax error (i.e. format is not standalone and in absence of any other keywords the quasi-quotation is also not usable by itself)

Given this I, as a potential user, would be a bit confused and consider it somewhat unintuitive.

If the last bullet point applies I'd prefer a syntax like this for instance:

format:`(
	"something"
)`

Seeing this bothers me personally because I can't put the parens after a newline, but that'd be a me-problem.
More importantly, however, it would indicate that the combination of format and the rest of it makes up a value and that they are not separable.

On the other hand, if format is a builtin or keyword which is independent (although not really usable without an AST built from quasi-quotation), which the space separation indicates to me, then I would expect this to work:

let
	foo = "bar";
	ast = `( "foo: " (foo) "\n" )`;

in
	format ast

This also is much more in line with what little I've seen from LISP in that 2 minute glance (though LISP is a bit special in how code is just data).

Note also that I put parens around foo.
As mentioned earlier, reading the OP I am not sure how these come into play exactly but I have some assumptions.
However, as an alternative to that (and maybe it's just because I've worked with it recently), I feel like the maud library (Rust) has interesting syntax which could be applicable here (https://maud.lambda.xyz/control-structures.html).
If this feature does introduce quasi-quotation as a general feature (and not strictly tied to format), regardless of whether it is otherwise possible to use, it would be good to get that part right from the start.
It would be awful if we ended up with several different quasi-quotation syntaxes depending on whether you format things or use some other feature which may use quasi-quotation later (tryEval comes to mind).
Basically, the syntax there (in maud) means that anything written as-is is taken as a token as-is, but surrounding it with parens makes it an expression.
So in the quasi-quotation syntax here foo could refer to a foo token while (foo) would evaluate (lazily I assume) the local foo and use its value as part of the quasi-quotation AST thingy.

This would also mean that the syntax inside the backtick-parens is not necessarily purely literal, opening up possibilities like this:

let
	number1 = 0.1234;
	number2 = 5;
in
format `(
	# constant values get special treatment since they don't need evaluation, so they're allowed directly
	"regular string"
	# the above would basically be this, but since 42 and "foo" are constants anyway I think there's value in adding a shorthand
	("also a string")

	# this on the other hand I would handle either as an error or an actual AST token since interpolation requires evaluation
	#"some ${number1}"
	# this would work on the other hand
	("some ${number1}")

	# "special" syntax constructs could be implemented which only work in qq context, like a formatting literal
	# the formatting syntax I use here is just something I made up, but it would have the advantage of parens consistently meaning "evaluate this"
	f"%06.2(number1)f" # 000.12

	# honestly no idea how to represent something like a hunk such as `builtins.mul 2`; maybe something like braces to get an *actual* AST instead of a list of tokens?
	# this would then be an AST of a function application
	# (as opposed to a "partially applied primop mul", since that'd be post-eval)
	#{ builtins.mul 2 }
)`

Now I'm straying pretty far from the proposal, but this would generalize a lot of things.
If we then also changed the backtick-parens to backtick-brackets to be in line with list syntax then it would be pretty clear that this is a list of tokens, and since formatting would be syntax inside quasi-quoting we wouldn't need a format builtin, but rather just a concat one which concats an AST-list of strings into a single string.

assert concat `[ "string1" " string2 " f"%02(5)d" ]` == "string1 string2 05"

However, there are "drawbacks" such as requiring literals for the format specifier. Even ignoring the really broken formatting options (https://github.com/HexHive/printbf), I don't think runtime format options within Nix code are that important, but that's not my area of expertise (none of this is, I cannot overstate that enough).

In general though, this looks pretty neat to me:

# evaluating code and making the result a single AST node
assert evalTypeOf `( builtins.mul 2 3 )` == "int"

# making an actual AST *tree* (not just a list) which can be eval'd
assert eval `{ builtins.mul 2 3 }` == 6

# list of tokens, can be concatenated
assert concat `[ "string1" " string2 " f"%02(5)d" ]` == "string1 string2 05"
# note that the above would be more or less the same as `{ [ "..." "..." "..." ]}`
Reference: lix-project/lix#835