Replace the regex with something consistent cross-platform #34

Open
opened 2024-03-14 20:17:53 +00:00 by jade · 7 comments
Owner

Perhaps fork rust regex or such. Whatever we want. Currently the regexes are not the same cross platform, which is hilarious.

jade added the stability label 2024-03-14 20:17:53 +00:00
Owner

Wait does Lix not use Boost regex? Or is Boost regex seriously not cross-platform?

Author
Owner

> Wait does Lix not use Boost regex? Or is Boost regex seriously not cross-platform?

I don't know. All I know is there have been language-visible differences in regex behaviour across platforms, which is *ridiculous*. It may have been fixed since?

Author
Owner

Oh no I found cursed information in my signal history: @rbt:

> they tried to replace it with boost but had to revert it because it wasn’t compatible with the regex escaper in NIXPKGS

NOOOOOO

well. ok. this means we have to have a custom regex implementation or just vendor the glibc one. but std::regex is terrible as I recall.

Author
Owner

i think there's three solutions here: 

  • go find if someone packaged libc++ regex, vendor it
  • hack rust regex into being compatible with the status quo, vendor it
  • delete features from boost regex until it is soundly escaped by nixpkgs, vendor it
jade added this to the Broken regexes project 2024-06-24 23:47:42 +00:00

out of curiosity, i took a look at rust's regex, and unlike the boost regex engine, it's fine with `}]` as a regular expression with no need to escape it (nixpkgs doesn't escape those two characters, which makes sense considering that for the current regex engine, `\}` and `\]` are outright syntax errors)

i think rust's regex is a good regex engine myself, albeit it may be a good idea to introduce a layer to convert current regular expressions (posix ereg) to those

of course, with such a layer, boost regex could work too

there are essentially two issues i can think of:

## different character set syntax, which can break code in some situations

`regex` crate has a [pretty unique character set syntax](https://docs.rs/regex/latest/regex/index.html#character-classes), providing access to set operations. most of those are accepted by the current parser, but behave completely differently. admittedly, most of them are pretty tricky to use accidentally, except for nested character classes

a regex like `[[]` will be accepted by the current parser, but it won't be accepted by rust's regex, as it sees `[` as the introduction of a nested character class

a regex like `[[=a=]]` will change its meaning from `[aA]` to `[=a]`, as rust's regex doesn't support equivalence class expressions (albeit, i quickly took a look at github, and nobody uses those with nix, probably because `[aA]` is a much more reasonable way to write this)
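To make the translation-layer idea concrete for the two bracket cases above, here is a minimal sketch in Python (used only as an illustration language; this is not code from Lix or from the `regex` crate). The helper name `translate_brackets` is hypothetical. It assumes POSIX rules inside brackets (a backslash is a literal, a `]` in first position is a literal) and that the target engine wants `[` and `\` escaped there. It refuses equivalence classes and collating symbols instead of silently changing their meaning, and it does not handle the set operators at all.

```python
def translate_brackets(pattern: str) -> str:
    """Escape literal '[' and '\\' inside POSIX bracket expressions.

    Hypothetical helper: one small piece of a possible translation layer.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            # Outside brackets, copy an escape pair through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if c != "[":
            out.append(c)
            i += 1
            continue
        # Start of a bracket expression.
        out.append("[")
        i += 1
        if i < n and pattern[i] == "^":
            out.append("^")
            i += 1
        if i < n and pattern[i] == "]":
            # In POSIX, a ']' in first position is a literal.
            out.append("\\]")
            i += 1
        closed = False
        while i < n:
            c = pattern[i]
            if c == "]":
                out.append("]")
                i += 1
                closed = True
                break
            if c == "[" and i + 1 < n and pattern[i + 1] in ":=.":
                kind = pattern[i + 1]
                end = pattern.find(kind + "]", i + 2)
                if end == -1:
                    raise ValueError("unterminated [%s ... %s]" % (kind, kind))
                if kind != ":":
                    # [=a=] and [.a.] are refused rather than reinterpreted.
                    raise ValueError("equivalence class or collating symbol")
                out.append(pattern[i:end + 2])  # keep [:alpha:] and friends
                i = end + 2
                continue
            # Inside a POSIX bracket, '[' and '\' are plain literals.
            out.append("\\" + c if c in "[\\" else c)
            i += 1
        if not closed:
            raise ValueError("unterminated bracket expression")
    return "".join(out)


print(translate_brackets("[[]"))           # [\[]
print(translate_brackets("[[:alpha:]_]"))  # [[:alpha:]_]
```

With this sketch, `[[]` becomes `[\[]`, while `[[=a=]]` raises an error rather than quietly turning into a different character class.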

also, because the new regex engine is unicode aware, `builtins.match ".*[Ω].*" "β"` won't match anymore (personally i see this as a bugfix, but in theory it can break stuff)
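One plausible reading of why that expression matches today, assuming the current engine treats both the pattern and the subject as raw UTF-8 bytes: `Ω` and `β` share their first byte, so a byte-level `[Ω]` is a set of two bytes, one of which occurs in `β`. The sketch below uses Python's `re` purely as a stand-in to contrast byte-oriented and code-point-oriented matching; it is neither the engine Nix uses nor the Rust crate.

```python
import re

pattern = ".*[Ω].*"
subject = "β"

# Code-point-oriented: the class contains exactly one character, Ω.
unicode_match = re.fullmatch(pattern, subject)

# Byte-oriented: the class contains the two UTF-8 bytes of Ω.
byte_match = re.fullmatch(pattern.encode("utf-8"), subject.encode("utf-8"))

print("Ω".encode("utf-8"))     # b'\xce\xa9'
print("β".encode("utf-8"))     # b'\xce\xb2'
print(unicode_match is None)   # True
print(byte_match is not None)  # True
```

The byte-level match succeeds only because the shared lead byte `0xCE` is in the class, which is why the change in behaviour reads as a bugfix.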

## new syntax, introducing forwards-compatibility hazard

`regex` crate has a lot of other syntax. i don't think it introduces incompatibilities (other than the character set additions i mentioned), as the current regex syntax is quite strict in terms of what it accepts (rejecting anything outside of the posix spec), but it will make it trickier to migrate to anything else in the future once we have migrated to it:

  • everything involving character sets, `regex` supports unusual features here with its set operations
  • unicode perl character classes - current regex engine doesn't support `\d` at all, and many other regex engines don't use unicode definitions for those
  • unicode character properties (`\p{Greek}` requires huge tables to work, albeit note that `regex` has a compile-time feature to disable them)
  • unicode boundaries (`\b`, `\>`, `\<`), supporting `\>` and `\<` syntax is quite unusual, most regex engines interpret those as literal `>` and `<`
  • uncommon flags not seen in other regex engines such as `R` (crlf mode) and `U` (which swaps meaning of `*` and `*?`)
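One way to limit the forwards-compatibility hazard would be a strict gate in front of whichever engine is chosen, refusing the extra syntax before the engine ever sees it. The sketch below is a rough illustration in Python, not an existing Lix feature. The names `non_posix_constructs` and `_skip_bracket` are hypothetical, and the rule set (backslash followed by a letter, digit, `<` or `>`; `(?` groups and flags; lazy quantifiers) is an assumption about what such a gate might flag. Bracket expressions are skipped rather than inspected.

```python
def _skip_bracket(pattern: str, i: int) -> int:
    """Return the index just past the bracket expression starting at pattern[i]."""
    j = i + 1
    if j < len(pattern) and pattern[j] == "^":
        j += 1
    if j < len(pattern) and pattern[j] == "]":
        j += 1  # a leading ']' is a literal in POSIX
    while j < len(pattern):
        if pattern[j] == "[" and j + 1 < len(pattern) and pattern[j + 1] in ":=.":
            # Skip [:class:], [=x=] and [.x.] so their ']' does not end the bracket.
            end = pattern.find(pattern[j + 1] + "]", j + 2)
            if end == -1:
                return len(pattern)
            j = end + 2
        elif pattern[j] == "]":
            return j + 1
        else:
            j += 1
    return len(pattern)


def non_posix_constructs(pattern: str) -> list:
    """List constructs that this hypothetical strict gate would refuse."""
    found = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "[":
            i = _skip_bracket(pattern, i)
        elif c == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt.isalnum() or nxt in "<>":
                found.append("\\" + nxt)  # e.g. \d, \p, \b, \<, \>
            i += 2
        elif pattern.startswith("(?", i):
            found.append("(?")  # flags such as (?R), (?U) and special groups
            i += 2
        elif c in "*+?}" and pattern.startswith("?", i + 1):
            found.append(c + "?")  # lazy quantifiers such as *?
            i += 2
        else:
            i += 1
    return found


print(non_posix_constructs(r"\d+\p{Greek}"))  # ['\\d', '\\p']
print(non_posix_constructs(r"^(foo|bar)+$"))  # []
```

A gate like this keeps the accepted language close to what the current engine accepts, so a later move to yet another engine would not have to reproduce the extra syntax.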
Author
Owner

thank you so much for doing this investigation. indeed our primary worry is the introduction of new syntax and our secondary worry is the introduction of new features.

there's also some material in here that talks about different stuff than you looked at that's also relevant: https://wiki.lix.systems/books/lix-contributors/page/regexp-engine-investigation

Member

This issue was mentioned on Gerrit on the following CLs:

  • commit message in cl/1821 ("libexpr: Replace regex engine with boost::regex")
jade reopened this issue 2024-08-23 00:12:08 +00:00
Reference: lix-project/lix#34