Replace the regex with something consistent cross-platform

jade commented

2024-03-14 20:17:53 +00:00

Owner

Perhaps fork rust regex or such. Whatever we want. Currently the regexes are not the same cross platform, which is hilarious.


jade added the

stability

label 2024-03-14 20:17:53 +00:00

qyriad commented

2024-03-14 22:50:15 +00:00

Owner

Wait does Lix not use Boost regex? Or is Boost regex seriously not cross-platform?

jade commented

2024-03-15 01:52:31 +00:00

Author

Owner

> Wait does Lix not use Boost regex? Or is Boost regex seriously not cross-platform?

I don't know. All I know is there have been language-visible differences in regex behaviour across platforms, which is *ridiculous*. It may have been fixed since?


jade commented

2024-03-15 01:59:14 +00:00

Author

Owner

Oh no I found cursed information in my signal history: @rbt:

> they tried to replace it with boost but had to revert it because it wasn’t compatible with the regex escaper in NIXPKGS

NOOOOOO

well. ok. this means we have to have a custom regex implementation or just vendor the glibc one. but std::regex is terrible as I recall.


jade commented

2024-03-15 02:05:36 +00:00

Author

Owner

i think there's three solutions here:

* go find if someone packaged libc++ regex, vendor it
* hack rust regex into being compatible with the status quo, vendor it
* delete features from boost regex until it is soundly escaped by nixpkgs, vendor it


jade added this to the Broken regexes project 2024-06-24 23:47:42 +00:00

jade referenced this issue

2024-08-18 10:49:31 +00:00

Matching 500KB of data with builtins.match causes stack overflow #476

sugar commented

2024-08-18 15:38:38 +00:00

out of curiosity, i took a look at rust's regex, and unlike boost regex engine, it's fine with `}]` as a regular expression with no need to escape it (nixpkgs doesn't escape those two characters, which makes sense considering for the current regex engine, `\}` and `\]` are outright syntax errors)

i think rust's regex is a good regex engine myself, albeit it may be a good idea to introduce a layer to convert current regular expressions (posix ereg) to its syntax

of course, with such a layer, boost regex could work too

there are essentially two issues i can think of:

## different character set syntax, which can break code in some situations

`regex` crate has a [pretty unique character set syntax](https://docs.rs/regex/latest/regex/index.html#character-classes), providing access to set operations. most of those are accepted by the current parser, but behave completely differently. admittedly, most of those are pretty tricky to use accidentally, except for nested character classes

a regex like `[[]` will be accepted by current parser, but it won't be accepted by rust's regex, as it sees `[` as an introduction for nested character class

a regex like `[[=a=]]` will change its meaning from `[aA]` to `[=a]`, as rust's regex doesn't support equivalence class expressions (albeit, i quickly took a look at github, and nobody uses those with nix, probably because `[aA]` is a much more reasonable way to write this)

also, because the new regex engine is unicode aware, `builtins.match ".*[Ω].*" "β"` won't match anymore (personally i see this as a bugfix, but in theory it can break stuff)
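The `builtins.match ".*[Ω].*" "β"` case can be reproduced outside Nix. The sketch below uses Python's `re` module purely as a stand-in (it is not the engine Lix uses), and it assumes the current behaviour comes from matching UTF-8 bytes rather than code points: "Ω" encodes as the bytes CE A9 and "β" as CE B2, so a byte-level bracket expression sees a shared first byte.

```python
import re

pattern = ".*[Ω].*"  # same regex as in the builtins.match example
subject = "β"

# Byte-oriented matching (assumed model of the current engine):
# "[Ω]" becomes the byte set {0xCE, 0xA9}. "β" is the bytes CE B2,
# and its first byte is in that set, so the whole string matches.
byte_match = re.fullmatch(pattern.encode("utf-8"), subject.encode("utf-8"))

# Code-point-oriented matching (what a unicode-aware engine does):
# the set contains only U+03A9, so U+03B2 does not match.
text_match = re.fullmatch(pattern, subject)

print(byte_match is not None)  # True
print(text_match is not None)  # False
```

`fullmatch` is used because `builtins.match` has to match the entire string.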

## new syntax, introducing forwards-compatibility hazard

`regex` crate has a lot of other syntax. i don't think it introduces incompatibilities (other than the character set additions i mentioned), as the current regex syntax is quite strict in terms of what it accepts (rejecting anything outside of posix spec), but it will make it trickier to migrate to anything else in the future once we've migrated to it

- everything involving character sets, `regex` supports unusual features here with its set operations
- unicode perl character classes - current regex engine doesn't support `\d` at all, and many other regex engines don't use unicode definitions for those
- unicode character properties (`\p{Greek}` requires huge tables to work, albeit note that `regex` has a compile-time feature to disable them)
- unicode boundaries (`\b`, `\>`, `\<`), supporting `\>` and `\<` syntax is quite unusual, most regex engines interpret those as literal `>` and `<`
- uncommon flags not seen in other regex engines such as `R` (crlf mode) and `U` (which swaps meaning of `*` and `*?`)
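A rough sketch of what the conversion layer mentioned above could look like for bracket expressions only. Everything here is hypothetical: the names `translate`, `translate_bracket` and `UnsupportedBracket` are made up for illustration, and the escaping rules are an assumption based on the `regex` crate's documented character class syntax. It is not what Lix implemented, and it does not handle `--` (set difference), which would need range-aware parsing.

```python
class UnsupportedBracket(ValueError):
    """Hypothetical error for POSIX constructs this sketch cannot translate."""


# Literal inside a POSIX bracket expression, but assumed to be syntax
# (nesting, escapes, set operations) inside a Rust character class.
NEEDS_ESCAPE = set("[\\&~")


def translate_bracket(src, i):
    """Translate the bracket expression starting at src[i] == '['.

    Returns (translated text, index just past the closing ']').
    """
    out = ["["]
    i += 1
    if i < len(src) and src[i] == "^":
        out.append("^")
        i += 1
    first = True  # a ']' in first position is a literal in POSIX
    while i < len(src):
        c = src[i]
        if c == "]" and not first:
            out.append("]")
            return "".join(out), i + 1
        if c == "[" and i + 1 < len(src) and src[i + 1] in ":=.":
            kind = src[i + 1]
            end = src.find(kind + "]", i + 2)
            if end == -1:
                raise UnsupportedBracket("unterminated [" + kind)
            if kind != ":":
                # [=a=] and [.x.] have no equivalent, so refuse loudly
                raise UnsupportedBracket(src[i:end + 2])
            out.append(src[i:end + 2])  # [:alpha:] etc. pass through
            i = end + 2
        elif c == "]" or c in NEEDS_ESCAPE:
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
        first = False
    raise UnsupportedBracket("unterminated bracket expression")


def translate(src):
    """Copy a POSIX ERE through, rewriting only its bracket expressions."""
    out = []
    i = 0
    while i < len(src):
        c = src[i]
        if c == "\\" and i + 1 < len(src):
            out.append(src[i:i + 2])  # escaped pair outside brackets
            i += 2
        elif c == "[":
            piece, i = translate_bracket(src, i)
            out.append(piece)
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(translate("[[]"))  # prints [\[]
```

Rejecting `[[=a=]]` outright, instead of letting it silently turn into `[=a]`, is a deliberate choice in this sketch: a hard error is easier to notice than a changed meaning.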


jade commented

2024-08-18 19:04:20 +00:00

Author

Owner

thank you so much for doing this investigation. indeed our primary worry is the introduction of new syntax and our secondary worry is the introduction of new features.

there's also some material in here that talks about different stuff than you looked at that's also relevant: https://wiki.lix.systems/books/lix-contributors/page/regexp-engine-investigation


lix-bot commented

2024-08-19 22:40:09 +00:00

Member

This issue was mentioned on Gerrit on the following CLs:

* commit message in [cl/1821](https://gerrit.lix.systems/c/lix/+/1821) ("libexpr: Replace regex engine with boost::regex")


lix-project referenced this issue from a commit

2024-08-22 07:20:21 +00:00

libexpr: Replace regex engine with boost::regex

lix-project closed this issue

2024-08-22 07:20:21 +00:00

jade referenced this issue

2024-08-22 18:33:39 +00:00

Regression in regex bug-compatibility in `builtins.match "\\.*(.*)" ".keep" == [ "keep" ]` to `[ ".keep" ]` #483

jade reopened this issue

2024-08-23 00:12:08 +00:00

Replace the regex with something consistent cross-platform #34
