Replace the regex with something consistent cross-platform #34
4 participants
lix-project/lix#34
Perhaps fork rust `regex` or such. Whatever we want. Currently the regexes are not the same cross platform, which is hilarious.

Wait does Lix not use Boost regex? Or is Boost regex seriously not cross-platform?
I don't know. All I know is there's been language visible differences in regex behaviour across platforms which is ridiculous. It may have been fixed since?
Oh no I found cursed information in my signal history: @rbt:
NOOOOOO
well. ok. this means we have to have a custom regex implementation or just vendor the glibc one. but std::regex is terrible as I recall.
i think there's three solutions here:
out of curiosity, i took a look at rust's regex, and unlike boost regex engine, it's fine with `}]` as a regular expression with no need to escape it (nixpkgs doesn't escape those two characters, which makes sense considering for the current regex engine, `\}` and `\]` are outright syntax errors)

i think rust's regex is a good regex engine myself, albeit it may be a good idea to introduce a layer to convert current regular expressions (posix ereg) to those
of course, with such a layer, boost regex could work too
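One thing such a conversion layer would have to deal with is bracket expressions: in POSIX ERE a `[` or `\` inside `[...]` is a literal, while an engine with nested character classes and backslash escapes reads them differently. A minimal sketch of that one step, assuming a hypothetical helper name `translate_bracket_expressions` (nothing here is existing Lix code, and it deliberately ignores set operators such as `&&`):

```python
def translate_bracket_expressions(pattern: str) -> str:
    """Hypothetical helper: rewrite POSIX ERE bracket expressions so that a
    parser with nested classes and backslash escapes reads them the POSIX way."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # Outside brackets a backslash escapes the next character: copy both.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch != "[":
            out.append(ch)
            i += 1
            continue
        # Start of a bracket expression.
        out.append("[")
        i += 1
        if i < n and pattern[i] == "^":
            out.append("^")
            i += 1
        if i < n and pattern[i] == "]":
            # POSIX: a leading ']' is a literal member of the set.
            out.append("\\]")
            i += 1
        while i < n and pattern[i] != "]":
            ch = pattern[i]
            if ch == "[" and i + 1 < n and pattern[i + 1] in ":=.":
                kind = pattern[i + 1]
                end = pattern.find(kind + "]", i + 2)
                if end == -1:
                    raise ValueError("unterminated [%s ... %s]" % (kind, kind))
                if kind != ":":
                    # Equivalence classes / collating symbols: assumed to have
                    # no equivalent in the target engine, so refuse loudly.
                    raise ValueError("no equivalent for " + pattern[i:end + 2])
                out.append(pattern[i:end + 2])  # e.g. [:alpha:] passes through
                i = end + 2
            elif ch in "[\\":
                # Literal in POSIX, special in the target syntax: escape it.
                out.append("\\" + ch)
                i += 1
            else:
                out.append(ch)
                i += 1
        if i == n:
            raise ValueError("unterminated bracket expression")
        out.append("]")
        i += 1
    return "".join(out)
```

Rejecting `[=a=]` instead of guessing at a rewrite is a design choice for the sketch: a silent change of meaning is the failure mode this issue is about.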
there are essentially two issues i can think of:

- different character set syntax, which can break code in some situations

  `regex` crate has a pretty unique character set syntax, providing access to set operations, most of those work fine in current parser, but behave completely differently, most of those are pretty tricky to use accidentally, admittedly, except for nested character classes

  - a regex like `[[]` will be accepted by current parser, but it won't be accepted by rust's regex, as it sees `[` as an introduction for nested character class
  - a regex like `[[=a=]]` will change its meaning from `[aA]` to `[=a]`, as rust's regex doesn't support equivalence class expressions (albeit, i quickly took a look at github, and nobody uses those with nix, probably because `[aA]` is a much more reasonable way to write this)

  also, because the new regex engine is unicode aware, `builtins.match ".*[Ω].*" "β"` won't match anymore (personally i see this as a bugfix, but in theory it can break stuff)

- new syntax, introducing forwards-compatibility hazard

  `regex` crate has a lot of other syntax, which, while i don't think it introduces incompatibilities (other than character set additions i mentioned) as the current regex syntax is quite strict in terms of what it accepts (rejecting anything outside of posix spec), it will make it trickier to migrate to anything else in the future once migrated to `regex`

  - … supports unusual features here with its set operations
  - … `\d` at all, and many other regex engines don't use unicode definitions for those
  - … `\p{Greek}` requires huge tables to work, albeit note that `regex` has a compile-time feature to disable them)
  - … `\b`, `\>`, `\<`), supporting `\>` and `\<` syntax is quite unusual, most regex engines interpret those as literal `>` and `<`
  - … `R` (crlf mode) and `U` (which swaps meaning of `*` and `*?`)

thank you so much for doing this investigation. indeed our primary worry is the introduction of new syntax and our secondary worry is the introduction of new features.
there's also some material in here that talks about different stuff than you looked at that's also relevant: https://wiki.lix.systems/books/lix-contributors/page/regexp-engine-investigation
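Going back to the `[Ω]` example from the list of issues above: the change in behaviour can be reproduced outside Nix. A minimal sketch using Python's `re` module purely as a stand-in (it is neither the engine Lix uses nor the Rust crate), on the assumption that the current match happens because the pattern is applied to UTF-8 bytes. `Ω` encodes as `CE A9` and `β` as `CE B2`, so a byte-wise `[Ω]` is the set {`CE`, `A9`} and matches the first byte of `β`. The function names are made up for illustration.

```python
import re

PATTERN = ".*[Ω].*"


def matches_bytewise(pattern: str, subject: str) -> bool:
    # Treat pattern and subject as raw UTF-8 bytes: [Ω] becomes a two-byte set.
    return re.fullmatch(pattern.encode("utf-8"), subject.encode("utf-8")) is not None


def matches_by_code_point(pattern: str, subject: str) -> bool:
    # Treat pattern and subject as sequences of code points: [Ω] is one character.
    return re.fullmatch(pattern, subject) is not None


if __name__ == "__main__":
    for subject in ("Ω", "β"):
        print(subject,
              matches_bytewise(PATTERN, subject),
              matches_by_code_point(PATTERN, subject))
```

Both modes agree on `Ω`; only the byte-wise mode accepts `β`, which is the difference a Unicode-aware engine would expose.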
This issue was mentioned on Gerrit on the following CLs:
`builtins.match "\\.*(.*)" ".keep" == [ "keep" ]` to `[ ".keep" ]` #483