Find a file
2024-08-29 19:42:37 +02:00
src initial commit 2024-08-29 19:37:06 +02:00
.envrc initial commit 2024-08-29 19:37:06 +02:00
.gitignore initial commit 2024-08-29 19:37:06 +02:00
default.nix initial commit 2024-08-29 19:37:06 +02:00
meson.build initial commit 2024-08-29 19:37:06 +02:00
package.nix initial commit 2024-08-29 19:37:06 +02:00
README.md add warning 2024-08-29 19:42:37 +02:00
shell.nix initial commit 2024-08-29 19:37:06 +02:00

Regular expressions are hard

Writing a high-quality implementation of POSIX Extended Regular Expressions does not seem easy. Ideally, the following features would be offered at the same time:

In practice, most implementations fall short of these goals, due to a variety of issues:

  • Unclear wording in the standards, leading to diverging outcomes between implementations.
  • Exponential matching time due to backtracking implementation (sometimes even mandated by non-standard syntax extensions).
  • Excessive consumption of stack memory.
  • Spurious matching failures (either indication of a non-match, or an error).
  • Incorrect capturing behaviour of parenthesised subexpressions.

Here we test a small sample of regular expressions that are known to be hard to match properly against a couple of popular regex engines.

Warning

It is not clear whether the author's interpretation of the standard is actually correct in all cases. One can still observe that implementations show different behaviour in these situations.

Run it

Build using Lix:

nix-build -A default # or libcxx, or musl

List all available engines:

result/bin/driver list

Check the specified engines (tries all if none is specified, not recommended on libstdc++ because std may crash):

result/bin/driver check [engine]…

Print the match results of the specified engines (tries all if none is specified, not recommended on libstdc++ because std may crash):

result/bin/driver results [engine]…

List of supported engines

  • Boost.Regex (boost)
  • C standard library (c)
  • Oniguruma (oniguruma, does not claim POSIX compliance)
  • PCRE2 (pcre, does not claim POSIX compliance)
  • RE2 (re2, does not claim POSIX compliance)
  • C++ standard library (std)
  • TRE (tre)