regex-is-hard/README.md
2024-08-29 19:37:06 +02:00

2 KiB

Regular expressions are hard

Writing a high-quality implementation of POSIX Extended Regular Expressions does not seem easy. Ideally, the following features would be offered at the same time:

In practice, most implementations fall short of these goals, due to a variety of issues:

  • Unclear wording in the standards, leading to diverging outcomes between implementations.
  • Exponential matching time due to backtracking implementation (sometimes even mandated by non-standard syntax extensions).
  • Excessive consumption of stack memory.
  • Spurious matching failures (either indication of a non-match, or an error).
  • Incorrect capturing behaviour of parenthesised subexpressions.

Here we test a small sample of regular expressions that are known to be hard to match properly against a couple of popular regex engines.

Run it

Build using Lix:

nix-build -A default # or libcxx, or musl

List all available engines:

result/bin/driver list

Check the specified engines (tries all if none is specified, not recommended on libstdc++ because std may crash):

result/bin/driver check [engine]…

Print the match results of the specified engines (tries all if none is specified, not recommended on libstdc++ because std may crash):

result/bin/driver results [engine]…

List of supported engines

  • Boost.Regex (boost)
  • C standard library (c)
  • Oniguruma (oniguruma, does not claim POSIX compliance)
  • PCRE2 (pcre, does not claim POSIX compliance)
  • RE2 (re2, does not claim POSIX compliance)
  • C++ standard library (std)
  • TRE (tre)