src | ||
.envrc | ||
.gitignore | ||
default.nix | ||
meson.build | ||
package.nix | ||
README.md | ||
shell.nix |
Regular expressions are hard
Writing a high-quality implementation of POSIX Extended Regular Expressions does not seem easy. Ideally, the following features would be offered at the same time:
- Strict standards compliance (see https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html and https://pubs.opengroup.org/onlinepubs/9799919799/functions/regexec.html).
- Predictable performance (polynomial in the length of the inputs, and linear in the length of the matched string).
- Limited resource usage (memory, CPU time (also related to the previous point)).
- Absence of vendor-specific syntax extensions (for portability).
In practice, most implementations fall short of these goals, due to a variety of issues:
- Unclear wording in the standards, leading to diverging outcomes between implementations.
- Exponential matching time due to backtracking implementation (sometimes even mandated by non-standard syntax extensions).
- Excessive consumption of stack memory.
- Spurious matching failures (either indication of a non-match, or an error).
- Incorrect capturing behaviour of parenthesised subexpressions.
Here we test a small sample of regular expressions that are known to be hard to match properly against a couple of popular regex engines.
Run it
Build using Lix:
nix-build -A default # or libcxx, or musl
List all available engines:
result/bin/driver list
Check the specified engines (tries all if none is specified, not recommended on libstdc++ because std
may crash):
result/bin/driver check [engine]…
Print the match results of the specified engines (tries all if none is specified, not recommended on libstdc++ because std
may crash):
result/bin/driver results [engine]…
List of supported engines
- Boost.Regex (
boost
) - C standard library (
c
) - Oniguruma (
oniguruma
, does not claim POSIX compliance) - PCRE2 (
pcre
, does not claim POSIX compliance) - RE2 (
re2
, does not claim POSIX compliance) - C++ standard library (
std
) - TRE (
tre
)