regex-is-hard/README.md

53 lines
2.2 KiB
Markdown
Raw Normal View History

2024-08-29 17:36:29 +00:00
# Regular expressions are hard
Writing a high-quality implementation of POSIX Extended Regular Expressions does not seem easy.
Ideally, the following features would be offered at the same time:
* Strict standards compliance (see https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html and https://pubs.opengroup.org/onlinepubs/9799919799/functions/regexec.html).
* Predictable performance (polynomial in the length of the inputs, and linear in the length of the matched string).
* Limited resource usage (memory, CPU time (also related to the previous point)).
* Absence of vendor-specific syntax extensions (for portability).
In practice, most implementations fall short of these goals, due to a variety of issues:
* Unclear wording in the standards, leading to diverging outcomes between implementations.
* Exponential matching time due to backtracking implementation (sometimes even mandated by non-standard syntax extensions).
* Excessive consumption of stack memory.
* Spurious matching failures (either indication of a non-match, or an error).
* Incorrect capturing behaviour of parenthesised subexpressions.
Here we test a small sample of regular expressions that are known to be hard to match properly against a couple of popular regex engines.
2024-08-29 17:42:37 +00:00
## Warning
It is not clear whether the author's interpretation of the standard is actually correct in all cases.
One can still observe that implementations show different behaviour in these situations.
2024-08-29 17:36:29 +00:00
## Run it
Build using Lix:
nix-build -A default # or libcxx, or musl
List all available engines:
result/bin/driver list
Check the specified engines (tries all if none is specified, not recommended on libstdc++ because `std` may crash):
result/bin/driver check [engine]…
Print the match results of the specified engines (tries all if none is specified, not recommended on libstdc++ because `std` may crash):
result/bin/driver results [engine]…
## List of supported engines
* Boost.Regex (`boost`)
* C standard library (`c`)
* Oniguruma (`oniguruma`, does not claim POSIX compliance)
* PCRE2 (`pcre`, does not claim POSIX compliance)
* RE2 (`re2`, does not claim POSIX compliance)
* C++ standard library (`std`)
* TRE (`tre`)