53 lines
2.2 KiB
Markdown
53 lines
2.2 KiB
Markdown
# Regular expressions are hard
|
|
|
|
Writing a high-quality implementation of POSIX Extended Regular Expressions does not seem easy.
|
|
Ideally, the following features would be offered at the same time:
|
|
|
|
* Strict standards compliance (see https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html and https://pubs.opengroup.org/onlinepubs/9799919799/functions/regexec.html).
|
|
* Predictable performance (polynomial in the length of the inputs, and linear in the length of the matched string).
|
|
* Limited resource usage (memory, CPU time (also related to the previous point)).
|
|
* Absence of vendor-specific syntax extensions (for portability).
|
|
|
|
In practice, most implementations fall short of these goals, due to a variety of issues:
|
|
|
|
* Unclear wording in the standards, leading to diverging outcomes between implementations.
|
|
* Exponential matching time due to backtracking implementation (sometimes even mandated by non-standard syntax extensions).
|
|
* Excessive consumption of stack memory.
|
|
* Spurious matching failures (either indication of a non-match, or an error).
|
|
* Incorrect capturing behaviour of parenthesised subexpressions.
|
|
|
|
Here we test a small sample of regular expressions that are known to be hard to match properly against a couple of popular regex engines.
|
|
|
|
## Warning
|
|
|
|
It is not clear whether the author's interpretation of the standard is actually correct in all cases.
|
|
One can still observe that implementations show different behaviour in these situations.
|
|
|
|
## Run it
|
|
|
|
Build using Lix:
|
|
|
|
nix-build -A default # or libcxx, or musl
|
|
|
|
List all available engines:
|
|
|
|
result/bin/driver list
|
|
|
|
Check the specified engines (tries all if none is specified, not recommended on libstdc++ because `std` may crash):
|
|
|
|
result/bin/driver check [engine]…
|
|
|
|
Print the match results of the specified engines (tries all if none is specified, not recommended on libstdc++ because `std` may crash):
|
|
|
|
result/bin/driver results [engine]…
|
|
|
|
## List of supported engines
|
|
|
|
* Boost.Regex (`boost`)
|
|
* C standard library (`c`)
|
|
* Oniguruma (`oniguruma`, does not claim POSIX compliance)
|
|
* PCRE2 (`pcre`, does not claim POSIX compliance)
|
|
* RE2 (`re2`, does not claim POSIX compliance)
|
|
* C++ standard library (`std`)
|
|
* TRE (`tre`)
|