regex implementation categorization


Glenn Fowler <gsf@research.att.com>

AT&T Labs Research - Florham Park NJ


The regex tests in categorize.dat attempt to categorize regex implementations. The tests do not address internationalization. All implementations report the leftmost match; this is omitted from the table.

LABEL    ASSOC    SUBEXPR    REP_LONGEST    BUGS
A    right    precedence    first    -
B    right    grouping    first    repeat-null  repeat-short  repeat-artifact-nomatch
D    right    grouping    first    -
G    right    grouping    first    alternation-order  repeat-null  repeat-artifact  repeat-artifact-nomatch
H    right    grouping    first    alternation-order  nomatch-match  repeat-null  repeat-artifact  repeat-artifact-nomatch
I    right    grouping    first    repeat-any  repeat-short  repeat-artifact-nomatch
J    right    precedence    last    nomatch-match  repeat-artifact  repeat-artifact-nomatch  subexpression-first
M    right    precedence    last    range-null  repeat-artifact  repeat-artifact-nomatch  subexpression-first
O    right    grouping    first    repeat-null  repeat-short  repeat-artifact-nomatch
P    right    grouping    first    alternation-order  first-match  repeat-null  repeat-artifact
R    left    precedence    last    -
S    right    grouping    first    repeat-null  repeat-short  repeat-artifact-nomatch
T    left    precedence    last    -
U    right    precedence    first    repeat-null  subexpression-first
darwin.ppc    right    grouping    first    repeat-null  repeat-short
freebsd.i386    right    grouping    first    repeat-null  repeat-short
hp.pa    right    grouping    first    repeat-artifact
ibm.risc    right    grouping    first    alternation-order  nomatch-match  repeat-artifact  repeat-artifact-nomatch
linux.i386    right    grouping    first    alternation-order  repeat-artifact  repeat-null
sgi.mips3    right    grouping    first    repeat-short
sol8.sun4    right    grouping    first    alternation-order  nomatch-match  repeat-artifact
unixware.i386    right    precedence    first    repeat-null  subexpression-first

The categories are:

LABEL
The implementation label from testregex.
ASSOC
Subpattern (or atom) associativity: either left or right. The subexpression match rule in the rationale requires right for expressions where each concatenated part is a subexpression. There is no definition for subpattern, but it would be inconsistent for any definition to require different associativity than that for subexpressions. Some claim that the BRE and ERE grammars specify left associativity, but this interpretation disregards the subexpression match rule in the rationale. The grammar can also be interpreted to support right associativity, and this interpretation is in accord with the rationale.
SUBEXPR
Subexpression semantics: precedence if subexpressions can override the default associativity; grouping if subexpressions are for repetition and regmatch_t grouping only. The subexpression match rule in the rationale requires precedence.
REP_LONGEST
How repeated subexpressions that match more than once are handled: first if the longest possible matches occur first; last if the longest possible matches occur last; unknown otherwise. The subexpression match rule in the rationale requires first.
BUGS
Miscellaneous bugs (see categorize.dat for specific examples):
alternation-order
A change in the order of subexpression alternation operands, not involved in a tie, changes regmatch_t values. Some implementations with this bug can be coaxed into missing the overall longest match.
first-match
The first of the leftmost matches, instead of the longest of the leftmost matches, is returned.
nomatch-match
A back-reference to a regmatch_t (-1,-1) value is treated as matching.
range-null
A range-repeated subexpression that matches null does not report the match at offset (0,0).
repeat-artifact
A regmatch_t value is reported for a repeated match that is not the last match.
repeat-artifact-nomatch
To prevent not matching, a regmatch_t value is reported for a repeated match that is not the last match.
repeat-null
A repeated subexpression matches the null string even though it is not the only match and is not necessary to satisfy the exact or minimum number of occurrences for an interval expression.
repeat-short
Incorrect regmatch_t values for a repeated subexpression. This may be a variant of repeat-artifact.
subexpression-first
A subexpression match takes precedence over a subpattern to its left.


Glenn Fowler
Information and Software Systems Research
AT&T Labs Research
Florham Park NJ
June 01, 2004