An Interpretation of the POSIX regex Standard
Glenn Fowler <gsf@research.att.com>
AT&T Research - Florham Park NJ
Many passages in the POSIX
regex
standard seem to be open for interpretation.
Differences between several published
implementations
of the
regex
API bear this out.
Instead of relegating these differences to the
undefined behavior
bucket, this paper proposes a resolution to each
by direct application of the standard text.
The POSIX
regex
standard is spread across four documents:
It describes
BREs
(basic regular expressions, a.k.a.,
grep(1)
style) and
EREs
(extended regular expressions, a.k.a.,
egrep(1)
style)
and how an RE of each type matches subject strings.
The standard also provides an API:
regcomp(3)
for compiling an RE, and
regexec(3)
for matching a compiled RE against a subject string.
The
regexec
API
int regexec(const regex_t* restrict preg, const char* restrict string,
size_t nmatch, regmatch_t pmatch[restrict], int eflags);
is at the center of multiple, conflicting interpretations of the standard.
These interpretations differ on the setting of the
pmatch[]
array for index values > 0.
This note presents examples that demonstrate interpretation conflicts,
and then provides standard references that,
when taken as a whole,
resolve the conflicts.
Standard references use the notation
[document:begin[-end]]
where
document
is the document letter, { A D G R }, from the table above,
begin
is the beginning line number, and
end
is the ending line number.
Line numbers are taken from the 2001 X/Open printing.
Unfortunately the online links do not display line numbers.
For example, [A:37179-37180] is the reference for the
regexec
API prototype above.
Example patterns, subject strings, and
pmatch[]
array values use the regression test notation of
testregex.
You can download the source and compile it against your favorite regex
implementation.
All of the examples in this note have been placed in the file
interpretation.dat;
you can download this file and use it as input to
testregex.
For example, the
testregex
input
specifies that the ERE pattern "a+" matched against the
subject string "xaax" yields
pmatch[0].rm_so==1
and
pmatch[0].rm_eo==3.
The example is labeled RE#01 for indexing and referencing.
:RE#02:B .\(a*\). xaax (0,4)(1,3)
specifies that the BRE pattern ".\(a*\)." matched against the subject
string "xaax" yields
pmatch[0].rm_so==0,
pmatch[0].rm_eo==4,
pmatch[1].rm_so==1,
pmatch[1].rm_eo==3.
(?,?) denotes
rm_so
and
rm_eo
values of -1, i.e., a non-match.
The first field allows additional flags that exercise all of the
REG_*
regcomp
and
regexec
flags; see
testregex(1)
or
testregex --man
for details.
Note that
tab
is the field separator in the
testregex
syntax; if you mouse snarf then make sure that
tabs
are preserved.
- [G:41]Basic Regular Expression (BRE)
-
A regular expression used by the majority of utilities that select strings
from a set of character strings.
- [G:148]Entire Regular Expression
-
The concatenated set of one or more basic regular expressions or extended
regular expressions that make up the pattern specified for string selection.
- [G:158]Extended Regular Expression (ERE)
-
A regular expression that is an alternative to the Basic Regular
Expression using a more extensive syntax, occasionally used by some utilities.
- [G:269]Pattern
-
A sequence of characters used either with regular expression notation or for
pathname expansion, as a means of selecting various character strings or
pathnames, respectively.
- [G:316]Regular Expression
-
A pattern that selects specific strings from a set of character strings.
The
regex
standard is surprisingly cavalier with terminology:
some terms are used interchangeably, some are used in a general context
in one section and a specific context in another, and some are
used without any definition whatsoever.
Acutely subject to this abuse are:
RE,
pattern,
subpattern,
expression,
and
subexpression.
In particular,
subpattern
and
subexpression
are central to the description of the matching algorithm and how
pmatch[]
is assigned.
Any interpretation of the
regex
standard involving these terms, absent a precise and accurate definition
for each, is useless.
subexpression
appears 70 times, and each reference is in the context of parenthesis grouping:
- [D:5909-5911]
-
For example, matching the BRE "\(.*\).*" against "abcdef" , the
subexpression "(\1)" is "abcdef" , and matching the BRE
"\(a*\)*" against "bc" , the subexpression "(\1)" is the null
string.
- [D:5984-5988]
-
The asterisk shall be special except when used: As the first
character of a subexpression (after an initial '^' , if any);
- [D:6094-6097]
-
A subexpression can be defined within a BRE by enclosing it
between the character pairs "\(" and "\)" . Subexpressions can
be arbitrarily nested.
- [D:6100-6109]
-
The character 'n' shall be a digit from 1 through 9, specifying
the nth subexpression (the one that begins with the nth "\("
from the beginning of the pattern and ends with the
corresponding paired "\)" ). The expression is invalid if less
than n subexpressions precede the '\n' . For example, the
expression "\(.*\)\1$" matches a line consisting of two
adjacent appearances of the same string, and the expression
"\(a\)*\1" fails to match 'a' . When the referenced
subexpression matched more than one string, the back-referenced
expression shall refer to the last matched string. If the
subexpression referenced by the back-reference matches more
than one string because of an asterisk ( '*' ) or an interval
expression (see item (5)), the back-reference shall match the
last (rightmost) of these strings.
- [D:6110-6112]
-
When a BRE matching a single character, a subexpression, or a
back-reference is followed by the special character asterisk ('*' ),
together with that asterisk it shall match what zero or
more consecutive occurrences of the BRE would match.
- [D:6114-6117]
-
When a BRE matching a single character, a subexpression, or a
back-reference is followed by an interval expression of the
format "\{m\}" , "\{m,\}" , or "\{m,n\}" , together with that
interval expression it shall match what repeated consecutive
occurrences of the BRE would match. "\{m,n\}" , together with
that interval expression it shall match what repeated
consecutive occurrences of the BRE would match.
- [D:6127-6129]
-
A subexpression repeated by an asterisk ('*') or an interval expression
shall not match a null expression unless this is the only match for the
repetition or it is necessary to satisfy the exact or minimum number of
occurrences for the interval expression.
- [D:6136]
-
Subexpressions/back-references \(\) \n
- [D:6145-6151]
-
The implementation may treat the circumflex as an anchor when
used as the first character of a subexpression. The circumflex
shall anchor the
expression (or optionally subexpression) to the beginning of a
string; only sequences starting at the first character of a
string shall be matched by the BRE. For example, the BRE "^ab"
matches "ab" in the string "abcdef" , but fails to match in the
string "cdefab" . The BRE "\(^ab\)" may match the former
string. A portable BRE shall escape a leading circumflex in a
subexpression to match a literal circumflex.
- [D:6152-6156]
-
A dollar sign ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
dollar sign as an anchor when used as the last character of a
subexpression. The dollar sign shall anchor the expression (or
optionally subexpression) to the end of the string being matched;
the dollar sign can be said to match the end-of-string following
the last character.
- [D:6265-6270]
-
A circumflex ( '^' ) outside a bracket expression shall anchor
the expression or subexpression it begins to the beginning of a
string; such an expression or subexpression can match only a
sequence starting at the first character of a string. For
example, the EREs "^ab" and "(^ab)" match "ab" in the string
"abcdef" , but fail to match in the string "cdefab" , and the
ERE "a^b" is valid, but can never match because the 'a'
prevents the expression "^b" from matching starting at the
first character.
- [D:6271-6276]
-
A dollar sign ( '$' ) outside a bracket expression shall anchor
the expression or subexpression it ends to the end of a string;
such an expression or subexpression can match only a sequence
ending at the last character of a string. For example, the EREs
"ef$" and "(ef$)" match "ef" in the string "abcdef" , but fail
to match in the string "cdefab" , and the ERE "e$f" is valid,
but can never match because the 'f' prevents the expression
"e$" from matching ending at the last character.
- [R:2359-2370]
-
It is possible to determine what strings correspond to
subexpressions by recursively applying the leftmost longest
rule to each subexpression, but only with the proviso that the
overall match is leftmost longest. For example, matching
"\(ac*\)c*d[ac]*\1" against acdacaaa matches acdacaaa (with
\1=a); simply matching the longest match for "\(ac*\)" would
yield \1=ac, but the overall match would be smaller (acdac).
Conceptually, the implementation must examine every possible
match and among those that yield the leftmost longest total
matches, pick the one that does the longest match for the
leftmost subexpression, and so on. Note that this means that
matching by subexpressions is context-dependent: a
subexpression within a larger RE may match a different string
from the one it would match as an independent RE, and two
instances of the same subexpression within the same larger RE
may match different lengths even in similar sequences of
characters. For example, in the ERE "(a.*b)(a.*b)" , the two
identical subexpressions would match four and six characters,
respectively, of accbaccccb.
- [R:2512-2520]
-
The limit of nine back-references to subexpressions in the RE
is based on the use of a single-digit identifier; increasing
this to multiple digits would break historical applications.
This does not imply that only nine subexpressions are allowed
in REs. The following is a valid BRE with ten subexpressions:
\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*
The standard developers regarded the common historical
behavior, which supported "\n*" , but not "\n\{min,max\}" ,
"\(...\)*" , or "\(...\)\{min,max\}" , as a non-intentional
result of a specific implementation, and they supported both
duplication and interval expressions following subexpressions
and back-references.
- [R:2537-2544]
-
However, one relatively uncommon case was changed to allow an
extension used on some implementations. Historically, the BREs
"^foo" and "\(^foo\)" did not match the same string, despite
the general rule that subexpressions and entire BREs match the
same strings. To increase consensus, IEEE Std 1003.1-2001 has
allowed an extension on some implementations to treat these two
cases in the same way by declaring that anchoring may occur at
the beginning or end of a subexpression. Therefore, portable
BREs that require a literal circumflex at the beginning or a
dollar sign at the end of a subexpression must escape them.
Note that a BRE such as "a\(^bc\)" will either match "a^bc" or
nothing on different systems under the rules.
- [R:2549-2554]
-
Some implementations have extended the BRE syntax to add
alternation. For example, the subexpression "\(foo$\|bar\)"
would match either "foo" at the end of the string or "bar"
anywhere. The extension is triggered by the use of the
undefined "\|" sequence. Because the BRE is undefined for
portable scripts, the extending system is free to make other
assumptions, such that the '$' represents the end-of-line
anchor in the middle of a subexpression. If it were not for the
extension, the '$' would match a literal dollar sign under the
rules.
- [R:2617-2620]
-
The removal of the Back_open_paren Back_close_paren option from
the nondupl_RE specification is the result of PASC
Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993
standard. Although the grammar required support for null
subexpressions, this section does not describe the meaning of,
and historical practice did not support, this construct.
- [A:37188]
-
size_t re_nsub Number of parenthesized subexpressions
- [A:37206-37208]
-
If the REG_NOSUB flag was not set in cflags, then regcomp()
shall set re_nsub to the number of parenthesized subexpressions
(delimited by "\(\)" in basic regular expressions or "()" in
extended regular expressions) found in pattern.
- [A:37220-37257]
-
If nmatch is 0 or REG_NOSUB was set in the cflags argument to
regcomp(), then regexec() shall ignore the pmatch argument.
Otherwise, the application shall ensure that the pmatch
argument points to an array with at least nmatch elements, and
regexec() shall fill in the elements of that array with offsets
of the substrings of string that correspond to the
parenthesized subexpressions of pattern: pmatch[i].rm_so
shall be the byte offset of the beginning and pmatch[i].rm_eo
shall be one greater than the byte offset of the end of
substring i. (Subexpression i begins at the ith matched open
parenthesis, counting from 1.) Offsets in pmatch[0] identify
the substring that corresponds to the entire regular
expression. Unused elements of pmatch up to pmatch[nmatch-1]
shall be filled with -1. If there are more than nmatch
subexpressions in pattern ( pattern itself counts as a
subexpression), then regexec() shall still do the match, but
shall record only the first nmatch substrings.
When matching a basic or extended regular expression, any given
parenthesized subexpression of pattern might participate in the
match of several different substrings of string, or it might
not match any substring even though the pattern as a whole did
match. The following rules shall be used to determine which
substrings to report in pmatch when matching regular
expressions:
-
If subexpression i in a regular expression is not contained
within another subexpression, and it participated in the match
several times, then the byte offsets in pmatch[i] shall
delimit the last such match.
-
If subexpression i is not contained within another
subexpression, and it did not participate in an otherwise
successful match, the byte offsets in pmatch[i] shall be -1. A
subexpression does not participate in the match when:
'*' or "\{\}" appears immediately after the
subexpression in a basic regular expression, or '*' ,
'?' , or "{}" appears immediately after the
subexpression in an extended regular expression, and
the subexpression did not match (matched 0 times)
or:
'|' is used in an extended regular expression to select
this subexpression or another, and the other
subexpression matched.
-
If subexpression i is contained within another subexpression
j, and i is not contained within any other subexpression that
is contained within j, and a match of subexpression j is
reported in pmatch[j], then the match or non-match of
subexpression i reported in pmatch[i] shall be as described in
1. and 2. above, but within the substring reported in pmatch[
j] rather than the whole string. The offsets in pmatch[i] are
still relative to the start of string.
-
If subexpression i is contained in subexpression j, and the
byte offsets in pmatch[j] are -1, then the pointers in pmatch[
i] shall also be -1.
-
If subexpression i matched a zero-length string, then both
byte offsets in pmatch[i] shall be the byte offset of the
character or null terminator immediately following the
zero-length string.
- [A:37363-37366]
-
The regexec() function must fill in all nmatch elements of
pmatch, where nmatch and pmatch are supplied by the
application, even if some elements of pmatch do not correspond
to subexpressions in pattern. The application writer should
note that there is probably no reason for using a value of
nmatch that is larger than preg-> re_nsub+1.
- [A:37407-37413]
-
The number of subexpressions in the RE is reported in re_nsub
in preg. With this change to regexec(), consideration was given
to dropping the REG_NOSUB flag since the user can now specify
this with a zero nmatch argument to regexec(). However, keeping
REG_NOSUB allows an implementation to use a different (perhaps
more efficient) algorithm if it knows in regcomp() that no
subexpressions need be reported. The implementation is only
required to fill in pmatch if nmatch is not zero and if
REG_NOSUB is not specified.
This sentence is as close as the standard gets to a definition:
- [A:37225-37226]
-
Subexpression i begins at the ith matched open parenthesis, counting from 1.
Using nonterminals from the BRE [D:6371-6731] and ERE [D:6452-6452] grammar
productions (text not listed in this document) yields the following:
- DEFINITION
-
A
subexpression
corresponds to the
Back_open_paren RE_expression Back_close_paren
form of the
nondupl_RE
BRE grammar production or
the
'(' extended_reg_exp ')'
form of the
ERE_expression
ERE grammar production.
Subexpression i begins at the ith matched open parenthesis
(Back_open_paren
for BREs and '(' for EREs),
starting from the left and counting from 1.
Subexpression 0 is the entire RE.
This definition and the subexpression match rule [R:2359-2370] can be used to
to examine a class of EREs where the top level catenation operands are
subexpressions.
(A top level subexpression is not contained in any other subexpression
except subexpression 0.)
The subexpression match rule in pseudo code is:
-
determine the longest of the leftmost matches for subexpression-0
[R:2359-2361]
-
for 1<=i<=re_nsub
determine the longest match for
subexpression-i
consistent with the matches already determined for
subexpression-j,
0<=j<i.
[R:2359-2370] [A:37235-37257]
For example, given
:RE#03:E (a?)((ab)?) ab (0,2)(0,0)(0,2)(0,2)
the subexpressions are:
subexpression-0 (a?)((ab)?)
subexpression-1 (a?)
subexpression-2 ((ab)?)
subexpression-3 (ab)
The longest of the leftmost matches for subexpression-0 is (0,2).
The longest match for subexpression-1, consistent with the match
for subexpression-0, is (0,0); otherwise if it had matched (0,1) then
subexpression-2 would not match and the subexpression-0 match would be
limited to (0,1).
The longest match for subexpression-2, consistent with the matches
for subexpression-0 and subexpression-1, is (0,2).
The longest match for subexpression-3, consistent with the matches
for subexpression-0, subexpression-1 and subexpression-2, is (0,2).
This table illustrates the matching:
subexpr pattern match
0 (a?)((ab)?) (0,2)
1 (a?) (0,0)
2 ((ab)?) (0,2)
3 (ab) (0,2)
RE#04 is a similar example that exposes the associativity of subexpression
concatenation:
:RE#04:E (a?)((ab)?)(b?) ab (0,2)(0,1)(1,1)(?,?)(1,2)
subexpr pattern match
0 (a?)((ab)?)(b?) (0,2)
1 (a?) (0,1)
2 ((ab)?) (1,1)
3 (ab) (?,?)
4 (b?) (1,2)
[R:2363-2365] also shows that parenthesis can be used to alter the
order of matching:
:RE#05:E ((a?)((ab)?))(b?) ab (0,2)(0,2)(0,0)(0,2)(0,2)(2,2)
subexpr pattern match
0 ((a?)((ab)?))(b?) (0,2)
1 ((a?)((ab)?)) (0,2)
2 (a?) (0,0)
3 ((ab)?) (0,2)
4 (ab) (0,2)
5 (b?) (2,2)
In RE#05 the extra parenthesis (around subexpression-1 and subexpression-2 in
RE#04) form a new subexpression-1, and change the
match for the last subexpression
(b?)
to (2,2) (from (1,2) in RE#04.)
:RE#06:E (a?)(((ab)?)(b?)) ab (0,2)(0,1)(1,2)(1,1)(?,?)(1,2)
subexpr pattern match
0 (a?)(((ab)?)(b?)) (0,2)
1 (a?) (0,1)
2 (((ab)?)(b?)) (1,2)
3 ((ab)?) (1,1)
4 (ab) (?,?)
5 (b?) (1,2)
In RE#06 the extra parenthesis pair forces right associativity and results
in the same match of (1,2) for the last subexpression
(b?)
as in RE#04.
These examples show that:
- PROPERTY
-
Subexpression grouping can alter the precedence of concatenation.
- PROPERTY
-
Subexpression concatenation is right associative.
The following examples examine replicated subexpressions.
:RE#07:E (.?) x (0,1)(0,1)
:RE#08:E (.?){1} x (0,1)(0,1)
:RE#09:E (.?)(.?) x (0,1)(0,1)(1,1)
:RE#10:E (.?){2} x (0,1)(1,1)
:RE#11:E (.?)* x (0,1)(0,1)
[D:6227-6234] specifies that RE#07 and RE#08 are equivalent, and that
RE#09 and RE#10 are equivalent, and
[D:6217-6219] specifies that RE#09 and RE#11 are equivalent.
- [D:6227-6234]
-
When an ERE matching a single character or an ERE enclosed in
parentheses is followed by an interval expression of the format "{m}" ,
"{m,}" , or "{m,n}" , together with that interval expression it shall
match what repeated consecutive occurrences of the ERE would match. The
values of m and n are decimal integers in the range 0 <= m<= n<=
{RE_DUP_MAX}, where m specifies the exact or minimum number of
occurrences and n specifies the maximum number of occurrences. The
expression "{m}" matches exactly m occurrences of the preceding ERE,
"{m,}" matches at least m occurrences, and "{m,n}" matches any number
of occurrences between m and n, inclusive.
- [D:6217-6219]
-
When an ERE matching a single character or an ERE enclosed in
parentheses is followed by the special character asterisk ( '*' ),
together with that asterisk it shall match what zero or more
consecutive occurrences of the ERE would match.
In RE#09 subexpression-1 matches (0,1), leaving the null string at (1,1) for
subexpression-2.
In RE#10 the first iteration of subexpression-1 matches (0,1), the same
as subexpression-1 in RE#09, and the second iteration of subexpression-1
matches (1,1), the same as subexpression-2 in RE#09.
RE#07 and RE#08 show that only one iteration is needed to match the subject
string, so the match in RE#11 requires only one iteration, and as such is the
last iteration of [D:6107-6109] [A:37235-37237].
RE#10 and RE#11 also illustrate [D:6127-6129] [D:6239-6241], which
specify that a repeated RE matches the null string only if it is the only
match (not this case) or if it is necessary to satisfy an interval expression
minimum (2 in this case.)
- [D:6239-6241]
-
An ERE matching a single character repeated by an '*' , '?' , or an
interval expression shall not match a null expression unless this is
the only match for the repetition or it is necessary to satisfy the
exact or minimum number of occurrences for the interval expression.
The following examples dig deeper into replicated subexpressions.
:RE#12:E (.?.?) xxx (0,2)(0,2)
:RE#13:E (.?.?){1} xxx (0,2)(0,2)
:RE#14:E (.?.?)(.?.?) xxx (0,3)(0,2)(2,3)
:RE#15:E (.?.?){2} xxx (0,3)(2,3)
:RE#16:E (.?.?)(.?.?)(.?.?) xxx (0,3)(0,2)(2,3)(3,3)
:RE#17:E (.?.?){3} xxx (0,3)(3,3)
:RE#18:E (.?.?)* xxx (0,3)(2,3)
Here RE#14 shows that only two iterations are needed for a complete match,
making the last iteration match for RE#18 (2,3), since the first
iteration matched (0,2), as in RE#14.
The term
subpattern
appears exactly once:
- [D:5907-5908]
-
Consistent with the whole match being the longest of the leftmost matches,
each subpattern, from left to right, shall match the longest possible string.
Consider RE#04 and RE#05 again:
:RE#04:E (a?)((ab)?)(b?) ab (0,2)(0,1)(1,1)(?,?)(1,2)
:RE#05:E ((a?)((ab)?))(b?) ab (0,2)(0,2)(0,0)(0,2)(0,2)(2,2)
If a subpattern were an entity that combined adjacent subexpressions,
e.g.,
(a?)((ab)?)
in RE#04, then [D:5907-5908] would violate [R:2359-2370].
Similarly, if a subpattern were an entity that "went inside" subexpressions,
e.g.,
(a?)
in RE#05, then again [D:5907-5908] would violate [R:2359-2370].
In other words, a subpattern can be neither larger than nor smaller than
a subexpression;
a subpattern must be a grammatical entity equivalent to a subexpression.
This corresponds to the nonterminal
nondupl_RE
in the BRE grammar; there is no direct correspondence to a nonterminal
in the ERE grammar.
However, if the optional duplication operator (*,+,?,range) is included then
subpattern corresponds to
simple_RE
in the BRE grammar and
ERE_expression
in the ERE grammar, and both [D:5907-5908] and [R:2359-2370] are satisfied.
- DEFINITION
-
A
subpattern
corresponds to the
simple_RE
nonterminal in the BRE grammar or the
ERE_expression
nonterminal in the ERE grammar.
This means that subexpressions and subpatterns are of equal importance
in RE matching.
Also note that any other definition for subpattern will put
[D:5907-5908] in direct conflict with [R:2359-2370].
RE#19, RE#20 and RE#21 examine the relationship between subexpression
and subpattern:
:RE#19:E a?((ab)?)(b?) ab (0,2)(1,1)(?,?)(1,2)
:RE#20:E (a?)((ab)?)b? ab (0,2)(0,1)(1,1)(?,?)
:RE#21:E a?((ab)?)b? ab (0,2)(1,1)(?,?)
These are all variations of RE#04.
Other than subexpression renumbering, the match for the subexpression
((ab)?)
must be the same in RE#04, RE#19, RE#20 and RE#21.
a?
is a subpattern in RE#19 and RE#21, of equal matching importance to
(a?)
in RE#04, and
b?
is a subpattern in RE#20 and RE#21, of equal matching
importance to
(b?)
in RE#04.
The remaining examples explore dark corners of the standard
and implementations.
Although the differences between some of the examples are subtle,
for some implementations it may mean the difference between an answer and
a core dump.
In RE#22 subexpression
(a*)
matches the null string at (0,0), and continues to match at that position
until the minimal range count is satisfied.
:RE#22:E (a*){2} xxxxx (0,0)(0,0)
RE#23 through RE#27 expose implementations that sometimes do
first match
for alternation within subexpressions.
Some implementations erroneously match the first iteration of
subexpression-1 in RE#24 through RE#27 to (0,1).
RE#27 is equivalent to RE#26; the match requires two iterations, the first
matching (0,2) and the last matching (2,3).
:RE#23:E (ab?)(b?a) aba (0,3)(0,2)(2,3)
:RE#24:E (a|ab)(ba|a) aba (0,3)(0,2)(2,3)
:RE#25:E (a|ab|ba) aba (0,2)(0,2)
:RE#26:E (a|ab|ba)(a|ab|ba) aba (0,3)(0,2)(2,3)
:RE#27:E (a|ab|ba)* aba (0,3)(2,3)
RE#28 through RE#33 expose implementations that report short matches
for some repeated subexpressions.
Some implementations report incorrect matches for
subexpression-1 in RE#30 and RE#33.
:RE#28:E (aba|a*b) ababa (0,3)(0,3)
:RE#29:E (aba|a*b)(aba|a*b) ababa (0,5)(0,2)(2,5)
:RE#30:E (aba|a*b)* ababa (0,5)(2,5)
:RE#31:E (aba|ab|a) ababa (0,3)(0,3)
:RE#32:E (aba|ab|a)(aba|ab|a) ababa (0,5)(0,2)(2,5)
:RE#33:E (aba|ab|a)* ababa (0,5)(2,5)
RE#34 through RE#36 expose implementations that report subexpression matches
for earlier iterations of the subexpression.
Some implementations report a match for subexpression-2 in RE#36
while reporting the (2,3) match for subexpression-1: clearly a bug.
:RE#34:E (a(b)?) aba (0,2)(0,2)(1,2)
:RE#35:E (a(b)?)(a(b)?) aba (0,3)(0,2)(1,2)(2,3)(?,?)
:RE#36:E (a(b)?)+ aba (0,3)(2,3)(?,?)
RE#37 and RE#38 expose implementations that give priority to subexpression
matching over subpattern matching.
:RE#37:E (.*)(.*) xx (0,2)(0,2)(2,2)
:RE#38:E .*(.*) xx (0,2)(2,2)
RE#39 through RE#41 expose implementations that treat explicit vs. implicit
subexpression repetition differently.
This is a theme common to many of the previous examples.
Again, the subexpression in RE#41 requires two iterations to match,
and the second iteration matches (5,7), as illustrated by RE#40.
:RE#39:E (a.*z|b.*y) azbazby (0,5)(0,5)
:RE#40:E (a.*z|b.*y)(a.*z|b.*y) azbazby (0,7)(0,5)(5,7)
:RE#41:E (a.*z|b.*y)* azbazby (0,7)(5,7)
RE#42 is another
first match
test.
Some implementations erroneously report a match of (0,1) for subexpression-1.
:RE#42:E (.|..)(.*) ab (0,2)(0,2)(2,2)
RE#43 through RE#45 require only one iteration of subexpression-1 to
match the entire subject string.
RE#45 exposes three separate bugs in the implementations that were tested.
The most common was
over iteration,
where subexpression-1 is matched for a second iteration to the null string
at (3,3).
:RE#43:E ((..)*(...)*) xxx (0,3)(0,3)(?,?)(0,3)
:RE#44:E ((..)*(...)*)((..)*(...)*) xxx (0,3)(0,3)(?,?)(0,3)(3,3)(?,?)(?,?)
:RE#45:E ((..)*(...)*)* xxx (0,3)(0,3)(?,?)(0,3)
RE#46 through RE#82 are nasty;
backreferences are intuitive neither for the implementor nor the user.
RE#49, RE#53, RE#67 and RE#68 illustrate the second part of the
subpattern
rule:
- [D:5908-5909]
-
For this purpose, a null string shall be considered to be longer than
no match at all.
RE#53 requires close examination to see why the match is (0,2)(1,1)(2,2)
instead of (0,2)(0,1)(?,?).
The match of (0,1) for subexpression-1 is longer than (1,1), but
subexpression-1 can be repeated, and that second iteration allows
subexpression-2 to match (2,2), which is longer than (?,?) by [D:5908-5909].
:RE#46:B \(a\{0,1\}\)*b\1 ab (0,2)(1,1)
:RE#47:B \(a*\)*b\1 ab (0,2)(1,1)
:RE#48:B \(a*\)b\1* ab (0,2)(0,1)
:RE#49:B \(a*\)*b\1* ab (0,2)(1,1)
:RE#50:B \(a\{0,1\}\)*b\(\1\) ab (0,2)(1,1)(2,2)
:RE#51:B \(a*\)*b\(\1\) ab (0,2)(1,1)(2,2)
:RE#52:B \(a*\)b\(\1\)* ab (0,2)(0,1)(?,?)
:RE#53:B \(a*\)*b\(\1\)* ab (0,2)(1,1)(2,2)
:RE#54:B \(a\{0,1\}\)*b\1 aba (0,3)(0,1)
:RE#55:B \(a*\)*b\1 aba (0,3)(0,1)
:RE#56:B \(a*\)b\1* aba (0,3)(0,1)
:RE#57:B \(a*\)*b\1* aba (0,3)(0,1)
:RE#58:B \(a*\)*b\(\1\)* aba (0,3)(0,1)(2,3)
:RE#59:B \(a\{0,1\}\)*b\1 abaa (0,3)(0,1)
:RE#60:B \(a*\)*b\1 abaa (0,3)(0,1)
:RE#61:B \(a*\)b\1* abaa (0,4)(0,1)
:RE#62:B \(a*\)*b\1* abaa (0,4)(0,1)
:RE#63:B \(a*\)*b\(\1\)* abaa (0,4)(0,1)(3,4)
:RE#64:B \(a\{0,1\}\)*b\1 aab (0,3)(2,2)
:RE#65:B \(a*\)*b\1 aab (0,3)(2,2)
:RE#66:B \(a*\)b\1* aab (0,3)(0,2)
:RE#67:B \(a*\)*b\1* aab (0,3)(2,2)
:RE#68:B \(a*\)*b\(\1\)* aab (0,3)(2,2)(3,3)
:RE#69:B \(a\{0,1\}\)*b\1 aaba (0,4)(1,2)
:RE#70:B \(a*\)*b\1 aaba (0,4)(1,2)
:RE#71:B \(a*\)b\1* aaba (0,3)(0,2)
:RE#72:B \(a*\)*b\1* aaba (0,4)(1,2)
:RE#73:B \(a*\)*b\(\1\)* aaba (0,4)(1,2)(3,4)
:RE#74:B \(a\{0,1\}\)*b\1 aabaa (0,4)(1,2)
:RE#75:B \(a*\)*b\1 aabaa (0,5)(0,2)
:RE#76:B \(a*\)b\1* aabaa (0,5)(0,2)
:RE#77:B \(a*\)*b\1* aabaa (0,5)(0,2)
:RE#78:B \(a*\)*b\(\1\)* aabaa (0,5)(0,2)(3,5)
:RE#79:B \(x\)*a\1 a NOMATCH
:RE#80:B \(x\)*a\1* a (0,1)(?,?)
:RE#81:B \(x\)*a\(\1\) a NOMATCH
:RE#82:B \(x\)*a\(\1\)* a (0,1)(?,?)(?,?)
:RE#83:E (aa(b(b))?)+ aabbaa (0,6)(4,6)(?,?)(?,?)
:RE#84:E (a(b)?)+ aba (0,3)(2,3)(?,?)
:RE#85:E ([ab]+)([bc]+)([cd]*) abcd (0,4)(0,2)(2,3)(3,4)
:RE#86:B \([ab]*\)\([bc]*\)\([cd]*\)\1 abcdaa (0,5)(0,1)(1,3)(3,4)
:RE#87:B \([ab]*\)\([bc]*\)\([cd]*\)\1 abcdab (0,6)(0,2)(2,3)(3,4)
:RE#88:B \([ab]*\)\([bc]*\)\([cd]*\)\1* abcdaa (0,6)(0,1)(1,3)(3,4)
:RE#89:B \([ab]*\)\([bc]*\)\([cd]*\)\1* abcdab (0,6)(0,2)(2,3)(3,4)
:RE#90:E ^(A([^B]*))?(B(.*))? Aa (0,2)(0,2)(1,2)
:RE#91:E ^(A([^B]*))?(B(.*))? Bb (0,2)(?,?)(?,?)(0,2)(1,2)
:RE#92:B .*\([AB]\).*\1 ABA (0,3)(0,1)
:RE#93:B$ [^A]*A \nA (0,2)
It is possible to use the 2001 issue of the POSIX
regex
standard,
with the addition of one sentence,
to resolve the interpretation differences that have surfaced since 1995.
That key sentence is a precise and consistent definition for the term
subpattern.
By noting the relationship between
subpatterns
and
subexpressions,
the proposed definition is shown to be the only one that can be
consistent with all parts of the standard.
|
|
Glenn Fowler |
|
|
Information and Software Systems Research |
|
|
AT&T Labs Research |
|
|
Florham Park NJ |
|
|
January 2003 |
|