<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd">
<HTML>
<HEAD>
<META name="generator" content="mm2html (AT&amp;T Labs Research) 2005-10-15">
<META name="keywords" content="regex implementation categorization">
<TITLE> ../re/re-categorize.mm mm document </TITLE>
<META name="author" content="gsf">
</HEAD>
<BODY bgcolor=white link=slateblue vlink=teal >
<TABLE border=0 align=center width=96%>
<TBODY><TR><TD valign=top align=left>
<!--INDEX--><!--/INDEX-->
<P>
<HR>
<CENTER>
<H3><CENTER><FONT color=red><FONT face=courier>regex implementation categorization</FONT></FONT></CENTER></H3>
<BR>Glenn Fowler <SMALL>&lt;<A href=mailto:gsf@research.att.com>gsf@research.att.com</A>&gt;</SMALL>
<P><I>AT&amp;T Labs Research - Florham Park NJ</I>
</CENTER>
<P><HR><P>
The <STRONG>regex</STRONG> tests in
<A href="http://web.archive.org/web/20080726034626id_/http://www.research.att.com/~gsf/testregex/categorize.dat">categorize.dat</A>
attempt to categorize <STRONG>regex</STRONG> implementations.
The tests do not address internationalization.
All implementations report the leftmost match, so that property is omitted from the table.
<P></P><TABLE border=0 frame=void rules=none width=100%><TBODY><TR><TD>
<TABLE align=center bgcolor=papayawhip border=0 bordercolor=white cellpadding=2 cellspacing=2 frame=void rules=none >
<TBODY>
<TR><TD align=center>LABEL </TD><TD align=center> ASSOC </TD><TD align=center> SUBEXPR </TD><TD align=center> REP_LONGEST </TD><TD align=center> BUGS</TD></TR>
<TR><TD align=center>A </TD><TD align=center> right </TD><TD align=center> precedence </TD><TD align=center> first </TD><TD align=center> -</TD></TR>
<TR><TD align=center>B </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-null repeat-short repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>D </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> -</TD></TR>
<TR><TD align=center>G </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order repeat-null repeat-artifact repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>H </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order nomatch-match repeat-null repeat-artifact repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>I </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-any repeat-short repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>J </TD><TD align=center> right </TD><TD align=center> precedence </TD><TD align=center> last </TD><TD align=center> nomatch-match repeat-artifact repeat-artifact-nomatch subexpression-first</TD></TR>
<TR><TD align=center>M </TD><TD align=center> right </TD><TD align=center> precedence </TD><TD align=center> last </TD><TD align=center> range-null repeat-artifact repeat-artifact-nomatch subexpression-first</TD></TR>
<TR><TD align=center>O </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-null repeat-short repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>P </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order first-match repeat-null repeat-artifact</TD></TR>
<TR><TD align=center>R </TD><TD align=center> left </TD><TD align=center> precedence </TD><TD align=center> last </TD><TD align=center> -</TD></TR>
<TR><TD align=center>S </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-null repeat-short repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>T </TD><TD align=center> left </TD><TD align=center> precedence </TD><TD align=center> last </TD><TD align=center> -</TD></TR>
<TR><TD align=center>U </TD><TD align=center> right </TD><TD align=center> precedence </TD><TD align=center> first </TD><TD align=center> repeat-null subexpression-first</TD></TR>
<TR><TD align=center>darwin.ppc </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-null repeat-short</TD></TR>
<TR><TD align=center>freebsd.i386 </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-null repeat-short</TD></TR>
<TR><TD align=center>hp.pa </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-artifact</TD></TR>
<TR><TD align=center>ibm.risc </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order nomatch-match repeat-artifact repeat-artifact-nomatch</TD></TR>
<TR><TD align=center>linux.i386 </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order repeat-artifact repeat-null</TD></TR>
<TR><TD align=center>sgi.mips3 </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> repeat-short</TD></TR>
<TR><TD align=center>sol8.sun4 </TD><TD align=center> right </TD><TD align=center> grouping </TD><TD align=center> first </TD><TD align=center> alternation-order nomatch-match repeat-artifact</TD></TR>
<TR><TD align=center>unixware.i386 </TD><TD align=center> right </TD><TD align=center> precedence </TD><TD align=center> first </TD><TD align=center> repeat-null subexpression-first</TD></TR>
</TBODY></TABLE></TD></TR></TBODY></TABLE>
<P>
The categories are:
<DL COMPACT>
<DL COMPACT>
<DT><STRONG>LABEL</STRONG><DD>
The implementation label from
<A href="http://web.archive.org/web/20080726034626id_/http://www.research.att.com/~gsf/testregex/">testregex</A>.
<DT><STRONG>ASSOC</STRONG><DD>
Subpattern (or atom) associativity: either <STRONG>left</STRONG> or <STRONG>right</STRONG>.
The subexpression match rule in the rationale requires <STRONG>right</STRONG>
for expressions where each concatenated part is a subexpression.
There is no definition for <EM>subpattern</EM>,
but it would be inconsistent for any definition to require a different
associativity from that for subexpressions.
Some claim that the BRE and ERE grammars specify <STRONG>left</STRONG>
associativity, but this interpretation disregards
the subexpression match rule in the rationale.
The grammar can also be interpreted to support <STRONG>right</STRONG>
associativity, and this interpretation is in accord with the rationale.
<DT><STRONG>SUBEXPR</STRONG><DD>
Subexpression semantics:
<STRONG>precedence</STRONG> if subexpressions can override the default associativity;
<STRONG>grouping</STRONG> if subexpressions serve only for repetition and
<STRONG>regmatch_t</STRONG> grouping.
The subexpression match rule in the rationale requires <STRONG>precedence</STRONG>.
<DT><STRONG>REP_LONGEST</STRONG><DD>
How repeated subexpressions that match more than once are handled:
<STRONG>first</STRONG> if the longest possible matches occur first;
<STRONG>last</STRONG> if the longest possible matches occur last;
<STRONG>unknown</STRONG> otherwise.
The subexpression match rule in the rationale requires <STRONG>first</STRONG>.
<DT><STRONG>BUGS</STRONG><DD>
Miscellaneous bugs (see
<A href="http://web.archive.org/web/20080726034626id_/http://www.research.att.com/~gsf/testregex/categorize.dat">categorize.dat</A>
for specific examples):
<DL COMPACT>
<DL COMPACT>
<DT><STRONG>alternation-order</STRONG><DD>
Changing the order of subexpression alternation operands that are
<EM>not involved in a tie</EM>
changes the reported <STRONG>regmatch_t</STRONG> values.
Some implementations with this bug can be coaxed into missing the
overall longest match.
<DT><STRONG>first-match</STRONG><DD>
The first of the leftmost matches, instead of the longest of the
leftmost matches, is returned.
<DT><STRONG>nomatch-match</STRONG><DD>
A back-reference to a subexpression whose <STRONG>regmatch_t</STRONG>
value is (-1,-1), meaning it did not participate in the match, is treated as matching.
<DT><STRONG>range-null</STRONG><DD>
A range-repeated subexpression that matches null does not report the match
at offset (0,0).
<DT><STRONG>repeat-artifact</STRONG><DD>
A <STRONG>regmatch_t</STRONG>
value is reported for a repeated match that is not the last match.
<DT><STRONG>repeat-artifact-nomatch</STRONG><DD>
To avoid failing the overall match, a <STRONG>regmatch_t</STRONG>
value is reported for a repeated match that is not the last match.
<DT><STRONG>repeat-null</STRONG><DD>
A repeated subexpression matches the null string even though it is not
the only match and is not necessary to satisfy the exact or minimum
number of occurrences for an interval expression.
<DT><STRONG>repeat-short</STRONG><DD>
Incorrect <STRONG>regmatch_t</STRONG> values for a repeated subexpression.
This may be a variant of <STRONG>repeat-artifact</STRONG>.
<DT><STRONG>subexpression-first</STRONG><DD>
A subexpression match takes precedence over a subpattern to its left.
</DL>
</DL>
</DL>
</DL>
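<P>
The difference between the <STRONG>first-match</STRONG> bug and the required
longest of the leftmost matches, and the operand-order sensitivity described under
<STRONG>alternation-order</STRONG>, can be sketched outside of the tested implementations.
The following is a minimal illustration only and is not taken from categorize.dat.
It assumes Python's <STRONG>re</STRONG> module, a backtracking matcher with ordered
alternation that is not one of the implementations in the table.
The helper names <STRONG>first_match</STRONG> and <STRONG>leftmost_longest</STRONG>
are hypothetical.
<PRE>
```python
import re


def first_match(pattern, text):
    """Span reported by Python's backtracking matcher (ordered alternation)."""
    m = re.search(pattern, text)
    return m.span() if m else None


def leftmost_longest(pattern, text):
    """Span of the longest match that starts at the leftmost matching offset.

    Brute-force sketch for simple patterns: locate the leftmost start with
    search(), then try every end offset from the longest to the shortest.
    """
    compiled = re.compile(pattern)
    m = compiled.search(text)
    if m is None:
        return None
    start = m.start()
    for end in range(len(text), start - 1, -1):
        if compiled.fullmatch(text, start, end):
            return (start, end)
    return None


if __name__ == "__main__":
    # Same operands, opposite order; neither operand is involved in a tie.
    for pattern in ("a|ab", "ab|a"):
        print(pattern, first_match(pattern, "xab"), leftmost_longest(pattern, "xab"))
```
</PRE>
<P>
Run as a script, this prints (1, 2) and (1, 3) for a|ab but (1, 3) twice for ab|a:
the ordered matcher's span depends on operand order, while the leftmost-longest span does not.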
<P>
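The (-1,-1) offsets mentioned under <STRONG>nomatch-match</STRONG> and the stale
offsets described under <STRONG>repeat-artifact</STRONG> can be probed in the same hedged way.
This sketch again assumes Python's <STRONG>re</STRONG> module rather than
<STRONG>regexec</STRONG>; its match objects report (-1, -1) for a group that took no part
in the match, which plays the role of an unset <STRONG>regmatch_t</STRONG> here.
The helper name <STRONG>spans</STRONG> and the three probe patterns are hypothetical
and are not drawn from categorize.dat.
<PRE>
```python
import re


def spans(pattern, text):
    """Return the (start, end) span of group 0..n, or None if there is no match.

    Groups that did not participate in the match are reported as (-1, -1).
    """
    m = re.match(pattern, text)
    if m is None:
        return None
    return [m.span(i) for i in range(m.re.groups + 1)]


if __name__ == "__main__":
    # Group 1 is unset because the second alternative matched.
    print(spans(r"(a)|b", "b"))
    # nomatch-match probe: \1 refers to the unset group 1.
    print(spans(r"(?:(a)|b)\1", "b"))
    # repeat-artifact probe: the last iteration of group 1 matched "a",
    # so group 2 took no part in that iteration.
    print(spans(r"(a|(b))*", "ba"))
```
</PRE>
<P>
With this matcher the second probe fails to match, so the back-reference to an unset group
is not treated as matching. The third probe reports (0, 1) for group 2, a span left over
from the earlier iteration, which appears to correspond to what this document
classes as <STRONG>repeat-artifact</STRONG>.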
<HR>
<TABLE border=0 align=center width=96%>
<TR>
<TD align=left></TD>
<TD align=center></TD>
<TD align=right><A href="mailto:gsf@research.att.com?subject= ../re/re-categorize.mm mm document">Glenn Fowler</A></TD>
</TR>
<TR>
<TD align=left></TD>
<TD align=center></TD>
<TD align=right>Information and Software Systems Research</TD>
</TR>
<TR>
<TD align=left></TD>
<TD align=center></TD>
<TD align=right>AT&amp;T Labs Research</TD>
</TR>
<TR>
<TD align=left></TD>
<TD align=center></TD>
<TD align=right>Florham Park NJ</TD>
</TR>
<TR>
<TD align=left></TD>
<TD align=center></TD>
<TD align=right>June 01, 2004</TD>
</TR>
</TABLE>
<P>

</TD></TR></TBODY></TABLE>

</BODY>
</HTML>