Re: Re[6]: Pattern Regular Expressions: Consecutive ORs not handled correctly

robert.emmery Fri, 13 May 2005 21:59:35 -0700

Sergey,

ORO's "matches" function behaves here exactly the way the Perl regex engine
does. While the foo|foot example may seem odd to someone new to regular
expressions, it is a well-understood consequence of Perl's matching
mechanics. A simple rule about Perl's regex that may help is: Perl's
alternations are not greedy. Alternatives are tried left to right, and the
first one that succeeds wins. Knowing this, and understanding the basic
workings of traditional NFA engines, should help explain why foo|foot
matches "foo" in "football" and not "foot" as one might expect. Other types
of engines (DFA or POSIX NFA) always match the longest of the leftmost
candidates, in this case "foot".
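
The foo|foot behaviour is easy to check in any backtracking engine. The
sketch below uses Python's re module as a stand-in for Perl 5 (an assumption
made purely for illustration; ORO itself is not involved), since it also
tries alternatives left to right.

```python
import re

text = "football"

# Alternatives are tried left to right; the first one that succeeds wins.
first = re.search(r"foo|foot", text).group()

# Putting the longer alternative first lets it win instead.
second = re.search(r"foot|foo", text).group()

print(first)   # foo
print(second)  # foot
```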

A traditional NFA engine such as Perl's starts with the regular expression's
first element and tries to match it against the first character in the text.
If you have the expression \d{3}|\d{3}\.\d{2} and the text "123.45", Perl
will look at \d{3} first and see if it can match it starting at "1" in
"123". To have a successful match, at least one permutation of the regular
expression must be matched against the text. What's different about a
traditional NFA is that the first permutation that matches is good enough.
Knowing this can be pretty valuable, since you can craft your expressions so
that the permutations which would match fastest are tried first. In many
cases, a fine-tuned Perl regex can outperform a DFA regex, which keeps track
of all matches so far until it finds the longest. Another important thing to
note is that quantifiers (like * and ?) are greedy by default. With this in
mind, one can achieve a greedy alternation in a traditional NFA by using the
? quantifier. If you had an expression like he(ll|llo) and the text "hello",
you could match "hello" by rewriting the expression as he(ll(o)?). However,
I would still rewrite the expression as he(llo|ll), as long as I understand
that it translates to "Match hello if you can, if not, try to match hell".
This is not the same as "Match the longest of either hell or hello". If you
think about it, that last sentence is semantically equivalent to "Match the
longest of either hello or hell", and he(llo|ll) is the syntax Perl expects
for that interpretation.
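
To make the ordering effect concrete, here is a small sketch, again assuming
Python's re module as a stand-in for a Perl-style backtracking engine. The
optional-tail form is written here as he(ll(o)?), which is one way to get
the greedy behaviour from a quantifier.

```python
import re

# Order decides which alternative wins on "123.45".
short_first = re.match(r"\d{3}|\d{3}\.\d{2}", "123.45").group()
long_first = re.match(r"\d{3}\.\d{2}|\d{3}", "123.45").group()

# he(ll|llo) stops at "hell"; the other two forms reach "hello".
ordered = re.match(r"he(ll|llo)", "hello").group()
reordered = re.match(r"he(llo|ll)", "hello").group()
optional = re.match(r"he(ll(o)?)", "hello").group()

print(short_first, long_first)       # 123 123.45
print(ordered, reordered, optional)  # hell hello hello
```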

DFA and POSIX engines will try to match the text to the regular expression.
So in your case, they'll take "1" in "123.45" and try to match it to
\d{3}|\d{3}\.\d{2}. The result will always be the same, no matter how you
order your alternations, as the longest match wins.
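
The order-independence can be imitated for a single top-level alternation.
The helper below, leftmost_longest, is a hypothetical name and only a rough
emulation built on Python's re module, not a real DFA or POSIX engine: it
tries every alternative at each start position and keeps the longest one.

```python
import re

def leftmost_longest(alternatives, text):
    """Return the longest match among the alternatives at the leftmost
    position where any of them matches, or None if nothing matches."""
    compiled = [re.compile(a) for a in alternatives]
    for start in range(len(text) + 1):
        candidates = [m for m in (p.match(text, start) for p in compiled) if m]
        if candidates:
            # Longest candidate wins, regardless of the order given.
            return max(candidates, key=lambda m: m.end()).group()
    return None

a = leftmost_longest([r"\d{3}", r"\d{3}\.\d{2}"], "123.45")
b = leftmost_longest([r"\d{3}\.\d{2}", r"\d{3}"], "123.45")
print(a, b)  # 123.45 123.45
```

Each alternative is still matched with backtracking semantics internally, so
this only approximates longest-match behaviour for flat alternations.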

These are clearly two different approaches to matching. That said, asking
for ORO's matcher to have greedy alternations would be asking for a
completely different flavor of regex engine inside ORO. Finally, making this
type of change would break many applications which currently rely on Perl's
regex semantics.
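
For reference, the matches() behaviour under discussion can be sketched
along the lines of the Perl snippet quoted further down in this thread. The
function name perl_style_matches is hypothetical and this is not ORO's API;
Python's re module again stands in for a Perl-style engine.

```python
import re

def perl_style_matches(pattern, text):
    # Find the first match the backtracking engine settles on, then
    # require that match to be the whole input.
    m = re.search("(" + pattern + ")", text)
    return m is not None and m.group(1) == text

print(perl_style_matches("foo|foot", "foo"))   # True
print(perl_style_matches("foo|foot", "foot"))  # False
print(perl_style_matches("foot|foo", "foot"))  # True
```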

Regards,
-Rob

On Fri, 13 May 2005 23:11:27 +0400, Sergey Samokhodkin wrote
> Hello Daniel!
> 
> Friday, May 6, 2005, 11:16:54 PM, you wrote:
> 
> DFS> .....
> DFS> The heart of the matter seems to be a difference in expectations.  I
> 
> Of course, but isn't my expectation *natural*?
> 
> DFS> understand why you could expect matches() to behave that way.  However,
> DFS> its documentation explains that it's not the same as ^pattern$.  I'll
> 
> In fact, it only states the difference without any real explanation.
> Let me cite:
> > matches() literally looks for an exact match according to the rules
> > of Perl5 expression matching. Therefore, if you have a pattern
> > foo|foot and are matching the input foot it will not produce an exact match
> 
> How "therefore"???
> Anyone who finds it clear (esp. Kevin Markey), please guess which of
> the following is true: 
> 
> /foot?/ matches "foot"
> /foot?/ matches "foo"
> /foot??/ matches "foot"
> /foot??/ matches "foo"
> 
> DFS> matches() tests whether or not a pattern matches the input it is given.
> DFS> This means that the matching process must start at the beginning of
> DFS> the input and stop at the end of the input.  If the matching process stops
> DFS> before the end of the input, then there's no match.  The method answers
> DFS> the question "Is this input character sequence a member of the set of all
> DFS> the character sequences matched by this pattern?"
> 
> Ooops!
> The matching set for "foo|foot" is {"foo","foot"}.
> The matching set for "foot|foo" is ***the same***. Order doesn't
> matter in sets.
> 
> DFS> It may make more sense thinking about it this way.  matches() returns true
> DFS> if and only if S =~ m/(P)/ is true and $1 equals S.  For example:
> 
> DFS>   sub matches(@) {
> DFS>     my ($pat, $str) = @_;
> DFS>     $str =~ m/($pat)/;
> DFS>     return ($str eq $1);
> DFS>   }
> 
> DFS>   printf "%d\n%d\n", matches("foo|foot", "foo"),
> DFS>     matches("foo|foot", "foot");
> 
> Yes, that's it. Something like that had to be in the docs.
> 
> DFS> In my opinion, the important thing is for the behavior to be documented.
> DFS> If it's not sufficiently clear, then we ought to make it more clear.
> DFS> Documentation patches are welcome.
> 
> DFS> Now, one can argue that we should add a validate() method specifically
> DFS> for input validation with the behavior you expected.  My opinion
> 
> I'd say that the best method would be the "matches()" itself (see my
> first remark).
> Otherwise the question is closed.
> Thanks a lot for your patience.
> 
> DFS> daniel
> 
> -- 
> Best regards,
>  Sergey
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
