Sergey,

The way ORO's matches() method works in this case coincides with the way
the Perl regex engine works. While the foo|foot example may seem odd to
someone new to regular expressions, it is a well-understood fact about
Perl's matching mechanics. The simple rule that may help is: Perl's
alternations are not greedy. Knowing this, and understanding the basic
workings of traditional NFA engines, should help explain why foo|foot
matches "foo" in "football" and not "foot" as one might expect. Other types
of engines (DFA or POSIX NFA) always return the longest of the leftmost
matches, in this case "foot".
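
To see the ordering effect concretely, here is a small sketch in Python
rather than Perl or ORO. It relies on Python's re module being a
backtracking engine that, like Perl, takes the first alternative that
succeeds. The helper name first_match is made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the leftmost match of pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# The first alternative that succeeds wins, even though a longer one exists.
print(first_match(r"foo|foot", "football"))   # foo

# Putting the longer alternative first changes the result.
print(first_match(r"foot|foo", "football"))   # foot
```

The only thing that differs between the two calls is the order of the
alternatives.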

A traditional NFA engine such as Perl's starts with the regular
expression's first element and tries to match it against the first
character in the text. If you have the expression \d{3}|\d{3}\.\d{2} and
the text "123.45", Perl will look at \d{3} first and see if it can match it
starting at "1" in "123". For a match to succeed, at least one permutation
of the regular expression must match the text. What's different about a
traditional NFA is that the first permutation that matches is good enough.
Knowing this can be pretty valuable, since you can craft your expressions
so that the permutations likely to match fastest are tried first. In many
cases, a fine-tuned Perl regex can outperform a DFA regex, which keeps track
of all matches so far until it finds the longest.

Another important thing to note is that quantifiers (like * and ?) are
greedy by default. With this in mind, one can achieve a greedy alternation
in a traditional NFA by using the ? quantifier. If you had an expression
like he(ll|llo) and the text "hello", you could match "hello" by rewriting
the expression as he(ll(o)?). However, I would still rewrite the
expression as he(llo|ll), as long as I understand that it translates to
"Match hello if you can; if not, try to match hell". This is not the same
as "Match the longest of either hell or hello". If you think about it,
that last sentence means exactly the same thing as "Match the longest of
either hello or hell", and putting the longer alternative first is the
ordering Perl needs in order to produce that result.
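
The three variants of the "hello" expression can be compared with a short
sketch. Again this uses Python's re module as a stand-in for a Perl-style
engine, which is an assumption on my part. The helper name matched is
hypothetical.

```python
import re

def matched(pattern, text):
    """Return what pattern matches at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# Alternation order decides: "ll" succeeds first, so the "o" is left over.
print(matched(r"he(ll|llo)", "hello"))   # hell

# A greedy optional tail takes the "o" whenever it is present.
print(matched(r"he(ll(o)?)", "hello"))   # hello

# Longer alternative first: match hello if you can, otherwise hell.
print(matched(r"he(llo|ll)", "hello"))   # hello
print(matched(r"he(llo|ll)", "hell"))    # hell
```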

DFA and POSIX engines work the other way around and try to match the text
to the regular expression. So in your case, they'll take "1" in "123.45"
and try to match it against \d{3}|\d{3}\.\d{2}. The result will always be
the same no matter how you order your alternations, as the longest match
wins.
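
The longest-match rule can be imitated for a flat list of alternatives.
What follows is only a rough emulation, not a real DFA or POSIX engine:
each alternative is still matched by Python's backtracking re module, and
the function name leftmost_longest is invented for this sketch.

```python
import re

def leftmost_longest(alternatives, text):
    """Pick the earliest match among the alternatives; break ties by length."""
    best = None
    for alt in alternatives:
        m = re.search(alt, text)
        if m is None:
            continue
        # Earliest start first; among equal starts, the longest match.
        key = (m.start(), -(m.end() - m.start()))
        if best is None or key < best[0]:
            best = (key, m.group(0))
    return best[1] if best else None

alts = [r"\d{3}", r"\d{3}\.\d{2}"]
print(leftmost_longest(alts, "123.45"))              # 123.45
print(leftmost_longest(alts[::-1], "123.45"))        # 123.45, order is irrelevant
print(re.search("|".join(alts), "123.45").group(0))  # 123, first alternative wins
```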

These are clearly two different approaches to matching. That said, asking
for ORO's matcher to have greedy alternations would be asking for a
completely different flavor of regex engine inside ORO. Finally, making
this type of change would break many applications that currently rely on
Perl's regex semantics.
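
For reference, Daniel's Perl sketch of matches() quoted below can be
ported to Python to show how alternation order affects an exact-match
test. This is a port of that sketch under the assumption that Python's re
behaves like Perl here. It is not ORO's actual code, and the name
exact_match is made up.

```python
import re

def exact_match(pattern, text):
    """True only if the first match the engine finds covers all of text."""
    m = re.search("(" + pattern + ")", text)
    return m is not None and m.group(1) == text

print(exact_match("foo|foot", "foo"))    # True
print(exact_match("foo|foot", "foot"))   # False: the engine stops at "foo"
print(exact_match("foot|foo", "foot"))   # True: longer alternative tried first
print(exact_match("foot|foo", "foo"))    # True
```

With the longer alternative first, both inputs pass without any change to
the engine.
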
Regards,
-Rob
On Fri, 13 May 2005 23:11:27 +0400, Sergey Samokhodkin wrote
> Hello Daniel!
>
> Friday, May 6, 2005, 11:16:54 PM, you wrote:
>
> DFS> .....
> DFS> The heart of the matter seems to be a difference in
> DFS> expectations. I
>
> Of course, but isn't my expectation *natural*?
>
> DFS> understand why you could expect matches() to behave that way.
> DFS> However, its documentation explains that it's not the same as
> DFS> ^pattern$. I'll
>
> In fact, it only states the difference without any real explanation.
> Let me cite:
> > matches() literally looks for an exact match according to the rules
> > of Perl5 expression matching. Therefore, if you have a pattern
> > foo|foot and are matching the input foot it will not produce an exact
> > match
>
> How "therefore"???
> Anyone who finds it clear (esp. Kevin Markey), please guess which of
> the following is true:
>
> /foot?/ matches "foot"
> /foot?/ matches "foo"
> /foot??/ matches "foot"
> /foot??/ matches "foo"
>
> DFS> matches() tests whether or not a pattern matches the input it is
> DFS> given. This means that the matching process must start at the
> DFS> beginning of the input and stop at the end of the input. If the
> DFS> matching process stops before the end of the input, then there's
> DFS> no match. The method answers the question "Is this input
> DFS> character sequence a member of the set of all the character
> DFS> sequences matched by this pattern?"
>
> Ooops!
> The matching set for "foo|foot" is {"foo","foot"}.
> The matching set for "foot|foo" is ***the same***. Order doesn't
> matter in sets.
>
> DFS> It may make more sense thinking about it this way. matches()
> DFS> returns true if and only if S =~ m/(P)/ is true and $1 equals S.
> DFS> For example:
>
> DFS> sub matches(@) {
> DFS> my ($pat, $str) = @_;
> DFS> $str =~ m/($pat)/;
> DFS> return ($str eq $1);
> DFS> }
>
> DFS> printf "%d\n%d\n", matches("foo|foot", "foo"),
> DFS>     matches("foo|foot", "foot");
>
> Yes, that's it. Something like that had to be in the docs.
>
> DFS> In my opinion, the important thing is for the behavior to be
> DFS> documented.
> DFS> If it's not sufficiently clear, then we ought to make it more clear.
> DFS> Documentation patches are welcome.
>
> DFS> Now, one can argue that we should add a validate() method specifically
> DFS> for input validation with the behavior you expected. My opinion
>
> I'd say that the best method would be the "matches()" itself (see my
> first remark).
> Otherwise the question is closed.
> Thanks a lot for your patience.
>
> DFS> daniel
>
> --
> Best regards,
> Sergey
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]