Re: Need help optimizing a regexp

Jordi Salvat i Alabart Tue, 02 Dec 2003 05:36:32 -0800

Hi Daniel,

just writing back to thank you for your help and report on the outcome:

As you said, Awk regexps perform the same whether I factor out that '<'.

It was easy to use Awk regexps on byte input, which reduced the garbage-generation rate, but did not change CPU consumption measurably.

The 'good' Perl regexp (the one with the '<' factored out) still performed slightly better than the Awk option in CPU usage. Since Perl regexps are much more flexible and allow for more readable patterns, I decided to stay with Perl's.

--
Thanks again,

Jordi.

Daniel F. Savarese wrote:

In message <[EMAIL PROTECTED]>, Jordi Salvat i Alabart writes:

First question (out of sheer curiosity): why is this later regexp faster than the earlier one?

The expression is too long for me to analyze at a glance, but any
rewrite of a pattern that reduces backtracking will yield
performance gains.  It looks like you moved the common prefix < for
each alternation to the beginning of the expression and grouped the rest.
That will definitely reduce backtracking and since it definitively
establishes the first character of the pattern, match attempts need
only be made at instances of < in the input.
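
The rewrite described above can be sketched as follows. This is only an illustration: the two patterns are hypothetical stand-ins (the real expressions are not shown in this message), and the code uses java.util.regex rather than ORO, so it demonstrates that the two forms match the same input, not how much faster the factored form runs in Perl5Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixFactoring {
    // Hypothetical patterns, invented for this sketch.
    // Unfactored: every alternative repeats the leading '<'.
    static final Pattern UNFACTORED = Pattern.compile("<a\\b|<img\\b|<link\\b");
    // Factored: the common '<' is written once, the rest is grouped.
    static final Pattern FACTORED = Pattern.compile("<(?:a|img|link)\\b");

    // Counts non-overlapping matches of p in input.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"x\">x</a> <img src=\"y\"> <link rel=\"z\"></p>";
        // Both forms find the same three opening tags.
        System.out.println(countMatches(UNFACTORED, html));
        System.out.println(countMatches(FACTORED, html));
    }
}
```

In the factored form the first character of any match is fixed, which is the property that lets a backtracking matcher skip every input position that is not a '<'.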

Second question: I would like to run the regexps against the HTML content as a byte array (byte[]) without having to convert it into a string. Can ORO do this?

On the reasons why I don't want to do the byte[]-to-String conversion: 1/ Memory efficiency. 2/ I don't need it: even if there were multi-byte characters in the input, they are not part of my problem. 3/ The conversion can cause problems if the input is wrong.

Since you're dealing with 8-bit input, I think you may get better
results using AwkMatcher.  It would have been able to optimize
your original expression rather than requiring you to tweak it
yourself.  In other words, both of the expressions you listed
would have been converted into the same DFA, whereas they result
in two different Perl NFAs.  You still have to work with char
input, but if you're just doing a single pass on the input,
feeding an InputStreamReader that wraps a ByteArrayInputStream
to AwkStreamInput will work.  But it would be more efficient
to just store the HTML in a char[].
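
A minimal sketch of the two options mentioned here, using only the standard library. The class and method names are made up for illustration, and the choice of ISO-8859-1 is an assumption: it maps every byte to the char with the same value, so malformed input cannot make the conversion fail. The Reader returned by asReader is what would be handed to AwkStreamInput; that step is left out so the sketch stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ByteInput {
    // Option A: stream the bytes through a Reader for a single pass.
    // ISO-8859-1 is an assumption of this sketch (one byte -> one char).
    static Reader asReader(byte[] html) {
        return new InputStreamReader(
                new ByteArrayInputStream(html), StandardCharsets.ISO_8859_1);
    }

    // Option B: keep the content as a char[] by widening each byte.
    // The mask avoids sign extension for bytes >= 0x80.
    static char[] widen(byte[] html) {
        char[] out = new char[html.length];
        for (int i = 0; i < html.length; i++) {
            out[i] = (char) (html[i] & 0xFF);
        }
        return out;
    }
}
```

Option B doubles the memory used for the content, but avoids both the decoder and the intermediate String.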

Third question: I've read that byte-based regexp engines use a type of state machine which is significantly faster than char-based regexp engines. Am I correct? Can ORO take advantage of this? Could you recommend a regexp engine which can?

The difference is really DFA versus NFA.  You can't build a straight
table-based DFA with 16-bit characters because you wind up with
64k possible inputs for each state.  Building table-based DFAs
for 16-bit characters just uses too much memory or is too slow
if you try to save memory using sparse matrices, so they tend to
be implemented as NFAs of one sort or another.  8-bit characters
give you 256 transitions for each state, which is manageable when
using array-based table lookups.  Even though AwkMatcher uses char
input, it only pays attention to the lower 8 bits, so it can be
rather fast in the right situations like your application.  The
thing to keep in mind is that it builds the DFA for a pattern
in lazy fashion as it is performing a match, so initial matches
(when the DFA is being built) will be slower than later matches
(after the DFA has been built).
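
To make the table-size argument concrete, here is a small hand-built DFA, not AwkMatcher's actual implementation. It finds the literal "<a" using one array row of 256 entries per state and, as described above, looks only at the lower 8 bits of each char. The class name, the pattern, and the use of int entries are all assumptions of this sketch; the table is built eagerly, so the lazy construction is not shown.

```java
public class TableDfa {
    static final int SIGMA = 256;   // 8-bit alphabet: 256 columns per state
    static final int ACCEPT = 2;    // state reached after seeing "<a"
    static final int[][] TABLE = build();

    // States: 0 = start, 1 = just saw '<', 2 = saw "<a".
    static int[][] build() {
        int[][] t = new int[3][SIGMA]; // zero-filled: default is back to state 0
        t[0]['<'] = 1;
        t[1]['<'] = 1;                 // "<<a" must still match at the second '<'
        t[1]['a'] = ACCEPT;
        return t;
    }

    // Returns the index just past the first "<a", or -1 if there is none.
    // One array lookup per input char, no backtracking.
    static int find(char[] input) {
        int state = 0;
        for (int i = 0; i < input.length; i++) {
            // Only the low 8 bits are used, so chars above 0xFF alias
            // to the char that shares their low byte.
            state = TABLE[state][input[i] & 0xFF];
            if (state == ACCEPT) {
                return i + 1;
            }
        }
        return -1;
    }

    // Size in bytes of a dense transition table with int entries.
    static long tableBytes(int states, int alphabetSize) {
        return (long) states * alphabetSize * Integer.BYTES;
    }
}
```

With int entries, tableBytes(1, 256) is 1024 bytes per state, while tableBytes(1, 65536) is 262144 bytes (256 KiB) per state, which is why a dense table is practical for 8-bit input and not for 16-bit input.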

In any case, I suggest you stick with Perl5Matcher if it meets your
needs just because Perl regular expressions have a richer syntax
than awk.  Otherwise, try AwkMatcher.  You shouldn't have to change
any code other than making your PatternMatcher a new AwkMatcher()
instead of a new Perl5Matcher().  However, your pattern will have
to change slightly since the non-capturing (?:) group construct
isn't supported by awk.

daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
