just writing back to thank you for your help and report on the outcome:
As you said, Awk regexps perform the same whether I factor out that '<'.
It was easy to use Awk regexps on byte input, which reduced the garbage-generation rate, but did not change CPU consumption measurably.
The 'good' Perl regexp (the one with the '<' factored out) still performed slightly better than the Awk option in CPU usage. Since Perl's are much more flexible and allowed for more readable regexps, I decided to stay with Perl's.
-- Thanks again,
Jordi.
En/na Daniel F. Savarese ha escrit:
In message <[EMAIL PROTECTED]>, Jordi Salvat i Alabart writes:
First question (out of sheer curiosity): why is this later regexp faster than the earlier one?
The expression is too long for me to analyze on a glance, but anything you can do to rewrite a pattern that reduces backtracking will yield performance gains. It looks like you moved the common prefix < for each alternation to the beginning of the expression and grouped the rest. That will definitely reduce backtracking and since it definitively establishes the first character of the pattern, match attempts need only be made at instances of < in the input.
Second question: I would like to run the regexps against the HTML content as a byte array (byte[]) without having to convert it into a string. Can ORO do this?
On the reasons why I don't want to do the byte[]-to-String conversion:
1/ Memory efficiency.
2/ I don't need it: even if there were multi-byte characters in the input, they are not part of my problem.
3/ The conversion can cause problems if the input is wrong.
Since you're dealing with 8-bit input, I think you may get better results using AwkMatcher. It would have been able to optimize your original expression rather than requiring you to tweak it yourself. In other words, both of the expressions you listed would have been converted into the same DFA, whereas they result in two different Perl NFAs. You still have to work with char input, but if you're just doing a single pass on the input an feeding an InputStreamReader that wraps ByteArrayInputStream to AwkStreamInput will work. But it would be more efficient to just store the HTML in a char[].
Third question: I've read that byte-based regexp engines use a type of state machines which is significantly faster than char-based regexp engines. Am I correct? Can ORO take advantage of this? Could you recommend a regexp engine which can?
The difference is really DFA versus NFA. You can't build a straight table-based DFA with 16-bit characters because you wind up with 64k possible inputs for each state. Building table based DFAs for 16-bit characters just uses too much memory or is too slow if you try to save memory using sparse matrices, so they tend to be implemented as NFAs of one sort or another. 8-bit characters give you 256 transitions for each state, which is manageable when using array-based table lookups. Even though AwkMatcher uses char input, it only pays attention to the lower 8 bits, so it can be rather fast in the right situations like your application. The thing to keep in mind is that it builds the DFA for a pattern in lazy fashion as it is performing a match, so initial matches (when the DFA is being built) will be slower than later matches (after the DFA has been built).
In any case, I suggest you stick with Perl5Matcher if it meets your needs just because Perl regular expressions have a richer syntax than awk. Otherwise, try AwkMatcher. You shouldn't have to change any code other than making your PatternMatcher a new AwkMatcher() instead of a new Perl5Matcher(). However your pattern will have to change slightly since the non capturing (?:) group construct isn't supported by awk.
daniel
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
