Re: [RFC] Add prototype pure-DFA matcher

Paolo Bonzini Sun, 06 Dec 2009 05:48:16 -0800

[going back on the list]

Thanks for this trimmed-down code. It at least gives an impression of
what is going on in GNU regex. But I would still like to start from scratch
for libunistring,
   1. because I want code that works the same way in UTF-8 as in UTF-32,

You can do that, I think, incrementally from the current code. It should be possible to tweak it so that OP_UTF8_PERIOD just "folds" into OP_PERIOD when you're processing UTF-8, and so on.

   2. in the hope that I can find a set of node/operators that is more
      efficient (peephole optimization, like you said it) and possibly
      integrate the "kwset" trick in some way.

I think the kwset search should be done separately (i.e. as preprocessing). The tree-based representation of regcomp.c should be quite amenable to extracting the required keywords.

One inefficiency I have in the posted code is that I try every single starting point, which makes for O(n^2) performance instead of O(n). It should be easy to prepend a .* in regcomp.c to fix this. But for everything else it is a pretty natural DFA implementation.
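
To illustrate the difference, here is a minimal sketch that is not taken from the posted code. It hard-codes two tiny DFAs for the pattern a+b: an anchored one that is restarted at every offset, and one for .*a+b that scans the input once. The function names, state numbering and the pattern itself are assumptions made up for the example. Both functions return the end offset of the first match, or -1.

```c
#include <assert.h>
#include <stddef.h>

enum { DEAD = -1, ACCEPT = 2 };

/* Anchored DFA for "a+b": state 0 = nothing matched, state 1 = inside a+. */
static int step_anchored(int state, char c)
{
    if (state == 0)
        return c == 'a' ? 1 : DEAD;
    if (state == 1)
        return c == 'a' ? 1 : (c == 'b' ? ACCEPT : DEAD);
    return DEAD;
}

/* DFA for ".*a+b": the implicit .* means there is no dead state;
   a mismatch just falls back to state 0 (or 1 if the byte is 'a'). */
static int step_unanchored(int state, char c)
{
    if (c == 'a')
        return 1;
    if (c == 'b' && state == 1)
        return ACCEPT;
    return 0;
}

/* Try every starting point: O(n^2) on inputs such as "aaaa...a". */
long search_naive(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t start = 0; start < n; start++) {
        int state = 0;
        for (size_t i = start; i < n; i++) {
            state = step_anchored(state, s[i]);
            if (state == DEAD)
                break;
            if (state == ACCEPT)
                return (long)(i + 1);
        }
    }
    return -1;
}

/* One pass over the input with the .*-prefixed DFA: O(n). */
long search_single_pass(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    int state = 0;
    for (size_t i = 0; i < n; i++) {
        state = step_unanchored(state, s[i]);
        if (state == ACCEPT)
            return (long)(i + 1);
    }
    return -1;
}
```

On a long run of 'a' with no 'b', search_naive rescans the tail from every offset, while search_single_pass touches each byte once. The single pass only reports where the match ends; recovering the start of the match would need extra work that this sketch does not attempt.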

There are other optimizations possible; for example, I think you only need MB_CUR_MAX elements in the state log, so you could make it 8 entries long (or 4 if you only want the million-something Unicode characters). The current implementation still has a state log as big as the original string, which is a remnant of the support for backreferences.
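
A rough sketch of the bounded state log, under the assumption that a single transition consumes at most one character of at most 4 bytes (UTF-8 restricted to Unicode scalar values). This is not the regexec.c data structure: the "state" here is just the number of characters consumed so far, stored plus one so that 0 can mean "no state scheduled at this byte position", and only lead bytes are inspected (continuation bytes are not validated). The names RING, utf8_len and count_chars_ring are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define RING 4  /* longest UTF-8 sequence for a Unicode scalar value */

/* Length implied by a UTF-8 lead byte, or 0 if c cannot start a character. */
static size_t utf8_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/* Walk s[0..n) byte by byte, keeping the state log in a ring of RING
   slots indexed by (byte position % RING).  Returns the number of
   characters, or -1 on an invalid lead byte or a truncated character. */
long count_chars_ring(const unsigned char *s, size_t n)
{
    unsigned log[RING] = { 0 };

    assert(s != NULL || n == 0);
    log[0] = 1;                       /* state 0 is active at position 0 */
    for (size_t i = 0; i < n; i++) {
        unsigned st = log[i % RING];
        log[i % RING] = 0;            /* slot is free for position i + RING */
        if (st == 0)
            continue;                 /* middle of a multibyte character */
        size_t len = utf8_len(s[i]);
        if (len == 0 || i + len > n)
            return -1;
        /* The destination is at most RING bytes ahead, so it always
           lands in a slot that has already been read and cleared. */
        log[(i + len) % RING] = st + 1;
    }
    return log[n % RING] ? (long)log[n % RING] - 1 : -1;
}
```

The point of the sketch is only that a slot written for position i + len never collides with a live entry, because every earlier position has already been consumed and cleared. A real matcher would store state sets rather than a counter, and would need the full-length log again as soon as backreferences are involved.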

It would be nice to have a POSIX (or almost POSIX) mode in libunistring. The multibyte DFA code is so bitrotted in GNU grep that relying on libunistring for UTF-8 locales would be a good cleanup...


Paolo
