[going back on the list]
Thanks for this trimmed-down code. It at least gives an impression of what is going on in GNU regex. But I would still like to start from scratch for libunistring, 1. because I want code that works the same way in UTF-8 as in UTF-32,
You can do that, I think, incrementally from the current code. It should be possible to tweak it so that OP_UTF8_PERIOD just "folds" into OP_PERIOD when you're processing UTF-8, and so on.
2. in the hope that I can find a set of node/operators that is more efficient (peephole optimization, like you said it) and possibly integrate the "kwset" trick in some way.
I think the kwset search should be done separately (i.e. as preprocessing). The tree-based representation of regcomp.c should be quite amenable to extracting the required keywords.
One inefficiency I have in the posted code is that I try every single starting point, which makes for O(n^2) performance instead of O(n). It should be easy to prepend a .* in regcomp.c to fix this. But for everything else it is a pretty natural DFA implementation.
There are other optimizations possible, for example I think you only need MB_CUR_MAX elements in the state log so you could make it 8-entries long (or 4 if you only want the million-something Unicode characters). The current implementation still has a state log as big as the original string, which is a remnant of the support for backreferences.
It would be nice to have a POSIX (or almost POSIX) mode in libunistring. The multibyte DFA code is so bitrot in GNU grep that relying on libunistring for UTF-8 locales would be a good cleanup...
Paolo