Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Tue, Sep 06, 2016 at 06:02:59PM +0200, Johannes Schindelin wrote:
>
> > It will still be quite tricky, because we have to touch a function that is
> > rather at the bottom of the food chain: diff_populate_filespec() is called
> > from fill_textconv(), which in turn is called from pickaxe_match(), and
> > only pickaxe_match() knows whether we want to call regexec() or not (it
> > depends on its regexp parameter).
> >
> > Adding a flag to diff_populate_filespec() sounds really reasonable until
> > you see how many call sites fill_textconv() has.
>
> I was thinking of something quite gross, like a global "switch to using
> slower-but-safer NUL termination" flag (but I agree with Junio's point
> elsewhere that we do not even know if it is "slower").
Urgh.
;-)
> > So now for the better idea.
> >
> > While I was researching the code for this reply, I hit upon one thing
> > that I never knew existed, introduced in f96e567 (grep: use
> > REG_STARTEND for all matching if available, 2010-05-22). Apparently,
> > NetBSD introduced an extension to regexec() where you can specify
> > buffer boundaries using REG_STARTEND. Which is pretty much what we
> > need.
>
> Yes, and compat/regex support this, too. My question is whether it is
> portable.
That is only one question. Another important question is: is it
efficient?
I have no idea whether there exists any hardware-accelerated regex library
out there, maybe even using CUDA (I know that there is some code out there
using SSE to perform LF -> CR/LF conversion, unfortunately it is
intentionally incompatible with GPLv2).
We cannot simply switch everybody and her dog to compat/regex/ just
because we want to avoid a segfault.
> > diff --git a/diff.c b/diff.c
> > index 534c12e..2c5a360 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
> >  {
> >  	if (word_regex && *begin < buffer->size) {
> >  		regmatch_t match[1];
> > -		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
> > +		int f = 0;
> > +#ifdef REG_STARTEND
> > +		match[0].rm_so = 0;
> > +		match[0].rm_eo = *end - *begin;
> > +		f = REG_STARTEND;
> > +#endif
> > +		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, f)) {
Heh. You introduced the same bug I did. Or maybe you just fetched my
mmap-regexec branch and looked at an intermediate iteration?
The problem with this patch is that *end is uninitialized at this point.
I actually initialized it in my patch, but it was still incorrect. I
eventually settled on using buffer->size - *begin as the limit.
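
To make the limit concrete, here is a minimal standalone sketch, not the
actual patch. `struct span` is a stand-in for mmfile_t and `match_from()`
is a made-up helper; the only point is that the limit handed to regexec()
is whatever remains of the buffer after `begin`. Where REG_STARTEND is
missing, the sketch simply reports -1.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Stand-in for Git's mmfile_t: a buffer that is NOT NUL-terminated. */
struct span {
	const char *ptr;
	long size;
};

/*
 * Hypothetical helper (not part of Git): look for the first match of
 * `re` in the bytes from offset `begin` up to the end of the buffer.
 * Returns 0 on a match (offsets in `m` are relative to ptr + begin),
 * REG_NOMATCH if nothing matches, and -1 where REG_STARTEND is
 * unavailable.
 */
int match_from(const regex_t *re, const struct span *buf, long begin,
	       regmatch_t *m)
{
	assert(begin >= 0 && begin <= buf->size);
#ifdef REG_STARTEND
	m->rm_so = 0;
	/* The limit is the rest of the buffer, not some unset *end. */
	m->rm_eo = (regoff_t)(buf->size - begin);
	return regexec(re, buf->ptr + begin, 1, m, REG_STARTEND);
#else
	(void)re;
	(void)m;
	return -1;
#endif
}
```

With a span covering only the first six bytes of "42 abcd", a pattern
such as [a-z]+ should match "abc" and stop there, because regexec() is
never told about the seventh byte.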
> What happens to those poor souls on systems without REG_STARTEND? Do
> they get to keep segfaulting?
Of course not. Those poor souls on systems without REG_STARTEND pay a
small price for that: malloc(); memcpy(); *end = '\0'; ... free();
I think it is worth it: maintenance of the code is much easier that way
than forcing everybody and her dog and her dog's hamster to compat/regex/.
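
Roughly, the two paths could look like the sketch below. The helper names
`match_by_copy()` and `match_counted()` and their signatures are made up
for illustration and are not what the patch uses; the copying path is
compiled unconditionally here only so that it can be exercised on any
platform.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Fallback path (hypothetical helper): copy the `len` bytes, append a
 * NUL, and run a plain regexec() on the copy.
 */
int match_by_copy(const regex_t *re, const char *buf, size_t len,
		  regmatch_t *m)
{
	char *copy = malloc(len + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;
	memcpy(copy, buf, len);
	copy[len] = '\0';
	ret = regexec(re, copy, 1, m, 0);
	free(copy);
	return ret;
}

/*
 * Hypothetical wrapper: use REG_STARTEND where the platform offers it
 * and pay for the copy only where it does not.
 */
int match_counted(const regex_t *re, const char *buf, size_t len,
		  regmatch_t *m)
{
#ifdef REG_STARTEND
	m->rm_so = 0;
	m->rm_eo = (regoff_t)len;
	return regexec(re, buf, 1, m, REG_STARTEND);
#else
	return match_by_copy(re, buf, len, m);
#endif
}
```

Callers would only ever see match_counted(); the #ifdef stays in one
place, which is the maintenance argument in a nutshell.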
> But I much prefer this approach to copying the data just to add a NUL.
I think it is not worth the burden. The only regex implementation in
semi-widespread use that does not support REG_STARTEND seems to be musl.
I'd rather not spend *so much* effort just to support an obscure platform.
Not when the users of that obscure platform could spend that effort
themselves. And probably won't, because we only copy data to add a NUL on
those platforms when regexec() is called on an mmfile_t.
Better to keep it simple,
Dscho