On Thu, Sep 08, 2016 at 09:29:58AM +0200, Johannes Schindelin wrote:

> sorry for the late answer, I was really busy trying to come up with a new
> and improved version of the patch series, and while hunting a bug I
> introduced got bogged down with other tasks.

No problem. I am not in a hurry.

> > I always assumed the _point_ of re_search taking a ptr/len pair was
> > exactly to handle this case. The documentation[1] says:
> >
> > `string` is the string you want to match; it can contain newline and
> > null characters. `size` is the length of that string.
> >
> > Which seems pretty definitive to me (that's for re_match(), but
> > re_search() is defined in the docs in terms of re_match()).
>
> Right. The problem is: I *really* want to avoid using GNU-isms.

I don't think GNU-isms are a problem if we wrap them to give a nice
interface, and if we rely on having compat/regex. But if you mean "I do
not want to rely on using compat/regex everywhere", then OK. I can see
arguments both for and against using a consistent regex library, but I
do not care that much either way myself.

> > We can contain this to the existing compat/regexec/regexec.c, and just
> > provide a wrapper that is similar to regexec but takes a ptr/len pair.
>
> But we can do even better than that: we can provide a wrapper that uses
> REG_STARTEND where available (which is really the majority of platforms we
> care about: Linux, MacOSX, Windows, and even the *BSDs). Where it is not
> available, we simply malloc(), memcpy() and append a NUL.

Doesn't that make things much _worse_ for people on systems without
REG_STARTEND? If we imagine that most regexec calls would operate on a
NUL-terminated buffer, then they are now paying the extra malloc and
copy for each call to regexec_buf(), even if the buffer was already
NUL-terminated (because they have no idea whether it was or not).
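
To make that cost concrete, here is a rough sketch of what the
malloc-and-copy fallback could look like. The name regexec_buf_copy()
is hypothetical and only for illustration; this is an assumption about
the shape of the fallback, not code from the patch series.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical fallback for a platform without REG_STARTEND: copy the
 * ptr/len buffer so that regexec() sees a NUL-terminated string.
 */
static int regexec_buf_copy(const regex_t *preg, const char *buf, size_t size,
			    size_t nmatch, regmatch_t pmatch[], int eflags)
{
	char *copy = malloc(size + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;	/* regex error code for "out of memory" */
	memcpy(copy, buf, size);
	copy[size] = '\0';
	/* The malloc and copy above are paid on every call, even when
	 * the caller's buffer happened to be NUL-terminated already. */
	ret = regexec(preg, copy, nmatch, pmatch, eflags);
	free(copy);
	return ret;
}
```

Note that this only fixes the missing terminator: regexec() still stops
at the first NUL inside the copy, so a match located after an embedded
NUL would not be found.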

I think I'd rather just have:

  #ifndef REG_STARTEND
  #error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
  #endif

(or you could just use REG_STARTEND and let the compiler complain, but
then the user has to figure out the right knob to twiddle).
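
With that guard in place, the wrapper itself can stay tiny. A minimal
sketch, assuming the regexec_buf() name used above and a signature that
mirrors regexec() with a ptr/len pair (the exact signature is an
assumption here, not a quote from the series):

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#ifndef REG_STARTEND
#error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
#endif

/*
 * Sketch of a ptr/len variant of regexec(). With REG_STARTEND the
 * input range is passed in via pmatch[0], so the caller must supply
 * at least one regmatch_t even if it does not care about offsets.
 */
static int regexec_buf(const regex_t *preg, const char *buf, size_t size,
		       size_t nmatch, regmatch_t pmatch[], int eflags)
{
	assert(nmatch > 0 && pmatch);
	pmatch[0].rm_so = 0;
	pmatch[0].rm_eo = (regoff_t)size;
	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
}
```

A caller that used to write regexec(&re, line, 1, m, 0) on a
NUL-terminated line would instead pass the length it already knows,
as in regexec_buf(&re, line, len, 1, m, 0), with no copy involved.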

One other question about REG_STARTEND is: what does it do with NULs
inside the buffer? Certainly glibc (and our compat/regex) treat it as a
buffer with a particular length and ignore embedded NULs, as we want.
But the NetBSD documentation says only:

     REG_STARTEND    The string is considered to start at string +
                     pmatch[0].rm_so and to have a terminating NUL
                     located at string + pmatch[0].rm_eo (there need not
                     actually be a NUL at that location),

Besides avoiding a segfault, one of the benefits of regexec_buf() is
that we will now find pickaxe-regex strings inside mixed binary/text
files. But it's not clear to me that NetBSD's implementation does this.
I guess we can assume it is fine (it is certainly no _worse_ than the
current behavior), and if people's platforms do not handle it, they can
build with NO_REGEX.
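
For anyone who wants to check their own platform, a small probe along
these lines should answer the question. It is only a sketch: the helper
name startend_sees_past_nul() is made up, and it assumes REG_STARTEND
is defined at all. On glibc, given the behavior described above, it
should report that the match after the NUL is found.

```c
#include <regex.h>

/*
 * Hypothetical probe. Returns 1 if a REG_STARTEND search finds a match
 * located after an embedded NUL, 0 if it does not, and -1 if the probe
 * itself could not run (regcomp() failed or regexec() reported an
 * error other than "no match").
 */
static int startend_sees_past_nul(void)
{
	static const char buf[] = "text\0binary needle";
	regex_t re;
	regmatch_t m[1];
	int ret;

	if (regcomp(&re, "needle", REG_EXTENDED))
		return -1;
	m[0].rm_so = 0;
	m[0].rm_eo = sizeof(buf) - 1;	/* exclude the literal's own trailing NUL */
	ret = regexec(&re, buf, 1, m, REG_STARTEND);
	regfree(&re);
	if (ret == 0)
		return 1;
	return ret == REG_NOMATCH ? 0 : -1;
}
```

A result of 0 would mean the library treats the embedded NUL as the end
of the string, which is the case where building with NO_REGEX makes
sense.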

-Peff