Re: gawk regex stuff you may want

Aharon Robbins Wed, 20 Jan 2016 12:00:06 -0800

Hi Paul.

> Here are comments on the parts of the diff not incorporated in this round:
>
> > -static const char __re_error_msgid[] =
> > +static const char __re_error_msgid[] attribute_hidden =
>
> Since the constant is static, there should be no need for attribute_hidden, 
> as 
> the constant is used only in this compilation unit. Similarly for other 
> introductions of attribute_hidden.


Thanks, I'll review that.

> > -  re_dfa_t *dfa = bufp->buffer;
> > +  re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
>
> This cast is not needed, as bufp->buffer is already of the proper type. 
> Similarly elsewhere.

I don't see that. In regex.h:

| struct re_pattern_buffer
| {
|   /* Space that holds the compiled pattern.  It is declared as
|      `unsigned char *' because its elements are sometimes used as
|      array indexes.  */
|   unsigned char *__REPB_PREFIX(buffer);
| ...
| typedef struct re_pattern_buffer regex_t;

And this is:

| static bin_tree_t *
| parse (re_string_t *regexp, regex_t *preg, reg_syntax_t syntax,
|        reg_errcode_t *err)
| {
|   re_dfa_t *dfa = (re_dfa_t *) preg->buffer;

So the cast seems necessary.

> > -      preg->buffer = dfa;
> > +      preg->buffer = (unsigned char *) dfa;
>
> This cast seems counterproductive, as dfa is already of the proper type
> and the unsigned char * is not the proper type.

Same argument for the cast.

> > -  dfa->subexp_map = re_malloc (Idx, preg->re_nsub);
> > +  /* some malloc()-checkers don't like zero allocations */
> > +  if (preg->re_nsub > 0)
> > +    dfa->subexp_map = re_malloc (int, preg->re_nsub);
> > +  else
> > +    dfa->subexp_map = NULL;
> > +
>
> Since malloc (0) is well-defined to return either NULL or a valid pointer
> to a zero-byte object that can be freed, the code is working as-is.

Not on older systems. Although yes, I'm not sure I still support such
systems.  I don't know which checker complained. I think it was valgrind
at some point in the past.

> I'd rather look for a solution that involves silencing the checking
> without making the code bulkier and typically slower.

This is in compilation, not execution; not sure the difference
is really noticable.

> > +    /*
> > +     * Fedora Core 2, maybe others, have broken `btowc' that returns -1
> > +     * for any value > 127. Sigh. Note that `start_ch' and `end_ch' are
> > +     * unsigned, so we don't have sign extension problems.
> > +     */
> >      start_wc = ((start_elem->type == SB_CHAR || start_elem->type == 
> > COLL_SYM)
> > -           ? __btowc (start_ch) : start_elem->opr.wch);
> > +           ? start_ch : start_elem->opr.wch);
> >      end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
> > -         ? __btowc (end_ch) : end_elem->opr.wch);
> > +         ? end_ch : end_elem->opr.wch);
> >      if (start_wc == WEOF || end_wc == WEOF)
> >        return REG_ECOLLATE;
>
> I don't see how this patch is helpful. If btowc returns -1 we are looking
> at an encoding error, and we can't treat that byte as if it were a
> character. In some single-byte locales, btowc returns 1 for values < 128,
> and they're encoding errors too. Perhaps you could mention a use case?

Gawk's profile5 test fails without this.  This code way predates that
test though.

If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).

> >        /* Build single byte matching table for this equivalence class.  */
> > +      char_buf[1] = (unsigned char) '\0';
>
> This should be unnecessary, as the rest of the code shouldn't be looking
> at that byte. Is this something to pacify a lint checker?

Or valgrind at some point. I think a static checker that someone submitted
a report from.

> > -   wc = wc2;
> > +   wc = (wint_t) wc2;
>
> wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.

There was a compiler (maybe VMS) for which this was necessary.

Thanks,

Arnold

Re: gawk regex stuff you may want

Reply via email to