Re: gnulib portability issues

Rich Felker Mon, 11 Jun 2012 06:12:57 -0700

On Mon, Jun 11, 2012 at 06:13:03AM -0600, Eric Blake wrote:
> On 06/10/2012 06:43 AM, Rich Felker wrote:
> 
> >> Come to think of it, a variant might even work for seekable files.
> >> Use dup2 to move the file descriptor somewhere else.  Close the
> >> fd.  Keep reading until error, and count the bytes read.  Then
> >> ungetc all the bytes that you read, in reverse order, and restore
> >> the file descriptor.  Of course ISO C doesn't guarantee this, but
> >> it should be fairly portable in practice.
> > 
> > No, ungetc normally can only unget one character. musl is fairly
> > unique in allowing you to unget more,
> 
> Wrong.  Pretty much every libc out there lets you ungetc() more than one
> byte.  It's just that no one exploits that fact, because ISO C99 doesn't
> guarantee that it will work, and POSIX hasn't added any wording to
> require it to work either.


I've seen at least one implementation (I can't remember which now;
uClibc perhaps?) that actually went to some trouble to prevent you
from ungetting more than one character.

In any case, the reason you cannot unget multiple characters is not
just arbitrary; it's that you can't make assumptions about whether/how
the implementation is using the buffer, and whether there's space in
the buffer for additional characters. For instance an implementation
that wants to keep the buffer unmodified (to allow seek-in-buffer, or
if the buffer is a read-only mapping of the underlying file) will need
additional space outside the main buffer to store unget, and you have
no idea how much additional space is available. Other implementations
(musl included) will put the characters directly back into the buffer
(musl reserves some extra space just before the buffer for unget on an
empty buffer), but you still can't be sure how much is available at
any given time. Fortunately, ungetc does not invoke undefined behavior
if you try to unget too much; it returns EOF to report the failure.
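
To make the point concrete, here is a minimal sketch (not taken from
any libc; the helper name count_pushback and the cap are made up for
illustration) that probes how many consecutive ungetc calls a stream
accepts. ISO C only guarantees one; anything beyond that is whatever
the implementation happens to have room for, and some implementations
may accept every call up to the cap.

```c
#include <stdio.h>

/* Hypothetical helper: count how many consecutive ungetc() calls
 * succeed on a freshly read stream, stopping at `limit`.
 * Returns -1 if the temporary file cannot be set up. */
int count_pushback(int limit)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    if (fputs("abc", f) == EOF) {
        fclose(f);
        return -1;
    }
    rewind(f);

    /* Read one byte so the stream is in read mode. */
    if (fgetc(f) != 'a') {
        fclose(f);
        return -1;
    }

    int n = 0;
    /* ungetc() returns EOF when it cannot push back any more. */
    while (n < limit && ungetc('x', f) != EOF)
        n++;

    fclose(f);
    return n;
}
```

Calling count_pushback(64) should return at least 1 on any conforming
implementation; the exact value above 1 is not something portable code
can rely on.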

> In fact, most implementations of fscanf() use more than one ungetc()
> when encountering multi-byte ambiguous inputs.  For example, when
> parsing "%g" against the partial input "1.e+", whether you push back the
> multiple character sequence "e+" or consume it in addition to the next
> byte depends on whether the 5th byte is numeric.

The behavior you are describing is a bug in glibc. scanf cannot push
back the "e+". Per the C standard, if the next character is not a
digit, you must push back only that next character (consuming the
"e+") and return with a matching failure.

The relevant language (from 7.19.6.2) is:

"An input item is read from the stream, unless the specification
includes an n specifier. An input item is defined as the longest
sequence of input characters which does not exceed any specified field
width and which is, or is a prefix of, a matching input sequence.251)
The first character, if any, after the input item remains unread."

"251) fscanf pushes back at most one input character onto the input
stream. Therefore, some sequences that are acceptable to strtod,
strtol, etc., are unacceptable to fscanf."
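
As a rough way to check what a given libc does here, the following
sketch (the names struct probe and probe_scanf are hypothetical, and it
assumes tmpfile is usable) runs fscanf with "%g" on an input and then
reads the next character. Per the text quoted above, the input "1.e+x"
should give a matching failure (return value 0) with 'x' as the next
character; an implementation that pushes back "e+" would be expected to
report a successful conversion and leave 'e' as the next character.

```c
#include <stdio.h>

/* Hypothetical result record for one probe. */
struct probe {
    int rc;     /* return value of fscanf() */
    int next;   /* next character left on the stream, or EOF */
    float val;  /* converted value, 0 if nothing was stored */
};

/* Write `input` to a temporary file, scan it with "%g", and record
 * what fscanf() returned and which character it left unread.
 * Returns 0 on success, -1 if the temporary file cannot be set up. */
int probe_scanf(const char *input, struct probe *out)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    if (fputs(input, f) == EOF) {
        fclose(f);
        return -1;
    }
    rewind(f);

    out->val = 0.0f;
    out->rc = fscanf(f, "%g", &out->val);
    out->next = fgetc(f);

    fclose(f);
    return 0;
}
```

Running probe_scanf("1.e+x", &p) and inspecting p.rc and p.next shows
which of the two behaviors the libc under test implements.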

Or, you can hear it from Fred J. Tydeman, Vice-chair of PL22.11:

http://newsgroups.derkeiler.com/Archive/Comp/comp.std.c/2009-09/msg00045.html

There is an open glibc bug report on this that has a chance of finally
getting fixed now that Drepper is gone:

http://sourceware.org/bugzilla/show_bug.cgi?id=12701


Rich
