On Mon, Jun 11, 2012 at 06:13:03AM -0600, Eric Blake wrote:
> On 06/10/2012 06:43 AM, Rich Felker wrote:
>
> >> Come to think of it, a variant might even work for seekable files.
> >> Use dup2 to move the file descriptor somewhere else. Close the
> >> fd. Keep reading until error, and count the bytes read. Then
> >> ungetc all the bytes that you read, in reverse order, and restore
> >> the file descriptor. Of course ISO C doesn't guarantee this, but
> >> it should be fairly portable in practice.
> >
> > No, ungetc normally can only unget one character. musl is fairly
> > unique in allowing you to unget more,
>
> Wrong. Pretty much every libc out there lets you ungetc() more than one
> byte. It's just that no one exploits that fact, because ISO C99 doesn't
> guarantee that it will work, and POSIX hasn't added any wording to
> require it to work either.

I've seen at least one implementation (can't remember which now;
uclibc perhaps?) that actually went to some trouble to prevent you
from ungetting more than one character.

In any case, the reason you cannot unget multiple characters is not
just arbitrary; it's that you can't make assumptions about whether/how
the implementation is using the buffer, and whether there's space in
the buffer for additional characters. For instance an implementation
that wants to keep the buffer unmodified (to allow seek-in-buffer, or
if the buffer is a read-only mapping of the underlying file) will need
additional space outside the main buffer to store unget, and you have
no idea how much additional space is available. Other implementations
(musl included) will put the characters directly back into the buffer
(musl reserves some extra space just before the buffer for unget on an
empty buffer), but you still can't be sure how much is available at
any given time.

Fortunately ungetc does not invoke UB if you try to unget too much,
though; it reports the error.

> In fact, most implementations of fscanf() use more than one ungetc()
> when encountering multi-byte ambiguous inputs. For example, when
> parsing "%g" against the partial input "1.e+", whether you push back the
> multiple character sequence "e+" or consume it in addition to the next
> byte depends on whether the 5th byte is numeric.

The behavior you are describing is a bug in glibc. scanf cannot push
back the "e+". Per the C standard, if the next character is not a
digit, you must push back only that next character (consuming the
"e+") and return with a matching failure. The relevant language (from
7.19.6.2) is:

"An input item is read from the stream, unless the specification
includes an n specifier.
An input item is defined as the longest sequence of input
characters which does not exceed any specified field width and
which is, or is a prefix of, a matching input sequence.251) The
first character, if any, after the input item remains unread."

"251) fscanf pushes back at most one input character onto the
input stream. Therefore, some sequences that are acceptable to
strtod, strtol, etc., are unacceptable to fscanf."

Or, you can hear it from Fred J. Tydeman, Vice-chair of PL22.11:

http://newsgroups.derkeiler.com/Archive/Comp/comp.std.c/2009-09/msg00045.html

There is an open glibc bug report on this that has a chance of finally
getting fixed now that Drepper is gone:

http://sourceware.org/bugzilla/show_bug.cgi?id=12701

Rich
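
P.S. For anyone who wants to see where a given libc draws the line on
pushback, here is a minimal sketch. It relies only on the point made
above that ungetc reports failure (returns EOF) rather than invoking UB
when it runs out of room. The helper name count_pushback and its cap
parameter are made up for illustration; they are not part of any libc.
The cap exists because some implementations may grow their pushback
storage on demand, so the loop needs an upper bound.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from any libc): push back up to `cap` bytes
 * on `f` and return how many ungetc() calls succeeded before the first
 * failure. `cap` bounds the loop in case the implementation keeps
 * accepting pushback indefinitely. */
static size_t count_pushback(FILE *f, size_t cap)
{
    size_t n = 0;

    /* ungetc() returns EOF on failure, so it is safe to keep trying. */
    while (n < cap && ungetc('a' + (int)(n % 26), f) != EOF)
        n++;

    /* Pushed-back bytes are returned in reverse order of pushing.
     * Read them all back so no pushback is left pending on the stream. */
    for (size_t i = n; i > 0; i--) {
        int c = getc(f);
        assert(c == 'a' + (int)((i - 1) % 26));
    }
    return n;
}
```

Called on a freshly opened stream, e.g. count_pushback(tmpfile(), 64),
this should return at least 1 everywhere, since one byte of pushback is
all ISO C guarantees. Anything beyond that is the implementation's
choice and can differ with the state of the buffer at the time of the
call, so the result is not something a portable program can depend on.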