UTF-8 support for fold(1)

Ingo Schwarze Thu, 19 May 2016 15:05:56 -0700

Hi,

here is a UTF-8 capable version of the POSIX-specified fold(1)
utility.  It was surprisingly difficult to implement a functionally
perfect version, basically requiring a complete rewrite of the
central function fold().


First, I'd like to stress one basic design decision that we made
for all our small non-interactive, filter-style userland utilities,
and that we can only maintain because we are supporting UTF-8 only,
which is stateless.  To my knowledge, no other implementation can
achieve the following.  When encountering invalid bytes that do not
form UTF-8 characters, we never abort the program, and we never
delete, modify, or insert bytes to make the character stream valid.
The only modifications done are those that the utility in question
is intended to make, in this case, inserting some '\n' bytes, but
never inserting other bytes or modifying or deleting any bytes.
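
The byte-preservation rule can be stated as a checkable property:
the output must be the input with nothing but '\n' bytes added.
A minimal sketch of such a check follows.  The function name
only_newlines_inserted() is made up for illustration and is not
part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return 1 if
 * out[0..outlen) can be obtained from in[0..inlen) by inserting
 * '\n' bytes only, 0 otherwise.  Bytes are compared verbatim,
 * so invalid UTF-8 in the input is not a problem.
 */
static int
only_newlines_inserted(const char *in, size_t inlen,
    const char *out, size_t outlen)
{
        size_t i = 0, j = 0;

        assert(in != NULL && out != NULL);

        while (j < outlen) {
                if (i < inlen && out[j] == in[i]) {
                        /* Byte copied through unchanged. */
                        i++;
                        j++;
                } else if (out[j] == '\n') {
                        /* Line break inserted by the utility. */
                        j++;
                } else {
                        /* Some other byte was inserted or modified. */
                        return 0;
                }
        }
        /* Every input byte must have been copied. */
        return i == inlen;
}
```

For example, with input "12345" the output "12\n34\n5" passes
this check, while any output that drops or alters a byte does not.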

As in the somewhat similar case of fmt(1) handled earlier, an
approach using a utf8.c file with an isolated helper function is
not helpful here.  Calls to mbtowc(3) and wcwidth(3) form an integral
part of the algorithm, such that trying to wrap them away would not
provide any value.

Note that it's no longer possible to simply write out the whole
buffer when a newline is encountered in the input because the newline
might interrupt an incomplete UTF-8 sequence, turning it into a
sequence of invalid bytes requiring more than one output column,
such that a line break needs to be inserted *earlier* than at the
newline character.
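
To make the problem concrete: the euro sign is the three bytes
\342\202\254 and typically one column wide, but if a newline
arrives after \342\202, those two bytes are invalid, and in the
patch each invalid byte counts as one column (mbtowc(3) fails, so
len and width stay 1).  The following sketch classifies such a
prefix.  It is hand-rolled for illustration and assumes nothing but
the UTF-8 bit layout; the patch itself relies on mbtowc(3), the
names u8_seqlen() and u8_classify() are hypothetical, and overlong
and surrogate encodings are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

enum u8_state { U8_COMPLETE, U8_NEED_MORE, U8_INVALID };

/* Length implied by a lead byte, or 0 if it cannot start a sequence. */
static size_t
u8_seqlen(unsigned char lead)
{
        if (lead < 0x80)
                return 1;
        if (lead >= 0xc2 && lead <= 0xdf)
                return 2;
        if (lead >= 0xe0 && lead <= 0xef)
                return 3;
        if (lead >= 0xf0 && lead <= 0xf4)
                return 4;
        return 0;
}

/*
 * Classify the sequence starting at p[0], given n buffered bytes.
 * *nbytes is set to the number of bytes that belong to it: the
 * whole character if complete, else the bytes seen so far.
 */
static enum u8_state
u8_classify(const unsigned char *p, size_t n, size_t *nbytes)
{
        size_t need, i;

        assert(p != NULL && nbytes != NULL && n > 0);

        need = u8_seqlen(p[0]);
        if (need == 0) {
                /* Stray continuation byte or invalid lead byte. */
                *nbytes = 1;
                return U8_INVALID;
        }
        /* Count the continuation bytes that actually follow. */
        for (i = 1; i < need && i < n; i++)
                if ((p[i] & 0xc0) != 0x80)
                        break;
        *nbytes = i;
        if (i == need)
                return U8_COMPLETE;
        /* Out of data: undecided.  Otherwise it was cut short. */
        return i == n ? U8_NEED_MORE : U8_INVALID;
}
```

Given the bytes \342\202\n, u8_classify() reports U8_INVALID with
*nbytes == 2: two columns are needed before the newline, where the
complete character would have needed one.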

Here are three alternative approaches that might seem superior at
first sight, but have serious downsides and were consequently
avoided:

 1. Using getline(3), which we used for many other utilities, would
    be a bad idea in this case because fold(1) is specifically
    intended to be used on files that contain no line breaks yet,
    or only a few of them, so very long input lines are a legitimate
    use case.

 2. It would be possible to read byte-by-byte and test with mbrtowc(3)
    after each byte read, then handle incomplete and invalid sequences
    differently.  That wouldn't simplify the code, though, because
    it doesn't change the fact that a newline might interrupt an
    incomplete sequence, still causing the complication already
    explained above.

 3. As usual, the getwc(3) and fgetws(3) interfaces are unusable
    when adhering to the design decision explained above.  When an
    encoding error occurs, they make it impossible to retrieve the
    invalid bytes.

The old function new_column_position() is too simplistic for the
UTF-8 case in several respects, so the code had to be rearranged
such that its new equivalent occurs in a single place, as the
cornerstone of the changed algorithm.
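
The column arithmetic that takes its place can be summarized as a
pure function.  This is a sketch for illustration, not the patch's
code: the name next_column() and its parameters are hypothetical,
and the caller is assumed to have already obtained len from
mbtowc(3) and wcw from wcwidth(3), passing len = 1 and wcw = -1
for an invalid byte.  In the patch itself, newline and carriage
return are handled earlier, at the point where the buffer is
written out.

```c
#include <assert.h>

/*
 * Hypothetical summary of the width rules: return the column
 * position after a character whose first byte is `first`, which
 * is `len` bytes long and has wcwidth(3) result `wcw`.
 */
static unsigned int
next_column(unsigned int col, unsigned char first, int len, int wcw,
    int count_bytes)
{
        assert(len >= 1);

        if (count_bytes)
                return col + len;       /* -b: each byte counts as one. */

        switch (first) {
        case '\b':
                return col > 0 ? col - 1 : 0;
        case '\r':
                return 0;
        case '\t':
                /* Same result as the old (col + 8) & ~7. */
                return col + 8 - (col & 7);
        default:
                /* Invalid bytes and non-printable characters: width 1. */
                return col + (wcw < 0 ? 1 : wcw);
        }
}
```

For instance, a double-width character of three bytes starting in
column 1 ends in column 3 by default, but in column 4 with -b.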

A few minor cleanups are rolled in (sort headers, drop NOTREACHED,
return from main).

Changes to the manual:

 - Document semantics of backspace, carriage return, and tab,
   which are required by POSIX and were completely missing.
 - Document the effect of LC_CTYPE.
 - Delete the misleading BUGS entry about backspace encoding.
   For single-width characters, fold(1) does not mess up
   backspace encoding.  For double-width characters, that's merely
   a corollary of the more general bug in the POSIX specification,
   see below.
 - Delete the misleading BUGS entry about tabs.  If a tab exceeds
   the display width, it gets moved to the next line just like any
   other character, and I can't see what might be buggy about that.
 - Describe the bug that is indeed present in the POSIX standard
   and that we faithfully implement.

OK?
  Ingo


Index: usr.bin/fold/fold.1
===================================================================
RCS file: /cvs/src/usr.bin/fold/fold.1,v
retrieving revision 1.17
diff -u -p -r1.17 fold.1
--- usr.bin/fold/fold.1 5 Jan 2016 12:44:55 -0000       1.17
+++ usr.bin/fold/fold.1 19 May 2016 21:43:28 -0000
@@ -48,7 +48,7 @@ or the standard input if no files are sp
 breaking the lines to have a maximum of 80 display columns.
 .Pp
 The options are as follows:
-.Bl -tag -width Ds
+.Bl -tag -width 8n
 .It Fl b
 Count
 .Ar width
@@ -62,10 +62,31 @@ possible.
 .It Fl w Ar width
 Specifies a line width to use instead of the default of 80.
 .El
+.Pp
+Unless
+.Fl b
+is specified, a backspace character decrements the column position
+by one, a carriage return resets the column position to zero, and
+a tab advances the column position to the next multiple of eight.
+.Sh ENVIRONMENT
+.Bl -tag -width 8n
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 .
+It is used to decide which byte sequences form characters and what
+their display width is.
+If it is unset or set to
+.Qq C ,
+.Qq POSIX ,
+or an unsupported value, each byte except backspace, tab, newline,
+and carriage return is assumed to represent a character of display
+width 1.
+.El
 .Sh EXIT STATUS
 .Ex -std fold
 .Sh SEE ALSO
-.Xr expand 1
+.Xr expand 1 ,
+.Xr fmt 1
 .Sh STANDARDS
 The
 .Nm
@@ -100,15 +121,17 @@ rewrote the command in 1990, and
 .An J. T. Conklin
 added the missing options in 1993.
 .Sh BUGS
-If underlining (see
-.Xr ul 1 )
-is present it may be messed up by folding.
-.Pp
-.Ar width
-should be a multiple of 8 if tabs are present, or the tabs should
-be expanded using
-.Xr expand 1
-before using
-.Nm fold .
-.Pp
-Multibyte character support is missing.
+Traditional
+.Xr roff 7
+output semantics, implemented both by GNU nroff and by
+.Xr mandoc 1 ,
+only uses a single backspace for backing up the previous character,
+even for double-width characters.
+The
+.Nm
+backspace semantics required by POSIX mishandles such backspace-encoded
+sequences, breaking lines early.
+The
+.Xr fmt 1
+utility provides similar functionality and does not suffer from that
+problem, but isn't standardized by POSIX.
Index: usr.bin/fold/fold.c
===================================================================
RCS file: /cvs/src/usr.bin/fold/fold.c,v
retrieving revision 1.17
diff -u -p -r1.17 fold.c
--- usr.bin/fold/fold.c 9 Oct 2015 01:37:07 -0000       1.17
+++ usr.bin/fold/fold.c 19 May 2016 21:43:28 -0000
@@ -33,19 +33,22 @@
  * SUCH DAMAGE.
  */
 
+#include <ctype.h>
+#include <err.h>
+#include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
-#include <ctype.h>
-#include <err.h>
-#include <limits.h>
+#include <wchar.h>
 
 #define        DEFLINEWIDTH    80
 
 static void fold(unsigned int);
-static unsigned int new_column_position(unsigned int, int);
+static int isu8cont(unsigned char);
 static __dead void usage(void);
+
 int count_bytes = 0;
 int split_words = 0;
 
@@ -56,6 +59,8 @@ main(int argc, char *argv[])
        unsigned int width;
        const char *errstr;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio rpath", NULL) == -1)
                err(1, "pledge");
 
@@ -110,12 +115,11 @@ main(int argc, char *argv[])
                for (; *argv; ++argv) {
                        if (!freopen(*argv, "r", stdin))
                                err(1, "%s", *argv);
-                               /* NOTREACHED */
                        else
                                fold(width);
                }
        }
-       exit(0);
+       return 0;
 }
 
 /*
@@ -130,100 +134,131 @@ main(int argc, char *argv[])
  * returns embedded in the input stream.
  */
 static void
-fold(unsigned int width)
+fold(unsigned int max_width)
 {
-       static char *buf = NULL;
-       static int   buf_max = 0;
-       int ch;
-       unsigned int col, indx;
-
-       col = indx = 0;
-       while ((ch = getchar()) != EOF) {
-               if (ch == '\n') {
-                       if (indx != 0)
-                               fwrite(buf, 1, indx, stdout);
-                       putchar('\n');
-                       col = indx = 0;
-                       continue;
-               }
+       static char     *buf = NULL;
+       static size_t    bufsz = 2048;
+       char            *cp;    /* Current mb character. */
+       char            *np;    /* Next mb character. */
+       char            *sp;    /* To search for the last space. */
+       wchar_t          wc;    /* Current wide character. */
+       int              ch;    /* Last byte read. */
+       int              len;   /* Bytes in the current mb character. */
+       unsigned int     col;   /* Current display position. */
+       int              width; /* Display width of wc. */
+
+       if (buf == NULL && (buf = malloc(bufsz)) == NULL)
+               err(1, NULL);
+
+       np = cp = buf;
+       ch = 0;
+       col = 0;
+
+       while (ch != EOF) {  /* Loop on input characters. */
+               while ((ch = getchar()) != EOF) {  /* Loop on input bytes. */
+                       if (np + 1 == buf + bufsz) {
+                               buf = reallocarray(buf, 2, bufsz);
+                               if (buf == NULL)
+                                       err(1, NULL);
+                               bufsz *= 2;
+                       }
+                       *np++ = ch;
 
-               col = new_column_position(col, ch);
-               if (col > width) {
-                       unsigned int i, last_space;
-
-                       if (split_words) {
-                               for (i = 0, last_space = -1; i < indx; i++)
-                                       if(buf[i] == ' ')
-                                               last_space = i;
+                       /*
+                        * Read up to and including the first byte of
+                        * the next character, such that we are sure
+                        * to have a complete character in the buffer.
+                        * There is no need to read more than five bytes
+                        * ahead, since UTF-8 characters are four bytes
+                        * long at most.
+                        */
+
+                       if (np - cp > 4 || (np - cp > 1 && !isu8cont(ch)))
+                               break;
+               }
+
+               while (cp < np) {  /* Loop on output characters. */
+
+                       /* Handle end of line and backspace. */
+
+                       if (*cp == '\n' || (*cp == '\r' && !count_bytes)) {
+                               fwrite(buf, 1, ++cp - buf, stdout);
+                               memmove(buf, cp, np - cp);
+                               np = buf + (np - cp);
+                               cp = buf;
+                               col = 0;
+                               continue;
+                       }
+                       if (*cp == '\b' && !count_bytes) {
+                               if (col)
+                                       col--;
+                               cp++;
+                               continue;
                        }
 
-                       if (split_words && last_space != -1) {
-                               last_space++;
+                       /*
+                        * Measure display width.
+                        * Process the last byte only if
+                        * end of file was reached.
+                        */
+
+                       if (np - cp > (ch != EOF)) {
+                               len = 1;
+                               width = 1;
+
+                               if (*cp == '\t') {
+                                       if (count_bytes == 0)
+                                               width = 8 - (col & 7);
+                               } else if ((len = mbtowc(&wc, cp,
+                                   np - cp)) == -1)
+                                       len = 1;
+                               else if (count_bytes)
+                                       width = len;
+                               else if ((width = wcwidth(wc)) < 0)
+                                       width = 1;
+
+                               col += width;
+                               if (col <= max_width || cp == buf) {
+                                       cp += len;
+                                       continue;
+                               }
+                       }
 
-                               fwrite(buf, 1, last_space, stdout);
-                               memmove(buf, buf+last_space, indx-last_space);
+                       /* Line break required. */
 
-                               indx -= last_space;
-                               col = 0;
-                               for (i = 0; i < indx; i++) {
-                                       col = new_column_position(col, buf[i]);
+                       if (col > max_width) {
+                               if (split_words) {
+                                       for (sp = cp; sp > buf; sp--) {
+                                               if (sp[-1] == ' ') {
+                                                       cp = sp;
+                                                       break;
+                                               }
+                                       }
                                }
-                       } else {
-                               fwrite(buf, 1, indx, stdout);
-                               col = indx = 0;
+                               fwrite(buf, 1, cp - buf, stdout);
+                               putchar('\n');
+                               memmove(buf, cp, np - cp);
+                               np = buf + (np - cp);
+                               cp = buf;
+                               col = 0;
+                               continue;
                        }
-                       putchar('\n');
 
-                       /* calculate the column position for the next line. */
-                       col = new_column_position(col, ch);
-               }
+                       /* Need more input. */
 
-               if (indx + 1 > buf_max) {
-                       int newmax = buf_max + 2048;
-                       char *newbuf;
-
-                       /* Allocate buffer in LINE_MAX increments */
-                       if ((newbuf = realloc(buf, newmax)) == NULL) {
-                               err(1, NULL);
-                               /* NOTREACHED */
-                       }
-                       buf = newbuf;
-                       buf_max = newmax;
+                       break;
                }
-               buf[indx++] = ch;
        }
+       fwrite(buf, 1, np - buf, stdout);
 
-       if (indx != 0)
-               fwrite(buf, 1, indx, stdout);
+       if (ferror(stdin))
+               err(1, NULL);
 }
 
-/*
- * calculate the column position 
- */
-static unsigned int
-new_column_position(unsigned int col, int ch)
+static int
+isu8cont(unsigned char c)
 {
-       if (!count_bytes) {
-               switch (ch) {
-               case '\b':
-                       if (col > 0)
-                               --col;
-                       break;
-               case '\r':
-                       col = 0;
-                       break;
-               case '\t':
-                       col = (col + 8) & ~7;
-                       break;
-               default:
-                       ++col;
-                       break;
-               }
-       } else {
-               ++col;
-       }
-
-       return col;
+       return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == 0x80;
 }
 
 static __dead void
Index: regress/usr.bin/fold/fold.sh
===================================================================
RCS file: /cvs/src/regress/usr.bin/fold/fold.sh,v
retrieving revision 1.1
diff -u -p -r1.1 fold.sh
--- regress/usr.bin/fold/fold.sh        3 May 2016 16:06:11 -0000       1.1
+++ regress/usr.bin/fold/fold.sh        19 May 2016 21:43:28 -0000
@@ -14,11 +14,18 @@
 # ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 # OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 
+FOLD=/usr/bin/fold
+
+# Arguments of the test function:
+# 1. command line arguments for fold(1)
+# 2. standard input for fold, backslash-encoded
+# 3. expected standard output, backslash-encoded
+# 4. expected standard output of "fold -b", backslash-encoded
+#    (optional, by default the same as argument 3.)
 test_fold()
 {
        expect=`echo -n "$3" ; echo .`
-    if [ $SKIPUTF8 -eq 0 ]; then
-       result=`echo -n "$2" | fold $1 2>&1 ; echo .`
+       result=`echo -n "$2" | $FOLD $1 2>&1 ; echo .`
        if [ "$result" != "$expect" ]; then
                echo "fold $1 \"$2\":"
                echo -n "$2" | hexdump -C
@@ -28,9 +35,8 @@ test_fold()
                echo -n "$result" | hexdump -C
                exit 1
        fi
-    fi
        [ -n "$4" ] && expect=`echo -n "$4" ; echo .`
-       result=`echo -n "$2" | fold -b $1 2>&1 ; echo .`
+       result=`echo -n "$2" | $FOLD -b $1 2>&1 ; echo .`
        if [ "$result" != "$expect" ]; then
                echo "fold -b $1 \"$2\":"
                echo -n "$2" | hexdump -C
@@ -44,17 +50,21 @@ test_fold()
 
 export LC_ALL=C
 
-SKIPUTF8=0
-
 test_fold "" "" ""
+
+# newline
 test_fold "" "\n" "\n"
 test_fold "" "\n\n" "\n\n"
 test_fold "-w 1" "\n\n" "\n\n"
+test_fold "-w 2" "1\n12\n123" "1\n12\n12\n3"
+test_fold "-w 2" "12345" "12\n34\n5"
+test_fold "-w 2" "12345\n" "12\n34\n5\n"
 
 # backspace
 test_fold "-w 2" "123" "12\n3" 
 test_fold "-w 2" "1\b234" "1\b23\n4" "1\b\n23\n4"
 test_fold "-w 2" "\b1234" "\b12\n34" "\b1\n23\n4"
+test_fold "-w 2" "12\b\b345" "12\b\b34\n5" "12\n\b\b\n34\n5"
 test_fold "-w 2" "12\r3" "12\r3" "12\n\r3"
 
 # tabulator
@@ -66,20 +76,34 @@ test_fold "-w 9" "1\t9\b\b89012" "1\t9\b
 test_fold "-sw 4" "1 23 45" "1 \n23 \n45"
 test_fold "-sw 3" "1234 56" "123\n4 \n56"
 
-export LC_ALL=en_US.UTF-8
-
 # invalid characters
 test_fold "-w 3" "1\037734" "1\03773\n4"
 test_fold "-w 3" "1\000734" "1\00073\n4"
 
-SKIPUTF8=1
+export LC_ALL=en_US.UTF-8
 
 # double width characters
 test_fold "-w 4" "1\0343\0201\020145" "1\0343\0201\02014\n5" \
                "1\0343\0201\0201\n45"
+test_fold "-w 3" "\0343\0201\0201\0343\0201\020134" \
+               "\0343\0201\0201\n\0343\0201\02013\n4" \
+               "\0343\0201\0201\n\0343\0201\0201\n34"
+test_fold "-w 2" "\0343\0201\0201\b23" "\0343\0201\0201\b2\n3" \
+               "\0343\0201\0201\n\b2\n3"
+test_fold "-w 1" "1\0343\0201\02014" "1\n\0343\0201\0201\n4"
 
 # zero width characters
-test_fold "-w 3" "1a\0314\020034" "1a\0314\02003\n4" "1a\0314\n\020034"
+test_fold "-w 3" "1a\0314\020034" "1a\0314\02003\n4" "1a\n\0314\02003\n4"
 test_fold "-w 2" "1a\0314\02003" "1a\0314\0200\n3" "1a\n\0314\0200\n3"
+
+# four byte UTF-8 encoding
+test_fold "-w 3" "1\0360\0220\0200\020034" "1\0360\0220\0200\02003\n4" \
+               "1\n\0360\0220\0200\0200\n34"
+
+# invalid UTF-8
+test_fold "-w 3" "\0343\0201\0201\0201\0201\0201\0201\0201\n" \
+               "\0343\0201\0201\0201\n\0201\0201\0201\n\0201\n" \
+               "\0343\0201\0201\n\0201\0201\0201\n\0201\0201\n"
+test_fold "-w 2" "\0343\0343\0201\0201\n" "\0343\n\0343\0201\0201\n"
 
 exit 0
