Re: UTF-8 support for uniq(1)

Ingo Schwarze Fri, 11 Dec 2015 03:41:24 -0800

Oh well.  If something is supposedly simple...

I got three OKs on this code, including one senior developer calling
it simple and saying it shows how something like this should be done.


Fortunately, Patrick Keshishian privately mailed me that he suspected
a regression.  Even though the regression isn't in the loop he
pointed at, exactly what he said is broken in the *other* loop.
Here is how, with the first patch I sent:

   $ echo "1 t\n2\tt" | ./obj/uniq -f 1
   1 t
   $

The second loop advances *str even for the first blank after the
last field that is to be skipped, so if the first significant blank
differs, output is wrong.  (The first loop also skips the first non-
blank in each field, but that's not a problem because the second loop
is intended to do that anyway.  Besides, the first loop called
mbtowc(3) on the null string, but that wasn't a problem because
it returns a length of 0.)
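
For illustration only, here is a minimal standalone sketch of the
field-skipping idea, assuming the loop must stop *on* the first blank
after a field rather than past it.  The names skip_fields() and
same_after_skip() are hypothetical and not part of uniq.c; the sketch
is not the committed code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: skip nfields blank-separated fields.
 * The returned pointer rests on the first blank after the last
 * skipped field, so that blank still takes part in the comparison.
 */
const char *
skip_fields(const char *str, int nfields)
{
	wchar_t wc;
	int len, field_started;

	assert(str != NULL);
	for (; nfields > 0 && *str != '\0'; nfields--) {
		/* Skip one field, including preceding blanks. */
		for (field_started = 0; *str != '\0'; str += len) {
			len = mbtowc(&wc, str, MB_CUR_MAX);
			if (len == -1) {
				/* Invalid byte: reset state, treat as one char. */
				(void)mbtowc(NULL, NULL, 0);
				wc = L'?';
				len = 1;
			}
			if (iswblank(wc)) {
				if (field_started)
					break;	/* leaves str on the blank */
			} else
				field_started = 1;
		}
	}
	return str;
}

/* Hypothetical helper: do two lines compare equal after skipping? */
int
same_after_skip(const char *a, const char *b, int nfields)
{
	return strcmp(skip_fields(a, nfields), skip_fields(b, nfields)) == 0;
}
```

With one field skipped, "1 t" leaves " t" and "2\tt" leaves "\tt", so
same_after_skip() reports them as different, whereas "a b" and "x b"
both leave " b" and compare equal.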

So, here is a version that is both correcter and moar short.

Coding is hard, let's go shopping.
  Ingo


Index: uniq.1
===================================================================
RCS file: /cvs/src/usr.bin/uniq/uniq.1,v
retrieving revision 1.17
diff -u -p -r1.17 uniq.1
--- uniq.1      3 Sep 2010 11:09:29 -0000       1.17
+++ uniq.1      11 Dec 2015 11:10:47 -0000
@@ -114,6 +114,14 @@ A file name of
 .Ql -
 denotes the standard input or the standard output
 .Pq depending on its position on the command line .
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character set
+.Xr locale 1 .
+Determines which groups of bytes are treated as characters
+and which characters are considered blank.
+.El
 .Sh EXIT STATUS
 .Ex -std uniq
 .Sh SEE ALSO
Index: uniq.c
===================================================================
RCS file: /cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.23
diff -u -p -r1.23 uniq.c
--- uniq.c      2 Nov 2015 20:25:42 -0000       1.23
+++ uniq.c      11 Dec 2015 11:10:47 -0000
@@ -37,10 +37,13 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
+#include <wctype.h>
 
 #define        MAXLINELEN      (8 * 1024)
 
@@ -61,6 +64,8 @@ main(int argc, char *argv[])
        int ch;
        char *prevline, *thisline;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio rpath wpath cpath", NULL) == -1)
                err(1, "pledge");
 
@@ -176,16 +181,32 @@ show(FILE *ofp, char *str)
 char *
 skip(char *str)
 {
+       wchar_t wc;
        int nchars, nfields;
+       int len;
+       int field_started;
 
        for (nfields = numfields; nfields && *str; nfields--) {
-               while (isblank((unsigned char)*str))
-                       str++;
-               while (*str && !isblank((unsigned char)*str))
-                       str++;
+               /* Skip one field, including preceding blanks. */
+               for (field_started = 0; *str != '\0'; str += len) {
+                       if ((len = mbtowc(&wc, str, MB_CUR_MAX)) == -1) {
+                               (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                               wc = L'?';
+                               len = 1;
+                       }
+                       if (iswblank(wc)) {
+                               if (field_started)
+                                       break;
+                       } else
+                               field_started = 1;
+               }
        }
-       for (nchars = numchars; nchars-- && *str && *str != '\n'; ++str)
-               ;
+
+       /* Skip some additional characters. */
+       for (nchars = numchars; nchars-- && *str != '\0'; str += len)
+               if ((len = mblen(str, MB_CUR_MAX)) == -1)
+                       len = 1;
+
        return (str);
 }
