POSIX.1-2008 added a new sentence to the description of uniq:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html

> [...] The trailing <newline> of each line in the input shall be
> ignored when doing comparisons.

It comes from this interpretation:

https://collaboration.opengroup.org/austin/interps/documents/14355/AI-133.txt

POSIX.1-2001 doesn't have this sentence:

https://pubs.opengroup.org/onlinepubs/009695399/utilities/uniq.html

This distinction changes behavior slightly.  My interpretation is that
the last line in the input can be a duplicate of the penultimate line
even if said last line is missing the terminating newline byte.
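
To make that interpretation concrete, here is a minimal sketch of a
comparison that ignores a single trailing newline.  The helpers
chomped_len() and line_equal() are hypothetical names used only for
illustration; they are not part of uniq.c, and the patch below takes a
different route (it strips the newline right after getline(3)).

```c
#include <string.h>

/*
 * Hypothetical helper, not part of uniq.c: length of s, not counting
 * one trailing newline if present.
 */
static size_t
chomped_len(const char *s)
{
	size_t len = strlen(s);

	if (len > 0 && s[len - 1] == '\n')
		len--;
	return len;
}

/*
 * Hypothetical helper: return 1 if a and b are equal once a single
 * trailing newline on either side is ignored, otherwise 0.
 */
static int
line_equal(const char *a, const char *b)
{
	size_t alen = chomped_len(a), blen = chomped_len(b);

	return alen == blen && memcmp(a, b, alen) == 0;
}
```

With these, line_equal("line\n", "line") is true, while a plain
strcmp(3) of the same two buffers reports a difference.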

Consider a practical example.  Here is -current uniq(1):

$ printf "line\nline" | uniq -c
   1 line
   1 line$ 

(Note that we do not append the missing newline to the output, so my
shell prompt is printed on the same line.)

Here is uniq(1) with my patch:

$ printf "line\nline" | obj/uniq -c
   2 line

Do we want this?

On the one hand, it is intuitive that two buffers are not literally
the same if one has a newline and the other does not.  strcmp(3)
agrees with this.

... On the other hand, it also seems intuitive that two records are
the same even if one record doesn't have a record delimiter.  You
could argue that this makes the utility more flexible in a small
corner case.

Thoughts?

Of note is that we already claim conformance to POSIX.1-2008 in the
uniq.1 manpage even though we are missing this behavior.  This is the
only major behavior change introduced between POSIX.1-2001 and
POSIX.1-2008.  If we decide not to add this behavior I think we should
dial back the standard quoted in the manpage to POSIX.1-2001.

FWIW, GNU uniq implements this behavior:

$ printf "line\nline" | guniq
      2 line

As does FreeBSD, since 2002, although they don't mention it in their
manpage.  Here's the commit:

https://cgit.freebsd.org/src/commit/usr.bin/uniq/uniq.c?id=4e774f7fbe7c154f101f90c115b702780920ebcb

Index: uniq.c
===================================================================
RCS file: /cvs/src/usr.bin/uniq/uniq.c,v
retrieving revision 1.28
diff -u -p -r1.28 uniq.c
--- uniq.c      1 Nov 2021 23:20:35 -0000       1.28
+++ uniq.c      2 Nov 2021 01:16:52 -0000
@@ -49,7 +49,7 @@ int cflag, dflag, iflag, uflag;
 int numchars, numfields, repeats;
 
 FILE   *file(char *, char *);
-void    show(FILE *, char *);
+void    show(FILE *, char *, int);
 char   *skip(char *);
 void    obsolete(char *[]);
 __dead void    usage(void);
@@ -60,7 +60,8 @@ main(int argc, char *argv[])
        char *prevline, *t1, *t2, *thisline;
        FILE *ifp = NULL, *ofp = NULL;
        size_t prevsize, thissize, tmpsize;
-       int ch;
+       ssize_t len;
+       int ch, prevnl, thisnl;
 
        setlocale(LC_CTYPE, "");
 
@@ -133,16 +134,22 @@ main(int argc, char *argv[])
 
        prevsize = 0;
        prevline = NULL;
-       if (getline(&prevline, &prevsize, ifp) == -1) {
+       if ((len = getline(&prevline, &prevsize, ifp)) == -1) {
                free(prevline);
                if (ferror(ifp))
                        err(1, "getline");
                exit(0);
        }
+       prevnl = prevline[len - 1] == '\n';
+       if (prevnl)
+               prevline[len - 1] = '\0';
        
        thissize = 0;
        thisline = NULL;
-       while (getline(&thisline, &thissize, ifp) != -1) {
+       while ((len = getline(&thisline, &thissize, ifp)) != -1) {
+               thisnl = thisline[len - 1] == '\n';
+               if (thisnl)
+                       thisline[len - 1] = '\0';
                /* If requested get the chosen fields + character offsets. */
                if (numfields || numchars) {
                        t1 = skip(thisline);
@@ -154,13 +161,14 @@ main(int argc, char *argv[])
 
                /* If different, print; set previous to new value. */
                if ((iflag ? strcasecmp : strcmp)(t1, t2)) {
-                       show(ofp, prevline);
+                       show(ofp, prevline, prevnl);
                        t1 = prevline;
                        prevline = thisline;
                        thisline = t1;
                        tmpsize = prevsize;
                        prevsize = thissize;
                        thissize = tmpsize;
+                       prevnl = thisnl;
                        repeats = 0;
                } else
                        ++repeats;
@@ -169,7 +177,7 @@ main(int argc, char *argv[])
        if (ferror(ifp))
                err(1, "getline");
 
-       show(ofp, prevline);
+       show(ofp, prevline, prevnl);
        free(prevline);
 
        exit(0);
@@ -181,13 +189,15 @@ main(int argc, char *argv[])
  *     of the line.
  */
 void
-show(FILE *ofp, char *str)
+show(FILE *ofp, char *str, int newline)
 {
        if ((dflag && repeats) || (uflag && !repeats)) {
                if (cflag)
                        (void)fprintf(ofp, "%4d %s", repeats + 1, str);
                else
                        (void)fprintf(ofp, "%s", str);
+               if (newline)
+                       fputc('\n', ofp);
        }
 }
 
