Michael Musenbrock wrote:
> sort produces wrong output with combined '-u' and '-n' switch, if input lines 
> have an preceding '\' character.

I think you have a misunderstanding of what is happening here.

> # echo -e "\2\n\1\n\2" | sort -u -n

First let's see what echo -e produces for the backslash sequences
given.  I assume you are using the bash shell; different shells will
produce different results.  When using escape sequences it is always
better to use the 'printf' utility for portability.

  echo -e "\2\n\1\n\2" | od -tx1 -c
    0000000  5c  32  0a  5c  31  0a  5c  32  0a
              \   2  \n   \   1  \n   \   2  \n
  echo -e "\2\n\1\n\2"
    \2
    \1
    \2

Good.  So now we know that the first character on each line will be a
backslash.  Those are not valid escape sequences, so echo has no
interpretation for them and passes them through unchanged.

This is a rather scary unportable construct.  Other implementations
may do something different.
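
As a rough sketch of the more portable approach: with a %s format,
printf writes its arguments verbatim, so the backslashes reach the
pipe without depending on how a particular echo treats unknown
escapes.  The helper name gen_input is only for illustration.

```shell
# Hypothetical helper: emit the three test lines byte-for-byte.
# With %s, printf does not interpret backslashes in its arguments.
gen_input() {
  printf '%s\n' '\2' '\1' '\2'
}

# Dump the bytes to confirm each line starts with a backslash (0x5c).
gen_input | od -An -tx1
```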

> expected output:
>  \1
>  \2

That would be incorrect.

> actual output:
>  \2

That is the correct output.  You have forgotten that the -n option
interprets each line as numeric.  Here is what the sort -n docs say:

‘-n’
‘--numeric-sort’
‘--sort=numeric’
     Sort numerically.  The number begins each line and consists of
     optional blanks, an optional ‘-’ sign, and zero or more digits
     possibly separated by thousands separators, optionally followed by
     a decimal-point character and zero or more digits.  An empty number
     is treated as ‘0’.  The ‘LC_NUMERIC’ locale specifies the
     decimal-point character and thousands separator.  By default a
     blank is a space or a tab, but the ‘LC_CTYPE’ locale can change
     this.

     Comparison is exact; there is no rounding error.

     Neither a leading ‘+’ nor exponential notation is recognized.  To
     compare such strings numerically, use the ‘--general-numeric-sort’
     (‘-g’) option.

When interpreted numerically "\1" is zero and "\2" is also zero
because the numeric part of the string is empty.  Therefore both "\1"
and "\2" compare numerically as the same value.  Both are zero.
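
A small sketch that makes this visible by counting how many lines
survive -u -n.  The variable names are mine, and LC_ALL=C is set only
to keep the run independent of the local locale.

```shell
# Both lines have an empty numeric part, so -n compares them as 0 and 0.
# With -u only one line from each set of equal keys is kept.
count=$(printf '%s\n' '\2' '\1' '\2' | LC_ALL=C sort -u -n | wc -l)
echo "lines kept with leading backslash: $count"

# Lines that really start with digits are kept apart as expected.
count_digits=$(printf '%s\n' '2' '1' '2' | LC_ALL=C sort -u -n | wc -l)
echo "lines kept with plain digits: $count_digits"
```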

A recent addition to sort is the --debug option.  Using --debug helps
to find these problems.  Using it here shows:

  echo -e "\2\n\1\n\2" | sort --debug -u -n
  sort: using simple byte comparison
  \2
  ^ no match for key

> I tried it with different local settings, all produces the same output.

That sounds correct.  But there are many locales and I am not familiar
with all of them.

> # echo -e "\2\n\1\n\2" | sort -u | sort -n
> and
> # echo -e "\2\n\1\n\2" | sort -n | sort -u
> procude an correct output.

Agreed.  Both of those produce correct output.
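
The difference is in what -u compares.  On its own, sort -u compares
whole lines, so only the exact duplicate is dropped.  On its own,
sort -n keeps every line and falls back to a whole-line comparison
for lines whose numeric keys are equal.  A sketch of both pipelines,
with a hypothetical helper named input and LC_ALL=C assumed for a
predictable byte order:

```shell
# Hypothetical helper producing the three test lines verbatim.
input() { printf '%s\n' '\2' '\1' '\2'; }

# sort -u compares whole lines here, so only the duplicate \2 is dropped.
input | LC_ALL=C sort -u | LC_ALL=C sort -n

# sort -n keeps all three lines; sort -u then removes the exact duplicate.
input | LC_ALL=C sort -n | LC_ALL=C sort -u
```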

> Seems that the '\' may be interpreted as escape sequence somewhere.

Yes, but that isn't the problem in this case.  The problem is trying to
convert "\1" to a number and convert "\2" to a number.  I will guess
you are thinking they will convert to 1 and 2 respectively.  However
the non-digit character at the start of those strings prevents that
from being possible.

Most often this type of problem is seen when sorting using multiple
fields and not realizing that -k should be used to restrict to the
specific field.  In this case the -k option may be useful too.  I am
not sure what your overall task is but you can slice off the first
character from the sort field using the -k .C syntax.

  printf "%s\n" "\2" "\1" "\2" | sort -u -k1.2,1n
    \1
    \2

Here -u means unique, as you know.  The -k1.2,1n key starts in the
first field and stops at the end of the first field.  The 'n' modifier
treats that key numerically, and the .2 part starts the key at the
second character.  Here are the full docs.

‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),
     _inclusive_.

     In its simplest form POS specifies a field number (starting with
     1), with fields being separated by runs of blank characters, and by
     default those blanks being included in the comparison at the start
     of each field.  To adjust the handling of blank characters see the
     ‘-b’ and ‘-t’ options.

     More generally, each POS has the form ‘F[.C][OPTS]’, where F is the
     number of the field to use, and C is the number of the first
     character from the beginning of the field.  Fields and character
     positions are numbered starting with 1; a character position of
     zero in POS2 indicates the field’s last character.  If ‘.C’ is
     omitted from POS1, it defaults to 1 (the beginning of the field);
     if omitted from POS2, it defaults to 0 (the end of the field).
     OPTS are ordering options, allowing individual keys to be sorted
     according to different rules; see below for details.  Keys can span
     multiple fields.

     Example: To sort on the second field, use ‘--key=2,2’ (‘-k 2,2’).
     See below for more notes on keys and more examples.  See also the
     ‘--debug’ option to help determine the part of the line being used
     in the sort.
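
To illustrate the F.C form with values where numeric order and byte
order differ, here is a minimal sketch.  The function name
sort_after_backslash and the sample values are made up for the
example, and it assumes every line starts with exactly one backslash.

```shell
# Hypothetical wrapper: key is field 1 from character 2 to the end of
# field 1, compared numerically; -u keeps one line per distinct key.
sort_after_backslash() {
  LC_ALL=C sort -u -k1.2,1n
}

# The key for '\10' is "10" and for '\9' is "9", so \9 sorts first.
printf '%s\n' '\10' '\9' '\10' | sort_after_backslash
```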

Hope this helps explain the behavior you are seeing.  I see no bug in
sort here.

Bob
