Michael Musenbrock wrote:
> sort produces wrong output with combined '-u' and '-n' switch, if input lines
> have an preceding '\' character.

I think you have a misunderstanding of what is happening here.

> # echo -e "\2\n\1\n\2" | sort -u -n

First let's see what echo -e produces for the backslash sequences
given.  I am assuming that you are using the bash shell, since
different shells will produce different results.  When using escape
sequences it is always better to use the 'printf' utility for
portability.

  echo -e "\2\n\1\n\2" | od -tx1 -c
  0000000  5c  32  0a  5c  31  0a  5c  32  0a
            \   2  \n   \   1  \n   \   2  \n

  echo -e "\2\n\1\n\2"
  \2
  \1
  \2

Good.  So now we know that the first character on each line will be a
backslash.  Those are not valid escape sequences, so echo has no
interpretation for them and passes them through unchanged.  This is a
rather scary unportable construct.  Other implementations may do
something different.

> expected output:
> \1
> \2

That would be incorrect.

> actual output:
> \2

That is the correct output.  You have forgotten that the -n option
interprets each line as numeric.  Here is what the sort -n docs say:

  ‘-n’
  ‘--numeric-sort’
  ‘--sort=numeric’
       Sort numerically.  The number begins each line and consists of
       optional blanks, an optional ‘-’ sign, and zero or more digits
       possibly separated by thousands separators, optionally followed
       by a decimal-point character and zero or more digits.  An empty
       number is treated as ‘0’.  The ‘LC_NUMERIC’ locale specifies the
       decimal-point character and thousands separator.  By default a
       blank is a space or a tab, but the ‘LC_CTYPE’ locale can change
       this.

       Comparison is exact; there is no rounding error.

       Neither a leading ‘+’ nor exponential notation is recognized.
       To compare such strings numerically, use the
       ‘--general-numeric-sort’ (‘-g’) option.

When interpreted numerically "\1" is zero and "\2" is also zero
because the numeric part of the string is empty.  Therefore both "\1"
and "\2" compare numerically as the same value.  Both are zero.  Since
all three lines compare equal, -u outputs only one of them.

A recent addition to sort is the --debug option.  Using --debug helps
to find these problems.
Using it here shows:

  echo -e "\2\n\1\n\2" | sort --debug -u -n
  sort: using simple byte comparison
  \2
  ^ no match for key

> I tried it with different local settings, all produces the same output.

That sounds correct.  But there are many locales and I am not familiar
with all of them.

> # echo -e "\2\n\1\n\2" | sort -u | sort -n
> and
> # echo -e "\2\n\1\n\2" | sort -n | sort -u
> procude an correct output.

Agreed.  Both of those produce correct output.

> Seems that the '\' may be interpreted as escape sequence somewhere.

Yes, but that isn't the problem in this case.  The problem is trying
to convert "\1" to a number and "\2" to a number.  I will guess you
are thinking they will convert to 1 and 2 respectively.  However, the
non-digit character at the start of those strings prevents that from
being possible.

Most often this type of problem is seen when sorting using multiple
fields and not realizing that -k should be used to restrict the
comparison to the specific field.  In this case the -k option may be
useful too.  I am not sure what your overall task is, but you can
slice off the first character from the sort field using the -k F.C
syntax.

  printf "%s\n" "\2" "\1" "\2" | sort -u -k1.2,1n
  \1
  \2

Here the -u is unique, as you know.  The -k1.2,1n starts the sort key
in the first field and ends it in the first field.  The 'n' option
treats that key numerically.  The .2 part starts the key at the second
character of the field.  Here are the full docs.

  ‘-k POS1[,POS2]’
  ‘--key=POS1[,POS2]’
       Specify a sort field that consists of the part of the line
       between POS1 and POS2 (or the end of the line, if POS2 is
       omitted), _inclusive_.

       In its simplest form POS specifies a field number (starting
       with 1), with fields being separated by runs of blank
       characters, and by default those blanks being included in the
       comparison at the start of each field.  To adjust the handling
       of blank characters see the ‘-b’ and ‘-t’ options.

       More generally, each POS has the form ‘F[.C][OPTS]’, where F is
       the number of the field to use, and C is the number of the
       first character from the beginning of the field.  Fields and
       character positions are numbered starting with 1; a character
       position of zero in POS2 indicates the field’s last character.
       If ‘.C’ is omitted from POS1, it defaults to 1 (the beginning
       of the field); if omitted from POS2, it defaults to 0 (the end
       of the field).  OPTS are ordering options, allowing individual
       keys to be sorted according to different rules; see below for
       details.  Keys can span multiple fields.

       Example: To sort on the second field, use ‘--key=2,2’
       (‘-k 2,2’).  See below for more notes on keys and more
       examples.  See also the ‘--debug’ option to help determine the
       part of the line being used in the sort.

Hope this helps explain the behavior you are seeing.  I see no bug in
sort here.

Bob
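
P.S.  If you want to check the two behaviors side by side, here is a
minimal sketch.  It assumes GNU sort and a POSIX shell, and the
variable names 'whole' and 'sliced' are only illustrative.  It uses
printf with single-quoted arguments so that nothing depends on how a
particular echo treats backslashes.

```shell
# Sketch only: assumes GNU sort and a POSIX shell.
# printf '%s\n' prints each argument literally, one per line.

# Whole-line numeric key: each line starts with a backslash, so its
# numeric part is empty and every line compares as 0.  -u then keeps
# a single line from that run of equal lines.
whole=$(printf '%s\n' '\2' '\1' '\2' | sort -u -n)

# Key restricted to field 1 starting at character 2: the backslash is
# skipped, so the digits are what gets compared numerically.
sliced=$(printf '%s\n' '\2' '\1' '\2' | sort -u -k1.2,1n)

printf 'whole-line key:\n%s\n' "$whole"
printf 'key from character 2:\n%s\n' "$sliced"
```

With GNU sort the first result is a single line and the second is the
two lines \1 and \2, matching the outputs shown earlier in this
message.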