2019-12-24 12:16:41 -0500, Eli Schwartz:
[...]
> > Also note that sort -u and sort | uniq are not quite the same, the -u
> > option only considers the key fields when deciding which records (lines)
> > are unique (of course, with no key options, the whole line is the key,
> > in which case they are more or less the same).
>
> Hmm, is that "more or less" the same, or actually the same? Seems like
> it would be actually the same... in which case I'd rephrase it to say
> "sort -u can do things that uniq can't, because it takes an optional key
> modifier". (uniq does have -s / -f which is sort of kind of partially
> approaching the same thing, but eh, doesn't really count.)
[...]

It depends on the implementation.

sort is meant to compare strings with strcoll(), that is, as per the
locale's collation rules. If two strings collate the same, some
implementations will resort to a strcmp()-type comparison, some won't.

So sort -u will report the first (or possibly last) of any sequence of
lines that sort the same, and whether that first one is the one with
the lowest byte values or not will depend on the implementation.

uniq itself is meant to do a byte-to-byte comparison instead of
strcoll(), so sort | uniq on an input that contains different strings
that collate the same could very well give random results.

sort | uniq with a POSIX uniq would only work correctly with sort
implementations that resort to a byte-to-byte comparison for lines
that collate the same.

GNU uniq does use strcoll() instead of strcmp(). As such, it's not
POSIX compliant, but that means that sort -u works the same as
sort | uniq there.

As an example, here on a GNU system (glibc 2.30, coreutils 8.30) and in
the en_GB.UTF-8 locale:

$ perl -C -le 'no warnings; print chr$_ for 0..0xd7ff, 0xe000..0x10ffff' | wc -l
1112065
$ perl -C -le 'no warnings; print chr$_ for 0..0xd7ff, 0xe000..0x10ffff' | sort -u | wc -l
50714

That is, out of the 1M+ possible characters in Unicode, GNU sort only
considers 50k to be distinct. Note that it used to be a lot worse than
that.

$ cat c
🧝
🧜
🧙
🧛
🧝
🧚
$ u < c
U+1F9DD ELF
U+1F9DC MERPERSON
U+1F9D9 MAGE
U+1F9DB VAMPIRE
U+1F9DD ELF
U+1F9DA FAIRY
$ sort -u c
🧝
$ sort c | uniq
🧝

Those characters have not been assigned any sort order, and end up
sorting the same. The GNU sort algorithm is "stable" in that it keeps
the original order for lines that have equal sort keys. So here, we get
the elf because it happens to be the first in the input.

You can see GNU uniq is not POSIX as there are 5 different lines in the
input but it returns only one. Even if it were POSIX, it would fail to
remove the duplicate Elf as the two Elves are not adjacent.
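
To see that adjacency point in isolation, here is a minimal sketch run entirely in the C locale, so that collation doesn't come into play. It uses ASCII stand-ins for the six lines of c; the sample and count helpers are made up for this illustration and are not part of any tool.

```shell
# Hypothetical helper: the six lines of "c", spelled out in ASCII.
sample() { printf '%s\n' elf merperson mage vampire elf fairy; }

# Hypothetical helper: count lines on stdin (tr strips the padding
# that some wc implementations add).
count() { wc -l | tr -d ' '; }

# uniq only collapses *adjacent* identical lines: the two elves are
# not adjacent here, so nothing is removed.
sample | LC_ALL=C uniq | count                  # 6

# Sorting by byte value first brings the elves together.
sample | LC_ALL=C sort | LC_ALL=C uniq | count  # 5

# In the C locale, sort -u gives the same count.
sample | LC_ALL=C sort -u | count               # 5
```

In the C locale two lines compare equal only if they are byte-for-byte identical, which is why both pipelines agree there.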

Now, let's look at the heirloom toolchest tools:

$ sort c
🧝
🧚
🧛
🧝
🧙
🧜
$ sort -u c
🧜
$ sort c | uniq
🧝
🧚
🧛
🧝
🧙
🧜

That sort is not stable, so we get those lines with identical sort
order in some arbitrary order. Its uniq, which here is POSIX compliant,
failed to remove the duplicate Elf as the Elves were not adjacent.

Since the 2018 edition of the standard, it's recommended that locales
that don't have a @ in their name should have a total ordering of all
characters and that sort/ls/globs (globs being the only thing on topic
here)... should do a last-resort strcmp()-like comparison for lines
that collate the same. The next major release will make it a
requirement. See:

http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963
http://austingroupbugs.net/view.php?id=1070

For now, use:

- sort -u to get one of each set of lines that sort the same (which
  one is unspecified)
- LC_ALL=C sort -u or LC_ALL=C sort | LC_ALL=C uniq to get unique
  lines (sorted by byte value)
- LC_ALL=C sort -u | sort to get unique lines sorted as per the
  collation's sort order (note that the order may not be deterministic
  for lines that collate equally)

sort | uniq itself can't be used reliably outside the C locale.

For the record, that "u" was:

u() {
  perl -Mcharnames=full -Mopen=locale -lne '
    printf "U+%04X %s\n", ord($_), charnames::viacode(ord($_)) for /./g' "$@"
}

-- 
Stephane