Re: Unicode range and enumeration support.

Stephane Chazelas Wed, 25 Dec 2019 08:20:45 -0800

2019-12-24 12:16:41 -0500, Eli Schwartz:
[...]
> > Also note that sort -u and sort | uniq are not quite the same, the -u
> > option only considers the key fields when deciding which records (lines)
> > are unique (of course, with no key options, the whole line is the key,
> > in which case they are more or less the same).
> 
> Hmm, is that "more or less" the same, or actually the same? Seems like
> it would be actually the same... in which case I'd rephrase it to say
> "sort -u can do things that uniq can't, because it takes an optional key
> modifier". (uniq does have -s / -f which is sort of kind of partially
> approaching the same thing, but eh, doesn't really count.)
[...]


It depends on the implementation.

sort is meant to compare strings with strcoll(), that is as per
the locale's collation rules. If two strings collate the same,
some implementations will resort to a strcmp()-type comparison,
some won't.

So sort -u will report the first (or possibly last) of any
sequence of lines that sort the same, and whether that first one
is the one with the lowest byte values or not will depend on the
implementation.

uniq itself is meant to do a byte-to-byte comparison instead of
strcoll(), so sort | uniq on an input that contains different
strings that collate the same could very well leave duplicates
in the output, as identical lines are then not guaranteed to
end up adjacent. sort | uniq with a POSIX uniq would only work
correctly with sort implementations that resort to a
byte-to-byte comparison for lines that collate the same.

GNU uniq does use strcoll() instead of strcmp(). As such, it's
not POSIX compliant, but that means that sort -u works the same
as sort | uniq there.

As an example, here on a GNU system (glibc 2.30, coreutils 8.30)
and in the en_GB.UTF-8 locale:

$ perl -C -le 'no warnings; print chr$_ for 0..0xd7ff, 0xe000..0x10ffff' | wc -l
1112065
$ perl -C -le 'no warnings; print chr$_ for 0..0xd7ff, 0xe000..0x10ffff' | sort -u | wc -l
50714

That is, out of the 1.1M non-surrogate code points in Unicode,
GNU sort only considers about 50k to be distinct. Note that it
used to be a lot worse than that.
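
To run the same kind of check on arbitrary input, something like
the sketch below could be used. The name collation_collisions is
made up for illustration, and it assumes mktemp is available
(common, but not required by POSIX). It reports how many
byte-wise distinct lines sort -u drops in the current locale:

```shell
# Hypothetical helper, not part of any standard tool.
# Prints the number of byte-wise distinct lines that "sort -u"
# discards in the current locale because they collate equally.
collation_collisions() {
  tmp=$(mktemp) || return 1
  cat -- "$@" > "$tmp"                      # files, or stdin if no arguments

  bytewise=$(LC_ALL=C sort -u < "$tmp" | wc -l)  # truly distinct lines
  collated=$(sort -u < "$tmp" | wc -l)           # distinct per the locale

  rm -f -- "$tmp"
  echo "$((bytewise - collated))"
}
```

A result of 0 means sort -u lost nothing on that input in that
locale; it says nothing about other inputs or other locales.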

$ cat c           
🧝
🧜
🧙
🧛
🧝
🧚
$ u < c
U+1F9DD ELF
U+1F9DC MERPERSON
U+1F9D9 MAGE
U+1F9DB VAMPIRE
U+1F9DD ELF
U+1F9DA FAIRY
$ sort -u c
🧝
$ sort c | uniq
🧝


Those characters have not been assigned any sort order, and end
up sorting the same. 

The GNU sort algorithm is "stable" in that it keeps the original
order for lines that have equal sort keys. So here, we get the
Elf because it happens to be the first in the input.

You can see GNU uniq is not POSIX-compliant, as there are 5
different lines in the input but it returns only one. Even if it
were, it would fail to remove the duplicate Elf, as the two
Elves are not adjacent.
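
The adjacency point can be checked without any special
characters. A minimal sketch, using made-up ASCII input and
LC_ALL=C so that the result does not depend on the locale:

```shell
# Hypothetical sample input: "elf" appears twice, not adjacently.
sample() { printf '%s\n' elf merperson mage elf; }

# uniq alone only compares neighbouring lines, so both "elf" lines survive.
unsorted_count=$(sample | LC_ALL=C uniq | wc -l)

# After sorting, the duplicates are adjacent and uniq collapses them.
sorted_count=$(sample | LC_ALL=C sort | LC_ALL=C uniq | wc -l)

# $((...)) strips any padding some wc implementations add.
echo "uniq alone: $((unsorted_count)) lines, sort | uniq: $((sorted_count)) lines"
# prints: uniq alone: 4 lines, sort | uniq: 3 lines
```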

Now, let's look at the heirloom toolchest tools:

$ sort c    
🧝
🧚
🧛
🧝
🧙
🧜
$ sort -u c
🧜
$ sort c | uniq
🧝
🧚
🧛
🧝
🧙
🧜


That sort is not stable, so we get some random order on those
lines with identical sort order. Its uniq, which here is POSIX
compliant, failed to remove the duplicate Elf as the Elves were
not adjacent.

Since the 2018 edition of the standard, it's recommended that
locales that don't have a @ in their name should have a total
ordering of all characters and that sort/ls/globs (globs being
the only thing on topic here)... should do a last-resort
strcmp()-like comparison for lines that collate the same.

The next major release will make it a requirement.

See:

http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963
http://austingroupbugs.net/view.php?id=1070

For now, use:

- sort -u
  to get one of each set of lines that sort the same (which one
  is unspecified)
- LC_ALL=C sort -u
  or
  LC_ALL=C sort | LC_ALL=C uniq
  to get unique lines (sorted by byte value)
- LC_ALL=C sort -u | sort
  to get unique lines sorted as per the collation's sort order
  (note that the order may not be deterministic for lines that
  collate equally)
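
The last two recipes could be wrapped as shell functions, for
instance as below. The names uniq_bytes and uniq_collated are
made up for illustration:

```shell
# Hypothetical helper names, not part of any standard tool.

# Unique lines, compared and ordered by byte value.
uniq_bytes() {
  LC_ALL=C sort -u -- "$@"
}

# Unique lines (byte-wise), then reordered per the current
# locale's collation. Lines that collate equally may come out
# in any relative order.
uniq_collated() {
  LC_ALL=C sort -u -- "$@" | sort
}
```

Both read the named files, or stdin when called without
arguments, e.g. uniq_collated < c.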

sort | uniq itself can't be used reliably outside the C locale.

For the record, that "u" was:

u() {
  perl -Mcharnames=full -Mopen=locale -lne '
    printf "U+%04X %s\n", ord($_), charnames::viacode(ord($_)) for /./g' "$@"
}

-- 
Stephane
