On Feb 17, 2011, at 4:30 PM, Eric Blake wrote:
On 02/17/2011 01:46 PM, Bob Harris wrote:
Howdy,
(note: I know I should give you version information with this, but (1) I
am not sure that this message will be read by anyone, and (2) I think
the problem probably transcends versions. If I get a response and the
actual version is important, I will take the time to find it.)
Thanks for the report, and you are correct that your issue transcends
versions. However, if you use coreutils 8.6 or newer (the latest is
8.10), then the new --debug option would have helped you.
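For reference, a minimal sketch of what invoking --debug looks like,
assuming GNU sort from coreutils 8.6 or newer is on the PATH. The guard
and the variable names keys and debug_out are illustrative, not from the
original report. The exact wording of the diagnostic differs between
versions, so no sample output is shown.

```shell
# Illustrative input: the two keys from the report.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# --debug is a GNU extension (coreutils 8.6+), so probe for it first.
if sort --debug </dev/null >/dev/null 2>&1; then
    # Merge stderr into stdout so the note about which collation rules
    # are in effect is captured together with the annotated lines.
    debug_out=$(printf '%s\n' "$keys" | sort --debug 2>&1)
else
    debug_out='sort --debug is not supported by this sort'
fi
printf '%s\n' "$debug_out"
```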
I have a file of genomic short sequence info in which it so happens that
two of my sort key values are similar. The two keys are
HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
As you can see, these are identical if one removes the colons.
Which sounds like exactly what sort does when you are sorting in the
en_US.UTF-8 locale.
I have tried several different options but none seem to work. -d seems
to be the default, and it has the behavior indicated above. -n fails
completely. -g also fails. Reading the man page, I don't see any other
options to control the comparison function.
Then you missed this part (in the sort man page, which is in turn
generated from 'sort --help'):
*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.
I understand *why* -d considers these two keys equal. What I don't
understand is why there is no option that says "order them
lexicographically".
That option is your set of locale-specific environment variables. Why
it's not an explicit option is due to historical accident (that's the
way POSIX specified it). Maybe GNU sort should add a
--collate-locale=... option as an extension that overrides LC_ALL, but
that seems a bit like bloat, and doesn't buy much over using the
standardized means of choosing collation sequencing.
Is there a hidden sort option that will do what I need?
Yep - try 'LC_ALL=C sort ...' to see the difference.
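A minimal sketch of that suggestion, using the two keys quoted above as
input. The function name sort_bytes is made up for illustration. Under
LC_ALL=C the comparison is byte by byte, so the colons are significant:
the keys first differ right after ':5:2', where '1' (0x31) is lower
than ':' (0x3A).

```shell
# sort_bytes is a hypothetical helper: sort stdin by raw byte values.
sort_bytes() {
    LC_ALL=C sort
}

sorted=$(printf '%s\n' \
    'HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1' \
    'HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1' | sort_bytes)

# The 5:21 key is printed first, the 5:2 key second.
printf '%s\n' "$sorted"
```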
I'm pretty sure I'm not the first person to run into this problem.
You're not. It's a FAQ:
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
--
Eric Blake +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Thanks Eric, for the informative reply, and the FAQ link.
That makes sense (in the sense that I can see how to correct it).
Currently my LC_ALL is set to nothing. Some additional googling after
sending my previous message revealed that this is also affected by
LANG (my default is en_US.UTF-8 and LANG=C gives the desired result).
I was in the process of investigating what else I might break by
fiddling with LANG when your message arrived.
So I'll investigate LC_ALL instead, and see if there are potentially
any negative side effects, so I'll (hopefully) know what trade off I
am making (if any).
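
On the question of side effects, a small sketch of two ways to keep the
change narrow. Both rely on the standard locale precedence (LC_ALL
overrides the individual LC_* categories, which in turn override LANG).
The sample input lines and variable names are invented for illustration.
Prefixing the assignment to the command scopes it to that one sort
process, and LC_COLLATE alone changes only the ordering rules, provided
LC_ALL is unset.

```shell
# Remember the shell's own LC_ALL so we can see it is left alone.
before=${LC_ALL-unset}

# 1. Per-command override: only this sort process sees LC_ALL=C.
one=$(printf 'a:2\na1\n' | LC_ALL=C sort)

# 2. Collation only: LC_ALL would take precedence over LC_COLLATE,
#    so unset it inside a subshell before setting LC_COLLATE.
two=$(printf 'a:2\na1\n' | (unset LC_ALL; LC_COLLATE=C sort))

after=${LC_ALL-unset}
printf '%s\n' "$one" "$two" "LC_ALL before: $before, after: $after"
```

Both pipelines order a1 ahead of a:2, since '1' is a lower byte than
':', and the value of LC_ALL in the calling shell is the same before
and after.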
Thanks again,
Bob H
P.S. You're right, I missed the warning in the man page. I was
diligently looking through the options for one that would do what I
needed, and didn't realize there were other descriptive notes below
the options.