Matthew Woehlke <[EMAIL PROTECTED]> writes:

> Is there an efficient implementation of 'sort | uniq -c | sort -n'? I
> have a 4 GB core file I want to run 'strings' on, and the above is
> really slow.
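For readers who haven't used it, the pipeline in question builds a frequency histogram of its input lines, rarest first. A small demonstration (using printf to stand in for the 'strings' output):

```shell
# sort groups identical lines together, uniq -c counts each group,
# and the final sort -n orders the groups by count.
printf 'foo\nbar\nfoo\nfoo\nbar\nbaz\n' | sort | uniq -c | sort -n
```

The bottleneck on a 4 GB input is the first 'sort', which must order the entire stream before any counting can happen.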
See Jon Bentley, Don Knuth, Doug McIlroy, "Programming pearls: a literate program", CACM 29, 6 (June 1986), 471-483 <http://doi.acm.org/10.1145/5948.315654>. Source code is included. I think Knuth's solution will run rings around all the solutions proposed so far, if you tune it right (see below).

> Is there a way already in coreutils to do this? If not, would there
> be any interest in adding such a method?

I dunno, it sounds pretty specialized. Though there may be some interest in a combined "sort | uniq -c", I wouldn't think there'd be any interest in combining all three.

Philip Rowlands <[EMAIL PROTECTED]> writes:

> A fundamentally more efficient approach would be something like:
>
> perl -lne '$bucket{$_}++; END { foreach $key (keys %bucket) { print
> "$bucket{$key} $key" } }' | \
> sort -n

Hmm, well, I tried that, and the Perl approach was tons slower on my box (Debian stable, 2.4 GHz P4, 1 GB RAM). The problem was that perl grabbed too much memory and started thrashing. The naive 'sort' pipeline was much faster since it did not thrash.

Assuming there are a lot of duplicates, if you want to speed up the naive 'sort' implementation you might try this:

  strings | sort -S 50% | uniq -c | sort -n

On my host this causes the first 'sort' to take about 150% of the CPU time of the 'strings' invocation. The 'uniq' and second 'sort' take about 100% each. So the total CPU time is about 4.5x the time it takes to run 'strings'. Your mileage may vary of course.

_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
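P.S. The single-pass counting idea from the quoted perl one-liner can also be written in awk, which may carry less per-value memory overhead. A sketch, assuming the core file is named 'core'; note it keeps the same hash-table-in-memory design, so it will thrash just like perl if the number of *distinct* strings is very large:

```shell
# One pass: hash each line to a count, then emit "count line"
# pairs and sort them numerically. Memory use grows with the
# number of distinct strings, not the total input size.
strings core | awk '{ count[$0]++ } END { for (s in count) print count[s], s }' | sort -n
```

Whether this beats the tuned 'sort -S 50%' pipeline depends entirely on the duplicate ratio and on how much of the distinct-string set fits in RAM.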
