Matthew Woehlke <[EMAIL PROTECTED]> writes:

> Is there an efficient implementation of 'sort | uniq -c | sort -n'? I
> have a 4 GB core file I want to run 'strings' on, and the above is
> really slow.
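For readers who haven't used it, the pipeline in question builds a frequency histogram of its input lines, rarest first. A small demonstration (using printf to stand in for the 'strings' output):

```shell
# sort groups identical lines together, uniq -c counts each group,
# and the final sort -n orders the groups by count.
printf 'foo\nbar\nfoo\nfoo\nbar\nbaz\n' | sort | uniq -c | sort -n
```

The bottleneck on a 4 GB input is the first 'sort', which must order the entire stream before any counting can happen.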
See Jon Bentley, Don Knuth, Doug McIlroy, "Programming pearls: a literate program", CACM 29, 6 (June 1986), 471-483 <http://doi.acm.org/10.1145/5948.315654>. Source code is included. I think Knuth's solution will run rings around all the solutions proposed so far, if you tune it right (see below).

> Is there a way already in coreutils to do this? If not, would there
> be any interest in adding such a method?

I dunno, it sounds pretty specialized. Though there may be some interest in a combined "sort | uniq -c", I wouldn't think there'd be any interest in combining all three.

Philip Rowlands <[EMAIL PROTECTED]> writes:

> A fundamentally more efficient approach would be something like:
>
> perl -lne '$bucket{$_}++; END { foreach $key (keys %bucket) { print
> "$bucket{$key} $key" } }' | \
> sort -n

Hmm, well, I tried that, and the Perl approach was tons slower on my box (Debian stable, 2.4 GHz P4, 1 GB RAM). The problem was that perl grabbed too much memory and started thrashing. The naive 'sort' pipeline was much faster since it did not thrash.

Assuming there are a lot of duplicates, if you want to speed up the naive 'sort' implementation you might try this:

  strings | sort -S 50% | uniq -c | sort -n

On my host this causes the first 'sort' to take about 150% of the CPU time of the 'strings' invocation. The 'uniq' and second 'sort' take about 100% each. So the total CPU time is about 4.5x the time it takes to run 'strings'. Your mileage may vary of course.

_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
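P.S. The single-pass counting idea from the quoted perl one-liner can also be written in awk, which may carry less per-value memory overhead. A sketch, assuming the core file is named 'core'; note it keeps the same hash-table-in-memory design, so it will thrash just like perl if the number of *distinct* strings is very large:

```shell
# One pass: hash each line to a count, then emit "count line"
# pairs and sort them numerically. Memory use grows with the
# number of distinct strings, not the total input size.
strings core | awk '{ count[$0]++ } END { for (s in count) print count[s], s }' | sort -n
```

Whether this beats the tuned 'sort -S 50%' pipeline depends entirely on the duplicate ratio and on how much of the distinct-string set fits in RAM.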
