On Sun, Jul 13, 2008 at 10:05:23AM +0100, j t wrote:
> I might be butting up against the edge of what's theoretically
> possible ("computer science"-wise) but I think that my requirements
> have something to do with lossless compression algorithms. Perhaps I
> should start reading the source code for gzip/bzip2...?
You're on the right track here, at least for getting as far as detecting maximal-length identical strings. One correction, though: the part of gzip that finds repeated strings is the LZ77 dictionary-matching stage of DEFLATE; the Huffman coding stage that follows it only assigns shorter bit codes to more frequent symbols, so LZ77 is the piece you want to study. Another place to look would be search indexing algorithms. I used to know a guy who'd done graduate work in that area and, from talking to him about it, it sounded like finding repeated substrings is one of their key techniques.

Although, if you're just looking for identical log entries (rather than arbitrary repeated segments in freeform text), using awk/sed to strip out the timestamps and then feeding the result through `sort | uniq -cd` should handle that case; there's a sketch below. (There are already standard log analysis packages which do essentially this, but I can't think of any names at the moment.)
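For concreteness, here's a minimal sketch of that pipeline. It assumes syslog-style lines that begin with a "Mon DD HH:MM:SS" timestamp; the sed expression and the /var/log/syslog path are placeholders, so adjust both to match your actual log format:

    # Strip the leading "Jul 13 10:05:23 " style timestamp, then count
    # duplicates: uniq -cd prints only repeated lines, prefixed by count.
    sed 's/^[A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] //' \
        /var/log/syslog | sort | uniq -cd | sort -rn

The trailing `sort -rn` just puts the most frequent entries first; drop it if you don't care about ordering.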