Hmm, Otis, very nice!

Koji
Otis Gospodnetic wrote:
Hi,

Wouldn't this be as easy as:

- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with an identical signature, you know you've found the "banner"

I'd do this in a pre-processing phase.

You may have to add special logic for ">" and other email-quoting characters. Perhaps you can make use of the assumption that banners always come at the end of emails. Perhaps you can make use of situations where the banner appears multiple times in a single email (the one with lots of back-and-forth replies, for example).

This is similar to MoreLikeThis on paragraph level.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
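For illustration, the pre-processing steps listed above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not code from the thread: paragraphs are taken to be separated by blank lines, quoting is assumed to use leading ">" characters only, the signature is an exact MD5 over whitespace-normalized lowercase text (not the fuzzier signature from SOLR-799), and the function names and the `min_emails` threshold are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: quoted lines start with one or more ">" markers.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+\s?")


def split_paragraphs(email_text):
    """Strip leading quote markers, then split on blank lines."""
    lines = [QUOTE_PREFIX.sub("", line) for line in email_text.splitlines()]
    paragraphs = re.split(r"\n\s*\n", "\n".join(lines))
    return [p.strip() for p in paragraphs if p.strip()]


def signature(paragraph):
    """Exact signature: MD5 of the lowercased, whitespace-normalized text."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_banner_signatures(emails, min_emails=2):
    """Return signatures of paragraphs seen in at least `min_emails` emails.

    `emails` maps a (hypothetical) email id to its plain-text body.
    """
    seen_in = defaultdict(set)
    for email_id, text in emails.items():
        for paragraph in split_paragraphs(text):
            seen_in[signature(paragraph)].add(email_id)
    return {sig for sig, ids in seen_in.items() if len(ids) >= min_emails}


def strip_banners(email_text, banner_signatures):
    """Drop every paragraph whose signature was flagged as a banner."""
    kept = [
        p for p in split_paragraphs(email_text)
        if signature(p) not in banner_signatures
    ]
    return "\n\n".join(kept)
```

Note that ordinary quoted reply text also repeats across emails in a thread, so a sketch like this would flag it as well. Restricting candidates to trailing paragraphs, as suggested above, is one way to narrow that down.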