Hmm, Otis, very nice!

Koji

Otis Gospodnetic wrote:
Hi,

Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in 
SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with an identical signature, you know you've found the 
"banner"
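
A minimal sketch of the steps above, in Python. It assumes paragraphs are separated by blank lines and uses an exact MD5 over whitespace-normalized, lowercased text rather than a fuzzier signature. The function names (split_paragraphs, signature, find_banners, strip_banners) and the min_emails threshold are hypothetical, chosen for illustration only.

```python
import hashlib
import re
from collections import defaultdict


def split_paragraphs(body):
    """Split an email body into paragraphs on blank lines (an assumption)."""
    parts = re.split(r"\n\s*\n", body.strip())
    return [p.strip() for p in parts if p.strip()]


def signature(paragraph):
    """Exact signature: MD5 of the lowercased, whitespace-normalized paragraph."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_banners(emails, min_emails=2):
    """Return signatures of paragraphs seen in at least min_emails distinct emails.

    emails maps a (hypothetical) email id to its body text.
    """
    seen_in = defaultdict(set)
    for email_id, body in emails.items():
        for para in split_paragraphs(body):
            seen_in[signature(para)].add(email_id)
    return {sig for sig, ids in seen_in.items() if len(ids) >= min_emails}


def strip_banners(body, banner_sigs):
    """Drop every paragraph whose signature is a known banner signature."""
    kept = [p for p in split_paragraphs(body) if signature(p) not in banner_sigs]
    return "\n\n".join(kept)
```

With min_emails=2 any paragraph shared by two emails counts as a banner, which will also catch common greetings; a higher threshold is probably safer on a real corpus.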

I'd do this in a pre-processing phase.  You may have to add special logic for 
">" and other email-quoting characters.  Perhaps you can make use of the assumption 
that banners always come at the end of emails.  Perhaps you can also make use of 
situations where the banner appears multiple times in a single email (one with lots 
of back-and-forth replies, for example).
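
One possible way to handle the quoting characters and the end-of-email assumption, sketched in Python. It assumes quoting is done with leading ">" markers only, and that looking at the last two paragraphs is enough; the helper names (strip_quote_prefix, unquote, last_paragraphs) and the default n=2 are hypothetical.

```python
import re

# One or more leading ">" markers, each optionally preceded by whitespace,
# followed by at most one space. Other quoting styles are not handled.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+\s?")


def strip_quote_prefix(line):
    """Remove leading '>' quote markers from a single line."""
    return QUOTE_PREFIX.sub("", line)


def unquote(body):
    """Remove quote markers from every line so quoted banners hash the same."""
    return "\n".join(strip_quote_prefix(line) for line in body.splitlines())


def last_paragraphs(body, n=2):
    """Return the last n blank-line-separated paragraphs of a body."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", body.strip()) if p.strip()]
    return parts[-n:]
```

Running unquote before computing signatures should let a banner that was quoted in a reply match its unquoted original.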

This is similar to MoreLikeThis on the paragraph level.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
