Hmm, Otis, very nice!
Koji
Otis Gospodnetic wrote:
Hi,
Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in
SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with
X
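The first three steps listed above could be sketched roughly as follows. This is only an illustration in Python, not code from SOLR-799: it uses exact MD5 rather than a fuzzy signature, and the function names, the blank-line paragraph split, and the lowercase/whitespace normalization are all assumptions made for the example. The truncated last step is deliberately left out.

```python
import hashlib
import re
from collections import defaultdict


def split_paragraphs(body):
    """Split an email body into paragraphs on blank lines (assumed convention)."""
    parts = re.split(r"\n\s*\n", body)
    return [p.strip() for p in parts if p.strip()]


def paragraph_signature(paragraph):
    """MD5 of the lowercased, whitespace-normalized paragraph (assumed normalization)."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_signature_index(emails):
    """Map each paragraph signature to the set of email ids that contain it."""
    index = defaultdict(set)
    for email_id, body in emails.items():
        for paragraph in split_paragraphs(body):
            index[paragraph_signature(paragraph)].add(email_id)
    return index


def emails_sharing_paragraphs(email_id, emails, index):
    """Return ids of other emails sharing at least one paragraph signature."""
    shared = set()
    for paragraph in split_paragraphs(emails[email_id]):
        shared |= index[paragraph_signature(paragraph)]
    shared.discard(email_id)
    return shared
```

Here `emails` is assumed to be a plain dict of email id to body text. Quoted replies that prefix lines with ">" would need those markers stripped before hashing, which this sketch does not do.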
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 11:05:40 PM
Subject: Re: Word Locations & Search Components
Basically I'm working on the Enron dataset, and I've already de-duplicated
the collection and applied a spam filter. All the e-mails after this have
been parsed to
I think you essentially have to do much of the same work either
way, so take whatever comes easiest. Personally, I think
that pre-processing the data (and using two fields) would be
easiest, but it's up to you.
Using a custom analyzer would involve collecting all the contents,
deciding what is "re
Basically I'm working on the Enron dataset, and I've already de-duplicated
the collection and applied a spam filter. All the e-mails after this have
been parsed to XML and each field (so To, From, Date etc) has been
separated, along with one large field for the remaining e-mail content
(called Con
I would go for a business logic solution and not a Solr customization in
this case, as you need to filter information that you actually would like to
see in different fields in your index.
Did you already try to split the email into several fields like subject,
from, to, content, signature, etc.?
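As a rough illustration of that kind of pre-processing, here is a small Python sketch using the standard library `email` module. The function name and the returned field names are hypothetical, and it assumes a plain-text (non-multipart) message whose signature follows the conventional "-- " delimiter line; real mail often does not follow that convention.

```python
from email import message_from_string


def split_email_fields(raw):
    """Split a raw plain-text email into separate fields for indexing."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages are out of scope for this sketch.
        body = ""
    # Assumption: the signature starts after a line containing only "-- ".
    content, _, signature = body.partition("\n-- \n")
    return {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "content": content.strip(),
        "signature": signature.strip(),
    }
```

Each key of the returned dict could then be sent to its own field in the index, so that searches on the content field do not match text in the signature.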
On Feb 15, 2009, at 10:33 PM, Johnny X wrote:
Hi there,
I was told before that I'd need to create a custom search component to do
what I want to do, but I'm thinking it might actually be a custom analyzer.
Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
fie