Re: Word Locations & Search Components

2009-02-17 Thread Koji Sekiguchi
Hmm, Otis, very nice! Koji
Otis Gospodnetic wrote:
Hi, Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with
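The first three steps Otis lists could be sketched roughly as follows. This is only an illustration, not code from the thread: the function names, the blank-line paragraph split, and the lowercase/whitespace normalization are all assumptions, and exact MD5 is used rather than the fuzzier signature mentioned for SOLR-799. The last step is cut off in the archive snippet, so it is not covered here.

```python
import hashlib
import re
from collections import defaultdict


def paragraph_signatures(body):
    """Split a body on blank lines and return one MD5 hex digest per paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    # Assumption: lowercase and collapse whitespace so re-wrapped text hashes the same.
    normalized = [" ".join(p.lower().split()) for p in paragraphs]
    return [hashlib.md5(n.encode("utf-8")).hexdigest() for n in normalized]


def build_signature_index(emails):
    """Map each paragraph signature to the set of e-mail ids that contain it."""
    index = defaultdict(set)
    for email_id, body in emails.items():
        for sig in paragraph_signatures(body):
            index[sig].add(email_id)
    return index


def emails_sharing_paragraphs(email_id, emails, index):
    """Return ids of other e-mails that share at least one paragraph signature."""
    others = set()
    for sig in paragraph_signatures(emails[email_id]):
        others |= index.get(sig, set()) - {email_id}
    return others


# Hypothetical sample data, keyed by a made-up e-mail id.
emails = {
    "a": "Can we meet on Friday?\n\nThanks,\nJohn",
    "b": "Sure, Friday works.\n\nCan we meet on  Friday?",
    "c": "Unrelated note.",
}
index = build_signature_index(emails)
shared_with_a = emails_sharing_paragraphs("a", emails, index)
```

Quoted text in real mail usually carries a "> " prefix, which this sketch does not strip; that would need to happen before hashing for quoted paragraphs to match their originals.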

Re: Word Locations & Search Components

2009-02-16 Thread Otis Gospodnetic
X
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 11:05:40 PM
Subject: Re: Word Locations & Search Components
Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to

Re: Word Locations & Search Components

2009-02-16 Thread Erick Erickson
I think you essentially have to do much of the same work either way, so take whatever comes easiest. Personally, I think that pre-processing the data (and using two fields) would be easiest, but it's up to you. Using a custom analyzer would involve collecting all the contents, deciding what is "re

Re: Word Locations & Search Components

2009-02-16 Thread Johnny X
Basically I'm working on the Enron dataset, and I've already de-duplicated the collection and applied a spam filter. All the e-mails after this have been parsed to XML and each field (so To, From, Date etc) has been separated, along with one large field for the remaining e-mail content (called Con

Re: Word Locations & Search Components

2009-02-16 Thread Alexander Ramos Jardim
I would go for a business logic solution and not a Solr customization in this case, as you need to filter information that you actually would like to see in different fields on your index. Did you already try to split the email into several fields like subject, from, to, content, signature, etc etc
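A minimal sketch of that kind of pre-processing, using Python's standard `email` module, might look like the following. It is an assumption-laden illustration rather than anything proposed in the thread: the function name `split_email` is made up, the message is assumed to be single-part plain text, and the signature is assumed to follow the common "-- " delimiter line, which many real messages do not use.

```python
from email import message_from_string


def split_email(raw):
    """Split a raw plain-text message into separate fields for indexing."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Assumption: only single-part plain-text messages are handled here.
        raise ValueError("multipart messages are not handled in this sketch")
    body = msg.get_payload()
    # Assumption: the signature starts after a line containing only "-- ".
    content, _, signature = body.partition("\n-- \n")
    return {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "content": content.strip(),
        "signature": signature.strip(),
    }


# Hypothetical message used only to exercise the function.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "Can we meet on Friday?\n"
    "-- \n"
    "Alice\n"
)
fields = split_email(raw_message)
```

Each key of the returned dictionary could then be sent to its own field in the index, so that queries can target the content without matching the signature.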

Re: Word Locations & Search Components

2009-02-16 Thread Grant Ingersoll
On Feb 15, 2009, at 10:33 PM, Johnny X wrote:
Hi there, I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer. Basically, I'm indexing e-mail in XML in Solr and searching the 'content' fie