On Feb 15, 2009, at 10:33 PM, Johnny X wrote:


Hi there,


I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer.

Basically, I'm indexing e-mail in XML in Solr and searching the 'content' field, which is parsed as 'text'.

I want to ignore certain elements of the e-mail (e.g. corporate banners), but also identify e-mails whose actual content includes corporate information.

To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a window of roughly 100 words to determine whether that passage is a banner, and then exclude it from indexing.

At the same time I need to do the opposite: identify, in a similar manner, which e-mails include corporate information in their actual content.

I suppose that if I do this, I don't want the processed text to be what a search returns, because then it presumably won't be the full e-mail. Do I need some kind of copy field that keeps the full e-mail, fully indexed, to be returned instead?

Storage and indexing are separate things in Lucene/Solr. Setting the Field as stored keeps the original text, so there is no need for a copy field for this particular issue.



Can what I'm suggesting be done and can anyone direct me to a guide?

Hmm, this kind of stuff may be better off as part of preprocessing, but it could be done as an analyzer, I suppose. How are you determining the words to evaluate? Is it based on collection statistics or just within a document? Or do you just have a list of "marker" words that indicate the areas of interest? Do you need to keep track of anything beyond the life of one document being analyzed?

If you were doing this as an analyzer, you would need to buffer the tokens internally so that you could examine them in a window, and then decide which tokens to output. I believe the RemoveDuplicatesTokenFilter demonstrates how to do this. Basically, you just need a List in which to hold the tokens while you check whether certain conditions are met.
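
The buffering idea can be sketched outside of Lucene, for example as a preprocessing step run before the text is sent to Solr. This is only a rough illustration, not a Lucene TokenFilter: the marker list, the window size, and the hit threshold below are all assumed values, and the function name is made up for the example. It drops every token that falls inside a window containing at least `min_hits` marker words, which is crude and may remove some real content near a banner.

```python
from collections import deque

# Hypothetical marker words and thresholds; tune these against real mail.
MARKERS = {"privileged", "confidential", "corporate", "disclaimer"}
WINDOW = 100   # tokens examined together
MIN_HITS = 3   # marker words needed to call a window a banner


def filter_banner_tokens(tokens, markers=MARKERS, window=WINDOW, min_hits=MIN_HITS):
    """Yield tokens, skipping any that sat in a window with >= min_hits markers."""
    buf = deque()  # each entry is [token, is_marker, drop_flag]
    hits = 0       # number of marker words currently in the buffer
    for tok in tokens:
        is_marker = tok.lower() in markers
        buf.append([tok, is_marker, False])
        hits += is_marker
        if len(buf) > window:
            # The oldest token leaves the window; its drop flag is now final.
            old_tok, old_marker, old_drop = buf.popleft()
            hits -= old_marker
            if not old_drop:
                yield old_tok
        if hits >= min_hits:
            # The current window looks like a banner: flag everything in it.
            for entry in buf:
                entry[2] = True
    # Flush whatever is still buffered at the end of the document.
    for tok, _, drop in buf:
        if not drop:
            yield tok
```

Usage would look like `kept = list(filter_banner_tokens(text.split()))`, with the original, unfiltered text still sent to a stored field so that search results return the full e-mail.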






On another note, is there an easy way to destroy an index...any custom code?

Send in a delete-by-query command with the *:* query, then commit.
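
As a rough sketch of that, the two XML update messages can be posted to Solr's update handler. The URL below is an assumption (the path used by the stock example setup); adjust host, port, and path to your install. The helper names are made up for this example.

```python
import urllib.request

# Assumed location of the update handler; change to match your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, url=SOLR_UPDATE_URL):
    """Wrap an XML update message in a POST request for the update handler."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def clear_index_requests(url=SOLR_UPDATE_URL):
    """Return the delete-everything request followed by the commit request."""
    return [
        build_update_request("<delete><query>*:*</query></delete>", url),
        build_update_request("<commit/>", url),
    ]


def send_all(requests):
    """Send each request in order; raises on an HTTP error response."""
    for req in requests:
        with urllib.request.urlopen(req) as resp:
            resp.read()

# To actually wipe the index (destructive!):
#   send_all(clear_index_requests())
```

The delete only becomes visible to searchers after the commit, which is why the two messages are sent in that order.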




Thanks for any help!



--
View this message in context: 
http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
Sent from the Solr - User mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
