Re: Word Locations & Search Components

Grant Ingersoll Mon, 16 Feb 2009 05:21:04 -0800


On Feb 15, 2009, at 10:33 PM, Johnny X wrote:

Hi there,
I was told before that I'd need to create a custom search component to do what I want to do, but I'm thinking it might actually be a custom analyzer.
Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
field which is parsed as 'text'.
I want to ignore certain elements of the e-mail (i.e. corporate banners), but also identify the actual content of those e-mails including corporate
information.
To identify the banners I need something a little more developed than a stop word list. I need to evaluate the frequency of certain words around words like 'privileged' and 'corporate' within a word window of about 100ish words
to determine whether they're banners and then remove them from being
indexed.
I need to do the opposite during the same time to identify, in a similar manner, which e-mails include corporate information in their actual content.
I suppose if I'm doing this I don't want what's processed to be indexed as what's returned in a search, because then presumably it won't be the full e-mail, so do I need to store some kind of copy field that keeps the full
e-mail and is fully indexed to be returned instead?

Storage and indexing are separate things in Lucene/Solr, so setting the Field as stored will keep the original, so no need for a copy field for this particular issue.



Can what I'm suggesting be done and can anyone direct me to a guide?

Hmm, this kind of stuff may be better off as part of preprocessing, but it could be done as an analyzer, I suppose. How are you determining the words to evaluate? Is it based on collection statistics or just within a document? Or do you just have a list of "marker" words that indicate the areas of interest? Do you need to keep track of anything beyond the life of one document being analyzed?

If you were doing this as an analyzer, you would need to buffer the tokens internally so that you could examine them in a window, and then make a decision as to what tokens to output. I believe the RemoveDuplicatesTokenFilter demonstrates how to do this. Basically, you just need a List to store the tokens in if you see certain conditions met.
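
To make the buffering idea concrete, here is a rough sketch of the same logic done as preprocessing in plain Python rather than as a Lucene TokenFilter. It is not Solr or Lucene API code. The marker words come from the question; the BANNER_WORDS list, the min_hits threshold, and the function name strip_banners are made up for illustration and would need tuning against real mail.

```python
import re

# Marker words named in the question; a hit on one of these opens a window.
MARKER_WORDS = {"privileged", "corporate"}

# Hypothetical banner vocabulary, an assumption for this sketch only.
BANNER_WORDS = {"privileged", "corporate", "confidential",
                "intended", "recipient", "disclaimer"}


def strip_banners(text, window=100, min_hits=4):
    """Return text with banner-like windows removed.

    All tokens are buffered in a list. For each marker word, the tokens
    within roughly `window` words around it are examined, and the whole
    window is dropped if it holds at least `min_hits` banner words.
    """
    tokens = re.findall(r"\S+", text)
    # Lowercase and strip punctuation for matching; keep originals for output.
    normalized = [re.sub(r"\W+", "", t).lower() for t in tokens]
    drop = [False] * len(tokens)
    half = window // 2

    for i, word in enumerate(normalized):
        if word not in MARKER_WORDS:
            continue
        start = max(0, i - half)
        end = min(len(tokens), i + half + 1)
        hits = sum(1 for w in normalized[start:end] if w in BANNER_WORDS)
        if hits >= min_hits:
            for j in range(start, end):
                drop[j] = True

    return " ".join(t for t, d in zip(tokens, drop) if not d)
```

This is deliberately coarse: it throws away the whole window, so real content that sits close to a banner goes with it. Running the opposite check (marker words present but few banner words nearby) would flag mails that mention corporate information in their actual content.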

On another note, is there an easy way to destroy an index...any custom code?


Send in a delete by query command with the *:* query.
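
As a sketch of what that looks like on the wire, the snippet below builds (but does not send) the two XML update messages: the delete by query, then a commit, which is typically needed before the deletion becomes visible to searches. The update URL is an assumption for a default local install and the helper name build_delete_all_requests is hypothetical; adjust both for your setup.

```python
import urllib.request


def build_delete_all_requests(update_url):
    """Build POST requests that delete every document, then commit."""
    bodies = [
        "<delete><query>*:*</query></delete>",  # match all documents
        "<commit/>",                            # make the deletion visible
    ]
    return [
        urllib.request.Request(
            update_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
            method="POST",
        )
        for body in bodies
    ]


# Assumed URL of the update handler; nothing is sent until urlopen is called.
reqs = build_delete_all_requests("http://localhost:8983/solr/update")
# for req in reqs:
#     urllib.request.urlopen(req)
```

The urlopen loop is left commented out on purpose, since running it wipes the index.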




Thanks for any help!



--
View this message in context: 
http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
Sent from the Solr - User mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
