On Feb 15, 2009, at 10:33 PM, Johnny X wrote:
Hi there,
I was told before that I'd need to create a custom search component to
do what I want to do, but I'm thinking it might actually be a custom
analyzer.
Basically, I'm indexing e-mail in XML in Solr and searching the
'content' field, which is parsed as 'text'.
I want to ignore certain elements of the e-mail (e.g. corporate
banners), but also identify the actual content of those e-mails,
including corporate information.
To identify the banners I need something a little more developed than a
stop word list. I need to evaluate the frequency of certain words around
words like 'privileged' and 'corporate' within a window of about 100
words to determine whether they're banners, and then exclude them from
the index.
I need to do the opposite at the same time to identify, in a similar
manner, which e-mails include corporate information in their actual
content.
I suppose if I'm doing this I don't want what's processed to be indexed
as what's returned in a search, because then presumably it won't be the
full e-mail, so do I need to store some kind of copy field that keeps
the full e-mail and is fully indexed to be returned instead?
Storage and indexing are separate things in Lucene/Solr. The stored
value is the original text, whatever the analyzer does to the indexed
tokens, so setting the Field as stored will keep the original. No need
for a copy field for this particular issue.
Can what I'm suggesting be done and can anyone direct me to a guide?
Hmm, this kind of stuff may be better off as part of preprocessing,
but it could be done as an analyzer, I suppose. How are you
determining the words to evaluate? Is it based on collection
statistics or just within a document? Or do you just have a list of
"marker" words that indicate the areas of interest? Do you need to
keep track of anything beyond the life of one document being analyzed?
If you were doing this as an analyzer, you would need to buffer the
tokens internally so that you could examine them in a window, and then
make a decision as to what tokens to output. I believe the
RemoveDuplicatesTokenFilter demonstrates how to do this. Basically,
you just need a List to store the tokens in if you see certain
conditions met.
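As a rough illustration of the buffering idea, here is a sketch in plain
Java. It is not the Lucene TokenFilter API, and the class name, marker
list, window size, and hit threshold are all made-up assumptions. It
buffers the tokens in a List, looks at a window around each marker word,
and drops the whole window when enough marker words fall inside it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: plain Java over a token list, not a Lucene TokenFilter.
class BannerWindowSketch {
    // Assumed marker words; a real list would come from your own mail corpus.
    static final Set<String> MARKERS =
            new HashSet<>(Arrays.asList("privileged", "corporate", "confidential"));

    /**
     * Drops every token within `window` positions of a marker word when that
     * window contains at least `minHits` marker words. Tokens are assumed to
     * be lowercased already.
     */
    static List<String> filter(List<String> tokens, int window, int minHits) {
        boolean[] drop = new boolean[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            if (!MARKERS.contains(tokens.get(i))) {
                continue;
            }
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.size(), i + window + 1);
            int hits = 0;
            for (int j = start; j < end; j++) {
                if (MARKERS.contains(tokens.get(j))) {
                    hits++;
                }
            }
            if (hits >= minHits) {
                Arrays.fill(drop, start, end, true); // mark the window as banner text
            }
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!drop[i]) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("meeting", "moved", "to", "friday",
                "this", "message", "is", "privileged", "and", "confidential");
        System.out.println(filter(tokens, 2, 2));
    }
}
```

In a real analyzer the same decision would be made inside the filter
before any buffered tokens are emitted. Flipping the test (keep the
window instead of dropping it) would be one way to flag e-mails whose
body discusses corporate information.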
On another note, is there an easy way to destroy an index...any custom
code?
Send in a delete by query command with the *:* query, then commit so
the deletion becomes visible.
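For example, a minimal sketch of posting the delete and then a commit to
the XML update handler. It assumes Java 11+ for java.net.http and the
stock example URL http://localhost:8983/solr/update; the class and
method names are just for illustration, so adjust all of it for your
install:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: posts Solr XML update messages over HTTP.
class DeleteAllSketch {
    static final String DELETE_ALL = "<delete><query>*:*</query></delete>";
    static final String COMMIT = "<commit/>";

    // Builds a POST request carrying one XML update message.
    static HttpRequest updateRequest(String solrUpdateUrl, String xmlBody) {
        return HttpRequest.newBuilder(URI.create(solrUpdateUrl))
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(xmlBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default URL; pass your own update URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8983/solr/update";
        HttpClient client = HttpClient.newHttpClient();
        // Delete everything first, then commit.
        for (String body : new String[] {DELETE_ALL, COMMIT}) {
            HttpResponse<String> response =
                    client.send(updateRequest(url, body), HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }
}
```

This wipes every document in the index, so point it at the right URL.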
Thanks for any help!
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/