Hello all Solrians. I'm fairly new to Solr, having only played with it for about a month now. I'm working with the Solr 4.0.0-Alpha release, trying to figure out a proper approach to an indexing problem, but the methods I've come up with are not panning out. Below I describe the problem and my three attempts at solving it. I hope someone here has had similar issues and solved them, or can tell me that my current approaches are no good. :)
Problem: I have a dataset that consists of email-type documents. From these documents I need to extract certain tokens, attach meta information to each token, and then make the tokens searchable by that attached meta information. If it works, I could search the index for tokens that appeared in documents created in a certain date range, or by any other metadata attached to the tokens like this. Basically: each document on disk => X extracted token-based documents in the index.

Attempt one: As a starting point I first used a PatternTokenizer to get the tokens that I want, so each indexed document now has a multivalued field of tokens. I then wrote a TokenFilter that attaches the metadata to each token as a payload (a rough sketch is in the P.S. below). I tried searching by payload and discovered that it only works if I use the token itself as the search parameter (see the second sketch). Apparently searching by keywords in a token's payload is not implemented yet?

Attempt two: I read about UpdateRequestProcessors and processor chains, and tried writing a processor that would take in a document, check whether it has a field with my tokens in it (extracted using the TokenFilter from the first approach), and then hand out each token as a separate document to the next processor (third sketch below). I couldn't figure out how to do this; apparently once you call super.processAdd() it jumps to the next document, rather than letting me insert a new document based on the next token of the current document.

Attempt three: Use a Lucene IndexWriter directly from the custom UpdateRequestProcessor to write the created meta-token documents to a separate index (last sketch below). As a concept it should work, but how would this second index conform to its Solr schema if I write data to the index directly? I assume I would configure the index as a second core with its own schema and search parameters. Can Solr still query that index normally?

As you can see, I'm a little bit at a loss on how to implement this. Are all of the above approaches bad? Have I misunderstood one of them, and it should actually work? I could go back to basics and write a document processor in Python to do all the parsing, pattern matching, and token extraction outside of Solr and just feed it the documents to index, but this seems like something Solr should be able to do and I'm just not seeing The Right Way.

Regards,
Juha
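P.S. For concreteness, here is roughly what the TokenFilter from attempt one looks like. This is only a minimal sketch, assuming the BytesRef-based payload API of Lucene 4.0; metadataFor() is a made-up placeholder standing in for whatever actually produces the per-token metadata in my real filter:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class MetadataPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public MetadataPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Attach the metadata for this token as its payload.
    byte[] meta = metadataFor(termAtt.toString());
    payloadAtt.setPayload(new BytesRef(meta));
    return true;
  }

  // Placeholder: the real lookup derives metadata from the source document.
  private byte[] metadataFor(String token) {
    return token.getBytes();
  }
}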
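As far as I can tell, this is also why the search only worked with the token in hand: the payload queries I found, like PayloadTermQuery, are anchored on a Term, and the payload itself only feeds into scoring, so there seems to be no way to select documents by payload contents alone. Something like this (field name "tokens" is mine):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadQuerySketch {
  public static Query tokenQuery(String token) {
    // The Term is mandatory; the payload only influences the score.
    return new PayloadTermQuery(new Term("tokens", token),
                                new AveragePayloadFunction());
  }
}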
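Next, the kind of expansion I was trying for in attempt two. The idea (if I've understood the API right) is one super.processAdd() call per synthesized document; "id", "tokens" and "token" stand in for my real field names:

import java.io.IOException;
import java.util.Collection;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class TokenExpandingProcessor extends UpdateRequestProcessor {

  public TokenExpandingProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Collection<Object> tokens = doc.getFieldValues("tokens");
    if (tokens == null) {
      // Not one of my token-bearing documents; pass it through untouched.
      super.processAdd(cmd);
      return;
    }
    Object parentId = doc.getFieldValue("id");
    int i = 0;
    for (Object token : tokens) {
      // One new document per extracted token.
      SolrInputDocument tokenDoc = new SolrInputDocument();
      tokenDoc.setField("id", parentId + "_" + i++);
      tokenDoc.setField("token", token);
      // ...plus whatever per-token metadata fields apply...
      AddUpdateCommand tokenCmd = new AddUpdateCommand(cmd.getReq());
      tokenCmd.solrDoc = tokenDoc;
      super.processAdd(tokenCmd); // one call down the chain per token
    }
  }
}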
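Finally, the direct IndexWriter idea from attempt three. Writing it out is what made me worry about the schema question: the path, analyzer, and field flags below are all placeholders I picked by hand, and nothing in this code consults the second core's schema.xml:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SecondIndexSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path; it would really point at the second core's data dir.
    Directory dir = FSDirectory.open(new File("/path/to/second-core/data/index"));
    // Analyzer chosen by hand; nothing ties it to the core's schema.
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(dir, cfg);

    // Field flags are likewise hand-picked rather than schema-driven.
    FieldType keyword = new FieldType();
    keyword.setIndexed(true);
    keyword.setStored(true);
    keyword.setTokenized(false);

    Document d = new Document();
    d.add(new Field("token", "example-token", keyword));
    d.add(new Field("created", "2012-08-01", keyword));
    writer.addDocument(d);
    writer.close();
  }
}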