Hello all Solrians. I'm fairly new to Solr, having only played with it for about a month now. I'm working with the Solr 4.0.0-Alpha release, trying to figure out a proper approach to an indexing problem, but the methods I've come up with are not panning out. Below I describe the problem and my three attempts at solving it. I hope someone here has had similar issues and solved them, or can tell me that my current approaches are no good. :)
Problem: I have a dataset that consists of email-type documents. From these documents I need to extract certain tokens, attach meta information to each token, and then make the tokens searchable by that attached meta information. If it works, I could search the index for tokens that appeared in documents created in a certain date range, or by any other metadata attached to the tokens like this. Basically: each document on disk => X extracted token-based documents in the index.

Attempt one: As a starting point I first used a PatternTokenizer to get the tokens that I want, so each indexed document now has a multivalued field of tokens. I then wrote a TokenFilter that attaches the metadata to each token as a payload (a rough sketch is in the P.S. below). I tried searching by payload and discovered that it only works if I use the token itself as the search parameter (see the second sketch). Apparently searching by keywords in a token's payload is not implemented yet?

Attempt two: I read about UpdateRequestProcessors and processor chains, and tried writing a processor that would take in a document, check whether it has a field with my tokens in it (extracted using the TokenFilter from the first approach), and then hand out each token as a separate document to the next processor (third sketch below). I couldn't figure out how to do this; apparently once you call super.processAdd() it jumps to the next document, rather than letting me insert a new document based on the next token of the current document.

Attempt three: Use a Lucene IndexWriter directly from the custom UpdateRequestProcessor to write the created meta-token documents to a separate index (last sketch below). As a concept it should work, but how would this second index conform to its Solr schema if I write data to the index directly? I assume I would configure the index as a second core with its own schema and search parameters. Can Solr still query that index normally?

As you can see, I'm a little bit at a loss on how to implement this. Are all of the above approaches bad? Have I misunderstood one of them, and it should actually work? I could go back to basics and write a document processor in Python to do all the parsing, pattern matching, and token extraction outside of Solr and just feed it the documents to index, but this seems like something Solr should be able to do and I'm just not seeing The Right Way.

Regards,
Juha
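P.S. For concreteness, here is roughly what the TokenFilter from attempt one looks like. This is only a minimal sketch, assuming the BytesRef-based payload API of Lucene 4.0; metadataFor() is a made-up placeholder standing in for whatever actually produces the per-token metadata in my real filter:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class MetadataPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public MetadataPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Attach the metadata for this token as its payload.
    byte[] meta = metadataFor(termAtt.toString());
    payloadAtt.setPayload(new BytesRef(meta));
    return true;
  }

  // Placeholder: the real lookup derives metadata from the source document.
  private byte[] metadataFor(String token) {
    return token.getBytes();
  }
}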
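As far as I can tell, this is also why the search only worked with the token in hand: the payload queries I found, like PayloadTermQuery, are anchored on a Term, and the payload itself only feeds into scoring, so there seems to be no way to select documents by payload contents alone. Something like this (field name "tokens" is mine):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadQuerySketch {
  public static Query tokenQuery(String token) {
    // The Term is mandatory; the payload only influences the score.
    return new PayloadTermQuery(new Term("tokens", token),
                                new AveragePayloadFunction());
  }
}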
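Next, the kind of expansion I was trying for in attempt two. The idea (if I've understood the API right) is one super.processAdd() call per synthesized document; "id", "tokens" and "token" stand in for my real field names:

import java.io.IOException;
import java.util.Collection;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class TokenExpandingProcessor extends UpdateRequestProcessor {

  public TokenExpandingProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Collection<Object> tokens = doc.getFieldValues("tokens");
    if (tokens == null) {
      // Not one of my token-bearing documents; pass it through untouched.
      super.processAdd(cmd);
      return;
    }
    Object parentId = doc.getFieldValue("id");
    int i = 0;
    for (Object token : tokens) {
      // One new document per extracted token.
      SolrInputDocument tokenDoc = new SolrInputDocument();
      tokenDoc.setField("id", parentId + "_" + i++);
      tokenDoc.setField("token", token);
      // ...plus whatever per-token metadata fields apply...
      AddUpdateCommand tokenCmd = new AddUpdateCommand(cmd.getReq());
      tokenCmd.solrDoc = tokenDoc;
      super.processAdd(tokenCmd); // one call down the chain per token
    }
  }
}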
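Finally, the direct IndexWriter idea from attempt three. Writing it out is what made me worry about the schema question: the path, analyzer, and field flags below are all placeholders I picked by hand, and nothing in this code consults the second core's schema.xml:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SecondIndexSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path; it would really point at the second core's data dir.
    Directory dir = FSDirectory.open(new File("/path/to/second-core/data/index"));
    // Analyzer chosen by hand; nothing ties it to the core's schema.
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(dir, cfg);

    // Field flags are likewise hand-picked rather than schema-driven.
    FieldType keyword = new FieldType();
    keyword.setIndexed(true);
    keyword.setStored(true);
    keyword.setTokenized(false);

    Document d = new Document();
    d.add(new Field("token", "example-token", keyword));
    d.add(new Field("created", "2012-08-01", keyword));
    writer.addDocument(d);
    writer.close();
  }
}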