Hello,

First, some background.
I am indexing a multilingual document set in which individual documents can contain multiple languages. The language(s) within each document are known ahead of time. I have tried separate fields per language, but because of the poor query performance I'm seeing with that approach (many languages / fields), I'm trying to create a single multilingual field instead.

One approach to this problem is given in Section 14.6.4 <https://docs.google.com/a/basistech.com/file/d/0B3NlE_uL0pqwR0hGV0M1QXBmZm8/edit> of the new Solr In Action book. The approach is to take the document content field and prepend to it the list of languages it contains, followed by a special delimiter. A new field type is defined that maps languages to sub field types. The new type's tokenizer then runs all of the sub field type analyzers over the field, merges the results, adjusts offsets for the prepended data, etc.

Because of the tokenizer complexity this incurs, I'd like to pursue a more flexible approach: run the various language-specific analyzers based not on prepended codes but on other field values (i.e., a language field). I don't see a straightforward way to do this, mostly because a field analyzer doesn't have access to the rest of the document. On the flip side, an UpdateRequestProcessor would have access to the document but doesn't really give me a path to where I want to end up (a single field with different analyzers run dynamically).

Finally, my question: is it possible to cache a document's language(s) per thread during UpdateRequestProcessor execution (where we have access to the full document), so that the analyzer can then read from the cache to determine which analyzer(s) to run? More specifically, if a document is run through its URP chain on thread T, will its analyzer(s) also run on thread T, and will no other documents be run through the URP on that thread in the interim?

Thanks,
Dave
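
P.S. To make the question concrete, here is a minimal sketch of the kind of per-thread cache I have in mind. It is plain Java with no Solr dependencies, and every name in it (LanguageContext, set, get, clear) is hypothetical rather than an existing Solr API. It only works if the URP and the analyzer for a given document really do run on the same thread with no other document in between, which is exactly what I am unsure about.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical per-thread holder for the languages of the document that is
 * currently being indexed. Not part of Solr; all names are illustrative.
 */
final class LanguageContext {
    // One independent value per thread; defaults to an empty list.
    private static final ThreadLocal<List<String>> LANGUAGES =
        ThreadLocal.withInitial(() -> Collections.<String>emptyList());

    private LanguageContext() {}

    /** Would be called from the update processor, which sees the whole document. */
    static void set(List<String> languages) {
        // Defensive copy so later changes to the caller's list are not visible here.
        LANGUAGES.set(Collections.unmodifiableList(new ArrayList<>(languages)));
    }

    /** Would be called from the analyzer to decide which sub-analyzers to run. */
    static List<String> get() {
        return LANGUAGES.get();
    }

    /** Would be called once the document is done, so nothing leaks to the next document on this thread. */
    static void clear() {
        LANGUAGES.remove();
    }
}
```

The idea would be to call LanguageContext.set(...) in the processor before handing the document down the chain, read LanguageContext.get() in the analyzer, and call LanguageContext.clear() in a finally block afterwards. A value set on one thread is not visible on any other thread, so if analysis happens elsewhere the analyzer would just see an empty list.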