Hi David,

What version of the Solr in Action MEAP are you looking at? The current version is 12, version 13 is coming out later this week, and prior versions had significant bugs in the code you are referencing. I added an update processor in the most recent version that can do language identification and prepend the language codes for you (even removing them from the stored version of the field and only including them on the indexed version for text analysis).
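To make the convention concrete, here is a minimal sketch of the "languages|content" prefix format from chapter 14, as plain Java string handling. The class and method names are hypothetical; the real MultiTextField does this parsing inside its Tokenizer and also remaps token offsets to account for the stripped prefix.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the "en,fr|content" prefix convention from Solr in Action ch. 14.
 * Hypothetical helper names; the real field type parses this in its Tokenizer.
 */
public class LanguagePrefix {

    /** Prepend a comma-separated language list and the '|' delimiter. */
    public static String prepend(List<String> languages, String content) {
        return String.join(",", languages) + "|" + content;
    }

    /** Split a prefixed value back into its language list and raw content. */
    public static ParsedValue parse(String fieldValue) {
        int delim = fieldValue.indexOf('|');
        if (delim < 0) {
            // No prefix: fall back to the default (language-agnostic) sub-type.
            return new ParsedValue(List.of(), fieldValue);
        }
        List<String> langs =
            Arrays.asList(fieldValue.substring(0, delim).split(","));
        return new ParsedValue(langs, fieldValue.substring(delim + 1));
    }

    public record ParsedValue(List<String> languages, String content) {}

    public static void main(String[] args) {
        String indexed = prepend(List.of("en", "fr"), "blah, blah");
        System.out.println(indexed);            // en,fr|blah, blah
        ParsedValue parsed = parse(indexed);
        System.out.println(parsed.languages()); // [en, fr]
        System.out.println(parsed.content());   // blah, blah
    }
}
```

An update processor that knows the document's language field would call `prepend` before the field reaches analysis; the field type's tokenizer would call the equivalent of `parse` to pick the sub-field analyzers to run.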
You could easily modify this update processor to read the value from the language field and use it as the basis of the prepended languages. Otherwise, if you want to do language detection instead of passing in the language manually, the MultiTextField in chapter 14 of Solr in Action and the corresponding MultiTextFieldLanguageIdentifierUpdateProcessor should handle all of the language detection and prepending automatically for you (and also append the identified language to a separate field).

If it were easy/possible to have access to the rest of the fields in the document from within a field's Analyzer, then I would certainly have opted for that approach instead of prepending languages to the content. If that is too cumbersome, you could probably rewrite the MultiTextField to pull the languages from the field name instead of the content (i.e., <field name="myField|en,fr">blah, blah</field> instead of <field name="myField">en,fr|blah, blah</field> as currently designed). This would make specifying the languages much easier (especially at query time, since you only have to specify them once instead of on each term), and Solr could still search the same underlying field for all languages. Same general idea, though.

In terms of your ThreadLocal cache idea... that sounds really scary to me. An Analyzer's TokenStreamComponents are cached in a ThreadLocal depending on the internal ReuseStrategy, and I'm skeptical that you'll be able to pull this off cleanly. It would really be hacking around the Lucene APIs even if you were able to pull it off.

-Trey

On Mon, Oct 28, 2013 at 5:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Consider an update processor - it can operate on any field and has access
> to all fields.
>
> You could have one update processor to combine all the fields to process
> into a temporary, dummy field. Then run a language detection update
> processor on the combined field. Then process the results and place them
> in the desired field. And finally remove any temporary fields.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: David Anthony Troiano
> Sent: Monday, October 28, 2013 4:47 PM
> To: solr-user@lucene.apache.org
> Subject: Single multilingual field analyzed based on other field values
>
> Hello,
>
> First some background...
>
> I am indexing a multilingual document set where documents themselves can
> contain multiple languages. The language(s) within my documents are known
> ahead of time. I have tried separate fields per language, and due to the
> poor query performance I'm seeing with that approach (many languages /
> fields), I'm trying to create a single multilingual field.
>
> One approach to this problem is given in Section 14.6.4 of the new Solr
> in Action book
> (https://docs.google.com/a/basistech.com/file/d/0B3NlE_uL0pqwR0hGV0M1QXBmZm8/edit).
> The approach is to take the document content field and prepend it with
> the list of contained languages followed by a special delimiter. A new
> field type is defined that maps languages to sub field types, and the new
> type's tokenizer then runs all of the sub field type analyzers over the
> field and merges results, adjusts offsets for the prepended data, etc.
>
> Due to the tokenizer complexity incurred, I'd like to pursue a more
> flexible approach, which is to run the various language-specific
> analyzers not based on prepended codes, but instead based on other field
> values (i.e., a language field).
>
> I don't see a straightforward way to do this, mostly because a field
> analyzer doesn't have access to the rest of the document. On the flip
> side, an UpdateRequestProcessor would have access to the document but
> doesn't really give a path to wind up where I want to be (a single field
> with different analyzers run dynamically).
>
> Finally, my question: is it possible to thread-cache document language(s)
> during UpdateRequestProcessor execution (where we have access to the full
> document), so that the analyzer can then read from the cache to determine
> which analyzer(s) to run? More specifically, if a document is run through
> its URP chain on thread T, will its analyzer(s) also run on thread T, and
> will no other documents be run through the URP on that thread in the
> interim?
>
> Thanks,
> Dave
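The thread-affinity assumption in David's closing question is the crux of Trey's warning: even if the URP chain and the analyzer happen to share a thread in one Solr version, no API contract guarantees it. A toy demonstration (plain Java, no Solr involved) of how a ThreadLocal handoff silently fails the moment the reader runs on a different thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Toy illustration of the ThreadLocal-handoff hazard: a value stashed on
 * one thread is invisible on every other thread, so the scheme only works
 * if Solr happens to run the URP chain and the analyzer on the same
 * thread -- behavior no Solr/Lucene API contract promises.
 */
public class ThreadLocalHandoff {

    private static final ThreadLocal<String> DOC_LANGS = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        // The "URP" stashes the document languages on the current thread.
        DOC_LANGS.set("en,fr");

        // Same thread: the "analyzer" sees the value.
        System.out.println("same thread:  " + DOC_LANGS.get());

        // Different thread (e.g. a pooled indexing thread): value is gone.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        String seen = pool.submit(() -> String.valueOf(DOC_LANGS.get())).get();
        pool.shutdown();
        System.out.println("other thread: " + seen); // prints "null"
    }
}
```

Even on the same thread there is a second hazard Trey alludes to: the Analyzer may have already built and cached its TokenStreamComponents (per its ReuseStrategy) before the ThreadLocal value changes, so the "current" language may not be consulted at all.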
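Jack's combine-detect-cleanup chain can be sketched as a solrconfig.xml fragment using stock Solr factories. The field names (`title`, `content`, `langid_tmp`, `language`) and the chain name are examples, not anything from the book; `LangDetectLanguageIdentifierUpdateProcessorFactory`, `CloneFieldUpdateProcessorFactory`, and `IgnoreFieldUpdateProcessorFactory` all ship with Solr 4.x:

```xml
<updateRequestProcessorChain name="langid-combined">
  <!-- 1. Combine the fields to process into a temporary, dummy field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <arr name="source">
      <str>title</str>
      <str>content</str>
    </arr>
    <str name="dest">langid_tmp</str>
  </processor>
  <!-- 2. Run language detection on the combined field, writing the
          detected code(s) into the desired language field -->
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">langid_tmp</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">false</bool>
  </processor>
  <!-- 3. Remove the temporary field before the document is indexed -->
  <processor class="solr.IgnoreFieldUpdateProcessorFactory">
    <str name="fieldName">langid_tmp</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A custom processor inserted between steps 2 and 3 could then consume the detected `language` value, for example to prepend it to the content field in the `en,fr|...` format discussed above.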