Solr in Action has a significant discussion of the multi-lingual approach, and they also have some code samples available online. Might be worth a look.
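I don't have the book handy, but as I recall the approach it describes keeps everything in one core and gives each language its own field and analyzer chain, rather than one core per language. A rough sketch of what that could look like in schema.xml (the title_en/title_es names and the exact filter chains are just placeholders, not copied from the book):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_es.txt" ignoreCase="true"/>
      <filter class="solr.SpanishLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- one field per language; all languages share the same core -->
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="title_es" type="text_es" indexed="true" stored="true"/>

At query time you point qf (or the suggester, see the sketch below the quoted message) at the field that matches the request's locale, so each language keeps its own analysis without needing its own core.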
Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Mar 25, 2014 at 4:43 AM, Jeremy Thomerson <jer...@thomersonfamily.com> wrote:
> I recently deployed Solr to back the site search feature of a site I work
> on. The site itself is available in hundreds of languages. With the initial
> release of site search we have enabled the feature for ten of those
> languages. This is distributed across eight cores, with two Chinese
> languages plus Korean combined into one CJK core and each of the other
> seven languages in their own individual cores. The reason for splitting
> these into separate cores was so that we could have the same field names
> across all cores but have different configuration for analyzers, etc, per
> core.
>
> Now I have some questions on this approach.
>
> 1) Scalability: Considering I need to scale this to many dozens more
> languages, perhaps hundreds more, is there a better way so that I don't end
> up needing dozens or hundreds of cores? My initial plan was that many
> languages that didn't have special support within Solr would simply get
> lumped into a single "default" core that has some default analyzers that
> are applicable to the majority of languages.
>
> 1b) Related to this: is there a practical limit to the number of cores that
> can be run on one instance of Lucene?
>
> 2) Auto Suggest: In phase two I intend to add auto-suggestions as a user
> types a query. In reviewing how this is implemented and how the suggestion
> dictionary is built I have concerns. If I have more than one language in a
> single core (and I keep the same field name for suggestions on all
> languages within a core) then it seems that I could get suggestions from
> another language returned with a suggest query. Is there a way to build a
> separate dictionary for each language, but keep these languages within the
> same core?
>
> If it's helpful to know: I have a field in every core for "Locale". Values
> will be the locale of the language of that document, i.e. "en", "es",
> "zh_hans", etc. I'd like to be able to: 1) when building a suggestion
> dictionary, divide it into multiple dictionaries, grouping them by locale,
> and 2) supply a parameter to the suggest query that allows the suggest
> component to only return suggestions from the appropriate dictionary for
> that locale.
>
> If the answer to #1 is "keep splitting groups of languages that have
> different analyzers into their own cores" and the answer to #2 is "that's
> not supported", then I'd be curious: where would I start to write my own
> extension that supported #2? I looked last night at the suggest lookup
> classes, dictionary classes, etc. But I didn't see a clear point where it
> would be clean to implement something like I'm suggesting above.
>
> Best Regards,
> Jeremy Thomerson
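Regarding question 2 above: if you are on Solr 4.7 or later, the new SuggestComponent lets you declare several dictionaries inside one core and pick one per request with suggest.dictionary, which maps fairly directly onto the "one dictionary per locale" idea. A rough solrconfig.xml sketch, assuming you copyField each language's suggest text into per-locale fields like suggest_en/suggest_es (those field names are made up, and the text_en/text_es types are the placeholders from the schema sketch above):

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <!-- one suggester (dictionary) per locale; chosen at query time via suggest.dictionary -->
    <lst name="suggester">
      <str name="name">suggest_en</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
      <str name="field">suggest_en</str>
      <str name="suggestAnalyzerFieldType">text_en</str>
    </lst>
    <lst name="suggester">
      <str name="name">suggest_es</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
      <str name="field">suggest_es</str>
      <str name="suggestAnalyzerFieldType">text_es</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

The front end then passes the locale-appropriate dictionary name, e.g. /suggest?suggest.q=barc&suggest.dictionary=suggest_es, so suggestions should not cross languages even though everything lives in the same core. On older versions (the suggester built on the SpellCheckComponent) I don't think there is an equivalent, which is where a custom extension would come in.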