Solr in Action has a significant discussion of the multi-lingual approach, and they also have some code samples available online. Might be worth a look.
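I don't have the book handy, but as I recall the approach it describes keeps everything in one core and gives each language its own field and analyzer chain, rather than one core per language. A rough sketch of what that could look like in schema.xml (the title_en/title_es names and the exact filter chains are just placeholders, not copied from the book):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_es.txt" ignoreCase="true"/>
      <filter class="solr.SpanishLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- one field per language; all languages share the same core -->
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="title_es" type="text_es" indexed="true" stored="true"/>

At query time you point qf (or the suggester, see the sketch below the quoted message) at the field that matches the request's locale, so each language keeps its own analysis without needing its own core.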
Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Mar 25, 2014 at 4:43 AM, Jeremy Thomerson <jer...@thomersonfamily.com> wrote:
> I recently deployed Solr to back the site search feature of a site I work
> on. The site itself is available in hundreds of languages. With the initial
> release of site search we have enabled the feature for ten of those
> languages. This is distributed across eight cores, with two Chinese
> languages plus Korean combined into one CJK core and each of the other
> seven languages in their own individual cores. The reason for splitting
> these into separate cores was so that we could have the same field names
> across all cores but have different configuration for analyzers, etc, per
> core.
>
> Now I have some questions on this approach.
>
> 1) Scalability: Considering I need to scale this to many dozens more
> languages, perhaps hundreds more, is there a better way so that I don't end
> up needing dozens or hundreds of cores? My initial plan was that many
> languages that didn't have special support within Solr would simply get
> lumped into a single "default" core that has some default analyzers that
> are applicable to the majority of languages.
>
> 1b) Related to this: is there a practical limit to the number of cores that
> can be run on one instance of Lucene?
>
> 2) Auto Suggest: In phase two I intend to add auto-suggestions as a user
> types a query. In reviewing how this is implemented and how the suggestion
> dictionary is built I have concerns. If I have more than one language in a
> single core (and I keep the same field name for suggestions on all
> languages within a core) then it seems that I could get suggestions from
> another language returned with a suggest query. Is there a way to build a
> separate dictionary for each language, but keep these languages within the
> same core?
>
> If it's helpful to know: I have a field in every core for "Locale". Values
> will be the locale of the language of that document, i.e. "en", "es",
> "zh_hans", etc. I'd like to be able to: 1) when building a suggestion
> dictionary, divide it into multiple dictionaries, grouping them by locale,
> and 2) supply a parameter to the suggest query that allows the suggest
> component to only return suggestions from the appropriate dictionary for
> that locale.
>
> If the answer to #1 is "keep splitting groups of languages that have
> different analyzers into their own cores" and the answer to #2 is "that's
> not supported", then I'd be curious: where would I start to write my own
> extension that supported #2? I looked last night at the suggest lookup
> classes, dictionary classes, etc. But I didn't see a clear point where it
> would be clean to implement something like I'm suggesting above.
>
> Best Regards,
> Jeremy Thomerson
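Regarding question 2 above: if you are on Solr 4.7 or later, the new SuggestComponent lets you declare several dictionaries inside one core and pick one per request with suggest.dictionary, which maps fairly directly onto the "one dictionary per locale" idea. A rough solrconfig.xml sketch, assuming you copyField each language's suggest text into per-locale fields like suggest_en/suggest_es (those field names are made up, and the text_en/text_es types are the placeholders from the schema sketch above):

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <!-- one suggester (dictionary) per locale; chosen at query time via suggest.dictionary -->
    <lst name="suggester">
      <str name="name">suggest_en</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
      <str name="field">suggest_en</str>
      <str name="suggestAnalyzerFieldType">text_en</str>
    </lst>
    <lst name="suggester">
      <str name="name">suggest_es</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
      <str name="field">suggest_es</str>
      <str name="suggestAnalyzerFieldType">text_es</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

The front end then passes the locale-appropriate dictionary name, e.g. /suggest?suggest.q=barc&suggest.dictionary=suggest_es, so suggestions should not cross languages even though everything lives in the same core. On older versions (the suggester built on the SpellCheckComponent) I don't think there is an equivalent, which is where a custom extension would come in.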