Hi Sujatha.

I've developed a search system for six different languages. Since it was
implemented on Solr 1.2, all of those languages live in the same index,
using a different field for each so that each one can have its own
analyzer.

Like:
content_chinese
content_english
content_russian
content_arabic

I've also defined a language field that I use to separate those at
query time.
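In schema terms, that setup looks roughly like this (the type names and analyzer choices below are illustrative, not my exact schema):

```xml
<!-- One field type per language, each with its own analysis chain -->
<fieldType name="text_english" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_chinese" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- One content field per language, plus the language discriminator -->
<field name="content_english" type="text_english" indexed="true" stored="true"/>
<field name="content_chinese" type="text_chinese" indexed="true" stored="true"/>
<field name="language" type="string" indexed="true" stored="true"/>
```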

As you are going to implement yours on Solr 1.3, I would rather create
one core per language and keep the schema simpler, without the _language
suffix. Each schema (one per language) would have just, say, a content
field which, depending on its language, uses the appropriate analyzer
and filters.

Having a separate core per language is also good because the scores for
one language won't be affected by the indexing of documents in other
languages.
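On 1.3 the per-language cores could be declared in solr.xml along these lines (core names and instanceDirs are just examples):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Each core gets its own conf/schema.xml, so "content"
         can be typed differently per language -->
    <core name="english" instanceDir="english"/>
    <core name="chinese" instanceDir="chinese"/>
    <core name="russian" instanceDir="russian"/>
    <core name="arabic"  instanceDir="arabic"/>
  </cores>
</solr>
```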

Do you have any requirement to search across all languages, say q=test
where the term should be found in any language? If so, you may want to
look at distributed search to combine your results, or even take the
same approach I did, since I couldn't use multi-core.
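If you do need one query over all the per-language cores, 1.3's distributed search can merge them via the shards parameter; something along these lines (host, port and core names are made up):

```
http://localhost:8983/solr/english/select
    ?q=test
    &shards=localhost:8983/solr/english,localhost:8983/solr/chinese,
            localhost:8983/solr/russian,localhost:8983/solr/arabic
```

Bear in mind that in 1.3 IDF is computed per shard, so scores across languages are only roughly comparable.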

I'm also using the DisMax request handler; it's worth a look, as it
lets you pre-define some base query parts and do score boosting behind
the scenes.
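A dismax handler with pre-defined query fields and a boost function might be registered in solrconfig.xml along these lines (field names and boost values here are purely illustrative):

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Search title and content, weighting title matches higher -->
    <str name="qf">title^2.0 content^1.0</str>
    <!-- Boost more recent documents behind the scenes -->
    <str name="bf">recip(rord(date),1,1000,1000)</str>
  </lst>
</requestHandler>
```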

I hope it helps.

Regards,
Daniel 

-----Original Message-----
From: Sujatha Arun [mailto:suja.a...@gmail.com] 
Sent: 18 December 2008 04:15
To: solr-user@lucene.apache.org
Subject: Re: looking for multilanguage indexing best practice/hint

Hi,

I am prototyping language search using Solr 1.3. I have 3 fields in the
schema: id, content and language.

I am indexing 3 PDF files; the languages are foroyo, Chinese and
Japanese.

I use xpdf to convert the content of the PDFs to text and push the text
to Solr in the content field.

What is the analyzer that I need to use for the above?

By using the default text analyzer and posting this content to Solr, I
am not getting any results.

Does Solr support stemming for the above languages?
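(A likely reason the default text type returns nothing is that its analysis chain is English-oriented and doesn't tokenize Chinese or Japanese usefully; stemming also doesn't really apply to those languages, where bigram tokenization is the usual approach. One possible starting point, sketched here and not tested against this data:)

```xml
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <!-- CJKTokenizer emits overlapping character bigrams, which
         handles Chinese and Japanese without a dictionary -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_cjk" indexed="true" stored="true"/>
```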

Regards
Sujatha




On 12/18/08, Feak, Todd <todd.f...@smss.sony.com> wrote:
>
> Don't forget to consider scaling concerns (if there are any). There 
> are strong differences in the number of searches we receive for each 
> language. We chose to create separate schema and config per language 
> so that we can throw servers at a particular language (or set of 
> languages) if we needed to. We see 2 orders of magnitude difference 
> between our most popular language and our least popular.
>
> -Todd Feak
>
> -----Original Message-----
> From: Julian Davchev [mailto:j...@drun.net]
> Sent: Wednesday, December 17, 2008 11:31 AM
> To: solr-user@lucene.apache.org
> Subject: looking for multilanguage indexing best practice/hint
>
> Hi,
> From my study of Solr and Lucene so far, it seems that I will use a
> single schema... at least I don't see a scenario where I'd need more
> than that.
> So the question is how I approach multilanguage indexing and
> multilanguage searching. Will it really make sense to just search a
> word, or should I supply a lang param to the search as well?
>
> I see there are those filters and I've already been advised on them,
> but I guess the question is more one of best practice.
> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>
> So the solution I see is using copyField to have the same field in
> different langs, or something using a distinct filter.
> Cheers
>
>
>
>

http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
                                        
