Thanks Erick, I think I will go with different fields per language, since I want to use different stop words, analyzers, etc. for each one. I might also consider a schema per language so scaling is more flexible, as was already suggested, but I guess that only really makes sense once I have more than one server; otherwise all the other data just gets duplicated for no reason. We have already decided that the language will be passed with each search, so searching a query across all languages won't be needed.
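Roughly what I have in mind for the schema, as an untested sketch (the field and type names like content_en / content_zh are just mine for illustration, and the exact filter attributes would need checking against the 1.3 docs):

    <!-- schema.xml sketch: one content field per language, each with
         its own analysis chain. Names and stopword files illustrative. -->
    <types>
      <fieldType name="text_en" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"
                  words="stopwords_en.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        </analyzer>
      </fieldType>
      <fieldType name="text_cjk" class="solr.TextField">
        <!-- Lucene's CJKAnalyzer can be plugged in directly by class name;
             it may require the lucene-analyzers contrib jar on Solr's
             classpath, since it doesn't seem to ship in core. -->
        <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
      </fieldType>
    </types>
    <fields>
      <field name="id"         type="string"   indexed="true" stored="true" required="true"/>
      <field name="language"   type="string"   indexed="true" stored="true"/>
      <field name="content_en" type="text_en"  indexed="true" stored="true"/>
      <field name="content_zh" type="text_cjk" indexed="true" stored="true"/>
      <field name="content_ja" type="text_cjk" indexed="true" stored="true"/>
    </fields>

At query time the application would then pick the right content_* field based on the lang param it already passes in.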
As for CJKAnalyzer, at first glance it doesn't seem to ship with Solr (haven't tried yet), and since I am a noob in Java I will check how it's done (the schema sketch above shows one way it might be wired in). Will definitely give it a try. Thanks a lot for the help.

Erick Erickson wrote:
> See the CJKAnalyzer for a start, StandardAnalyzer won't
> help you much.
>
> Also, tell us a little more about your requirements. For instance,
> if a user submits a query in Japanese, do you want to search
> across documents in the other languages too? And will you want
> to associate different analyzers with the content from different
> languages? You really have two options:
>
> If you want different analyzers used with the different languages,
> you probably have to index the content in different fields. That is,
> a Chinese document would have a chinese_content field, a Japanese
> document would have a japanese_content field, etc. Now you can
> associate a different analyzer with each *_content field.
>
> If the same analyzer would work for all three languages, you
> can just index all the content in a "content" field, and if you
> need to restrict searching to the language in which the query
> was submitted, you could always add a clause on the
> language, e.g. AND language:chinese
>
> Hope this helps
> Erick
>
> On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <suja.a...@gmail.com> wrote:
>
>> Hi,
>>
>> I am prototyping language search using Solr 1.3. I have 3 fields in the
>> schema: id, content and language.
>>
>> I am indexing 3 PDF files; the languages are foroyo, Chinese and Japanese.
>>
>> I use xpdf to convert the content of the PDFs to text and push the text
>> to Solr in the content field.
>>
>> What is the analyzer that I need to use for the above?
>>
>> By using the default text analyzer and posting this content to Solr, I am
>> not getting any results.
>>
>> Does Solr support stemming for the above languages?
>>
>> Regards
>> Sujatha
>>
>> On 12/18/08, Feak, Todd <todd.f...@smss.sony.com> wrote:
>>
>>> Don't forget to consider scaling concerns (if there are any). There are
>>> strong differences in the number of searches we receive for each
>>> language. We chose to create a separate schema and config per language
>>> so that we can throw servers at a particular language (or set of
>>> languages) if we need to. We see 2 orders of magnitude difference
>>> between our most popular language and our least popular.
>>>
>>> -Todd Feak
>>>
>>> -----Original Message-----
>>> From: Julian Davchev [mailto:j...@drun.net]
>>> Sent: Wednesday, December 17, 2008 11:31 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: looking for multilanguage indexing best practice/hint
>>>
>>> Hi,
>>> From my study of Solr and Lucene so far it seems that I will use a
>>> single schema... at least I don't see a scenario where I'd need more
>>> than that. So the question is how do I approach multilanguage indexing
>>> and multilanguage searching. Will it really make sense to just search a
>>> word, or should I rather supply a lang param to the search as well?
>>>
>>> I see there are those filters and was already advised on them, but I
>>> guess the question is more one of best practice:
>>> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>>>
>>> So the solution I see is using copyField to have the same field in
>>> different langs, or something using a distinct filter per language.
>>> Cheers
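P.S. For the single-field option Erick mentions, here is a query-time sketch (host, port and the example query term are made up; the %E4%B8%AD%E6%96%87 below is just a URL-encoded sample Chinese query):

    # restrict the user's query with an extra language clause
    http://localhost:8983/solr/select?q=content:(%E4%B8%AD%E6%96%87)+AND+language:chinese

    # or keep the query clean and use a filter query instead
    http://localhost:8983/solr/select?q=content:(%E4%B8%AD%E6%96%87)&fq=language:chinese

The fq variant has the nice side effect that the language restriction gets cached separately from the query itself.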