Hi Sujatha,

I've developed a search system for 6 different languages. Since it was implemented on Solr 1.2, all of those languages are part of the same index, using a different field for each language so that each one can have its own analyzer.
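[A minimal sketch of what such a per-language-field schema.xml might look like; the type and field names here are illustrative, not taken from the thread:]

```xml
<!-- One field per language, each bound to a language-specific analyzer -->
<types>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_cjk" class="solr.TextField">
    <analyzer>
      <!-- CJK text needs a tokenizer that does not split on whitespace alone -->
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="content_english" type="text_en" indexed="true" stored="true"/>
  <field name="content_chinese" type="text_cjk" indexed="true" stored="true"/>
  <!-- the language field used to filter at query time, e.g. fq=language:english -->
  <field name="language" type="string" indexed="true" stored="true"/>
</fields>
```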
Like: content_chinese, content_english, content_russian, content_arabic. I've also defined a language field that I use to separate those at query time.

As you are going to implement it using Solr 1.3, I would rather create one core per language and keep the schema simpler, without the _language suffix. Each schema (one per language) would have only, say, a content field which, depending on its language, would use the proper analyzer and filters. Having a separate core per language is also good because the scores for one language won't be affected by the indexing of documents in other languages.

Do you have any requirement for searching across all languages, say q=test where the term should be found in any language? If so, you may want to look at distributed search to combine your results, or even take the same approach I took, as I couldn't use multi-core. I'm also using the DisMax request handler; it's worth a look, as it lets you pre-define some base query parts and do score boosting behind the scenes.

I hope it helps.

Regards,
Daniel

-----Original Message-----
From: Sujatha Arun [mailto:suja.a...@gmail.com]
Sent: 18 December 2008 04:15
To: solr-user@lucene.apache.org
Subject: Re: looking for multilanguage indexing best practice/hint

Hi,

I am prototyping language search using Solr 1.3. I have 3 fields in the schema: id, content and language. I am indexing 3 PDF files; the languages are foroyo, Chinese and Japanese. I use xpdf to convert the content of each PDF to text and push the text to Solr in the content field.

What analyzer do I need to use for the above? Using the default text analyzer and posting this content to Solr, I am not getting any results. Does Solr support stemming for the above languages?

Regards,
Sujatha

On 12/18/08, Feak, Todd <todd.f...@smss.sony.com> wrote:
>
> Don't forget to consider scaling concerns (if there are any). There
> are strong differences in the number of searches we receive for each
> language.
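[In Solr 1.3, the one-core-per-language layout Daniel suggests would be declared in solr.xml roughly as below; the core names are hypothetical:]

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- each core has its own conf/schema.xml with a plain "content" field -->
    <core name="english" instanceDir="english"/>
    <core name="chinese" instanceDir="chinese"/>
    <core name="russian" instanceDir="russian"/>
  </cores>
</solr>
```

[A single-language query would then hit one core, e.g. /solr/english/select?q=test, while a search across all languages could use Solr 1.3 distributed search by listing the cores in the shards parameter.]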
> We chose to create a separate schema and config per language
> so that we can throw servers at a particular language (or set of
> languages) if we need to. We see 2 orders of magnitude difference
> between our most popular language and our least popular.
>
> -Todd Feak
>
> -----Original Message-----
> From: Julian Davchev [mailto:j...@drun.net]
> Sent: Wednesday, December 17, 2008 11:31 AM
> To: solr-user@lucene.apache.org
> Subject: looking for multilanguage indexing best practice/hint
>
> Hi,
> From my study of Solr and Lucene so far, it seems that I will use a
> single schema... at least I don't see a scenario where I'd need more than that.
> So the question is how to approach multilanguage indexing and multilanguage
> searching. Will it really make sense to just search for a word, or
> should I supply a lang param to the search as well?
>
> I see there are those filters and was already advised on them, but I guess
> my question is more one of best practice:
> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>
> So the solution I see is, using copyField, having the same field in different
> languages, or something using a distinct filter per language.
> Cheers

http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
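[For reference, the two filter factories Julian names would typically sit together in one analyzer chain like the sketch below; the fieldType name and the Snowball language are made-up examples, and the copyField line shows the routing idea he describes:]

```xml
<fieldType name="text_accented" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- fold accented characters to their ASCII equivalents -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language-specific stemming; pick the Snowball language per field -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<!-- copyField can route one source field into several per-language fields -->
<copyField source="content" dest="content_german"/>
```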