I agree with the approach Jack suggested: index the same source text into multiple fields, one per language, and then run a dismax query across that set of fields. Would love to hear whether it works for you.
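Roughly what I had in mind, as a sketch only - the field and type names are just examples (text_en/text_ja/text_de are the stock example field types), so adjust them to whatever languages you need:

<!-- one stored source field; one indexed, language-analyzed copy per language -->
<field name="content"    type="string"  indexed="false" stored="true"/>
<field name="content_en" type="text_en" indexed="true"  stored="false"/>
<field name="content_ja" type="text_ja" indexed="true"  stored="false"/>
<field name="content_de" type="text_de" indexed="true"  stored="false"/>

<!-- copy the same source text into every per-language field -->
<copyField source="content" dest="content_en"/>
<copyField source="content" dest="content_ja"/>
<copyField source="content" dest="content_de"/>

On the query side it would be an (e)dismax request over the per-language fields, something like:

/select?q=Løgismose&defType=edismax&qf=content_en+content_ja+content_de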
Thanks,
Susheel

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, September 05, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

It comes down to how you personally want to weigh compromises between conflicting requirements, such as the relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases you care most about - for example, field values that have snippets in one language embedded within larger values in a different language. Also consider whether your fields are always long or sometimes short: the former can work well for language detection, but the latter cannot, unless all fields of a given document are always in the same language.

Otherwise, simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields.

-- Jack Krupansky

-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents containing multilingual words whose components come from different languages, and search queries of the same complexity. It is a worldwide online application, so users generate content in every possible world language. For example:

言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers. Luckily, ICUTokenizer and ICUFoldingFilter exist for that. But the field also needs stemming and lemmatization. How can we implement a schema with universal stemming/lemmatization, probably utilizing the script attribute generated by the ICU tokenizer?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema from their commercial plugins, and it defines the tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely,
Ilia Sretenskii.
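P.S. The universal analysis chain we have in mind so far looks roughly like this (the field type and field names are just examples, and it assumes the analysis-extras contrib with the ICU jars is on the classpath). It still lacks any script-aware stemming/lemmatization, which is exactly the missing piece:

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- segments text per Unicode script using UAX#29 word-break rules -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization, case folding and accent removal -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content_icu" type="text_icu" indexed="true" stored="true"/>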