I agree with the approach Jack suggested: index the same source text in
multiple fields, one per language, and then run a dismax query across them.
Would love to hear whether it works for you.

Thanks,
Susheel

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, September 05, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

It comes down to how you personally want to weigh the compromises between 
conflicting requirements, such as the relative weighting of false positives 
and false negatives. Provide a few use cases that illustrate the boundary 
cases you care most about - for example, field values that have snippets in 
one language embedded within larger values in a different language. Also 
consider whether your fields are always long or sometimes short: the former 
can work well for language detection, but not the latter, unless all fields 
of a given document are always in the same language.

Otherwise simply index the same source text in multiple fields, one for each 
language. You can then do a dismax query on that set of fields.
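As a sketch (the field and type names below are illustrative, not from your
actual schema), the multi-field approach might look like this in schema.xml,
with one language-specific field type per target language:

```xml
<!-- Hypothetical fragment: copy a single source field into one field
     per language, each analyzed with that language's own chain. -->
<field name="body"    type="string"  indexed="false" stored="true"/>
<field name="body_en" type="text_en" indexed="true"  stored="false"/>
<field name="body_ja" type="text_ja" indexed="true"  stored="false"/>
<field name="body_da" type="text_da" indexed="true"  stored="false"/>
<copyField source="body" dest="body_en"/>
<copyField source="body" dest="body_ja"/>
<copyField source="body" dest="body_da"/>
```

A query can then target all of them at once, e.g.
defType=edismax&qf=body_en body_ja body_da, letting each language's analysis
chain compete and the best-scoring field win.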

-- Jack Krupansky

-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents containing multilingual words whose parts come from 
different languages, and search queries of the same complexity. It is an 
online application used worldwide, so users generate content in all the 
possible world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.
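A minimal field type built on those two components might look like the
following (a sketch - the type and field names are mine, and the ICU
factories live in the analysis-extras contrib, so its jars must be on the
classpath):

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Script-aware tokenization for mixed Latin/CJK/Lao/etc. input -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case/diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_icu" indexed="true" stored="true"/>
```

This tokenizes and folds any script uniformly, but performs no stemming.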

But then it requires stemming and lemmatization.

How can I implement a schema with universal stemming/lemmatization, 
presumably one that utilizes the script attribute the ICU tokenizer attaches 
to each token?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
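One conceivable way to use that attribute is a custom TokenFilter that
inspects the script of each token and delegates to per-script handling. The
sketch below compiles against the Lucene ICU module; the class name and the
routing choices are hypothetical, and the actual stemmer calls are left as
comments:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.ibm.icu.lang.UScript;

/** Hypothetical filter that routes each token to script-specific handling. */
public final class ScriptRoutingStemFilter extends TokenFilter {
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public ScriptRoutingStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // ICUTokenizer has already set the script code for this token.
    switch (scriptAtt.getCode()) {
      case UScript.LATIN:
        // e.g. apply a Latin-script stemmer to termAtt here
        break;
      case UScript.HAN:
        // e.g. CJK-specific handling here
        break;
      default:
        // unrecognized script: pass the token through unchanged
        break;
    }
    return true;
  }
}
```

The open problem, of course, is that script does not identify language (Latin
script alone covers English, Danish, and many others), so script-based
routing can only approximate true per-language lemmatization.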

By the way, I have already examined the schema that Basistech ships with its 
commercial plugins; it defines the tokenizer/filter language per field type, 
which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.

