It comes down to how you personally want to weigh the compromises between conflicting requirements, such as the relative cost of false positives versus false negatives. Provide a few use cases that illustrate the boundary cases you care most about - for example, field values that have snippets in one language embedded within larger values in a different language. Also consider whether your fields are always long or sometimes short: the former can work well for language detection, but the latter cannot, unless all fields of a given document are always in the same language.

Otherwise, simply index the same source text in multiple fields, one per language. You can then do a dismax query across that set of fields.
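
Here's a rough SolrJ sketch of what that query might look like - the core URL, the field names (text_en, text_ja, text_da), and the boosts are just placeholders for whatever per-language fields you actually define:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiLangQuery {
  public static void main(String[] args) throws Exception {
    // Placeholder core URL - point this at your own Solr instance.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrQuery q = new SolrQuery("言語-aware");
    // edismax spreads the query across the per-language copies of the
    // same source text; the boosts are arbitrary and worth tuning.
    q.set("defType", "edismax");
    q.set("qf", "text_en^1.0 text_ja^1.0 text_da^1.0");

    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getResults().getNumFound() + " hits");
  }
}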

-- Jack Krupansky

-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words that consist of parts from
different languages, and search queries of the same complexity. It is an
online application used worldwide, so users generate content in all
possible languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, ICUTokenizer and ICUFoldingFilter exist for that.
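
A universal analysis chain along these lines is what I have in mind (a
sketch against the Lucene 4.10 API, untested):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

public final class UniversalAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // ICUTokenizer segments text by script (Han, Latin, Lao, ...);
    // ICUFoldingFilter then applies Unicode case and accent folding.
    Tokenizer source = new ICUTokenizer(reader);
    TokenStream result = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, result);
  }
}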

But such a field still requires stemming and lemmatization.

How can I implement a schema with universal stemming/lemmatization, one
that would probably utilize the ICU-generated token ScriptAttribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
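
The attribute itself is easy to read per token (again a sketch against
the 4.10 API); what I am missing is the part that would dispatch each
token to a script-appropriate stemmer:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ScriptAttributeDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer ts = new ICUTokenizer(new StringReader("言語-aware Løgismose-alike"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ScriptAttribute script = ts.addAttribute(ScriptAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // Prints each token with its detected script name (e.g. Han, Latin).
      // A universal stemming filter could dispatch on script.getCode()
      // to a per-script stemmer at this point.
      System.out.println(term + " -> " + script.getName());
    }
    ts.end();
    ts.close();
  }
}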

By the way, I have already examined the schema that ships with
Basistech's commercial plugins; it defines the tokenizer/filter language
per field type, which is not a universal solution for such complex
multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.
