It comes down to how you want to weigh compromises between conflicting
requirements, such as the relative weighting of false positives and false
negatives. Provide a few use cases that illustrate the boundary cases you
care most about - for example, field values that have snippets in one
language embedded within larger values in a different language. Also
consider whether your fields are always long or sometimes short: language
detection can work well on the former but not on the latter, unless all
fields of a given document are always in the same language.
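If you do go the detection route, Solr's langid contrib can tag each
document with its detected language at index time. A rough sketch for
solrconfig.xml - the field names here are placeholders, not a recipe:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- placeholder names for the fields to inspect -->
    <str name="langid.fl">title,body</str>
    <!-- field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Enable it by passing update.chain=langid to your update handler. Keep the
caveat above in mind: detection accuracy drops off sharply on short values.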
Otherwise, simply index the same source text in multiple fields, one for
each language. You can then do a dismax query across that set of fields,
along the lines of the sketch below.
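For example - field and type names are placeholders, and text_en, text_ja,
etc. would each be defined with the appropriate per-language analysis chain:

<field name="text_src" type="string" indexed="false" stored="true"/>
<field name="text_en" type="text_en" indexed="true" stored="false"/>
<field name="text_ja" type="text_ja" indexed="true" stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_ja"/>

Then query with something like (spaces URL-encoded in a real request):

q=...&defType=edismax&qf=text_en text_ja

Each per-language field analyzes the same source text with its own
stemmer, and dismax scores a document by its best-matching field.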
-- Jack Krupansky
-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?
Hello.
We have documents containing multilingual words that consist of parts in
different languages, and search queries of the same complexity. It is an
online application used worldwide, so users generate content in all
possible world languages.
For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent
So I guess our schema requires a single field with universal analyzers.
Luckily, ICUTokenizer and ICUFoldingFilter exist for exactly that.
But the field then also requires stemming and lemmatization.
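For reference, the field type I have in mind is something like this (a
sketch; the ICU factories come from the analysis-extras contrib):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- script-aware Unicode tokenization (UAX#29) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization, case folding, diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>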
How can we implement a schema with universal stemming/lemmatization,
presumably utilizing the script attribute that the ICU tokenizer attaches
to each token?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
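Is a custom filter along these lines the intended approach? A rough sketch
(the class is hypothetical, and it only handles Latin script with an
English stemmer; a real version would dispatch per script and language):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;
import com.ibm.icu.lang.UScript;

/** Sketch: stems only tokens that ICUTokenizer tagged as Latin script. */
public final class ScriptAwareStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();

  public ScriptAwareStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // ICUTokenizer sets the script attribute per token, so a mixed-script
    // value like "Løgismose-alike" can be handled token by token.
    if (scriptAtt.getCode() == UScript.LATIN) {
      stemmer.setCurrent(termAtt.toString());
      if (stemmer.stem()) {
        termAtt.setEmpty().append(stemmer.getCurrent());
      }
    }
    return true;
  }
}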
By the way, I have already examined the schema shipped with the Basistech
commercial plugins; it defines the tokenizer/filter language per field
type, which is not a universal solution for such complex multilingual
texts.
Please advise how to address this task.
Sincerely, Ilia Sretenskii.