It comes down to how you want to weigh compromises between conflicting
requirements, such as the relative weighting of false positives and false
negatives. Provide a few use cases that illustrate the boundary cases you
care most about - for example, field values that have snippets in one
language embedded within larger values in a different language. Also
consider whether your fields are always long or sometimes short: language
detection can work well on the former but not on the latter, unless all
fields of a given document are always in the same language.
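If you do go the detection route, Solr's langid contrib can tag each
document with its detected language at index time. A rough sketch for
solrconfig.xml - the field names here are placeholders, not a recipe:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- placeholder names for the fields to inspect -->
    <str name="langid.fl">title,body</str>
    <!-- field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Enable it by passing update.chain=langid to your update handler. Keep the
caveat above in mind: detection accuracy drops off sharply on short values.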
Otherwise, simply index the same source text in multiple fields, one for
each language. You can then do a dismax query across that set of fields,
along the lines of the sketch below.
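For example - field and type names are placeholders, and text_en, text_ja,
etc. would each be defined with the appropriate per-language analysis chain:

<field name="text_src" type="string" indexed="false" stored="true"/>
<field name="text_en" type="text_en" indexed="true" stored="false"/>
<field name="text_ja" type="text_ja" indexed="true" stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_ja"/>

Then query with something like (spaces URL-encoded in a real request):

q=...&defType=edismax&qf=text_en text_ja

Each per-language field analyzes the same source text with its own
stemmer, and dismax scores a document by its best-matching field.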
-- Jack Krupansky
-----Original Message-----
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?
Hello.
We have documents containing multilingual words that consist of parts in
different languages, and search queries of the same complexity. It is an
online application used worldwide, so users generate content in all
possible world languages.
For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent
So I guess our schema requires a single field with universal analyzers.
Luckily, ICUTokenizer and ICUFoldingFilter exist for exactly that.
But the field then also requires stemming and lemmatization.
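For reference, the field type I have in mind is something like this (a
sketch; the ICU factories come from the analysis-extras contrib):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- script-aware Unicode tokenization (UAX#29) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization, case folding, diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>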
How can we implement a schema with universal stemming/lemmatization,
presumably utilizing the script attribute that the ICU tokenizer attaches
to each token?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
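Is a custom filter along these lines the intended approach? A rough sketch
(the class is hypothetical, and it only handles Latin script with an
English stemmer; a real version would dispatch per script and language):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;
import com.ibm.icu.lang.UScript;

/** Sketch: stems only tokens that ICUTokenizer tagged as Latin script. */
public final class ScriptAwareStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();

  public ScriptAwareStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // ICUTokenizer sets the script attribute per token, so a mixed-script
    // value like "Løgismose-alike" can be handled token by token.
    if (scriptAtt.getCode() == UScript.LATIN) {
      stemmer.setCurrent(termAtt.toString());
      if (stemmer.stem()) {
        termAtt.setEmpty().append(stemmer.getCurrent());
      }
    }
    return true;
  }
}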
By the way, I have already examined the schema shipped with the Basistech
commercial plugins; it defines the tokenizer/filter language per field
type, which is not a universal solution for such complex multilingual
texts.
Please advise how to address this task.
Sincerely, Ilia Sretenskii.