Hi Ilia,

I don't know if it will be helpful, but below I've listed some academic
papers on this issue of how best to deal with mixed-language/mixed-script
queries and documents.  They probably take a more complex approach than you
will want to use, but perhaps they will help you think about the various
ways of approaching the problem.

We haven't tackled this problem yet. We have over 200 languages.  Currently
we are using the ICUTokenizer and ICUFoldingFilter but don't do any
stemming, due to a concern about overstemming (we already have very high
recall, so we don't want to hurt precision by stemming) and the difficulty
of correctly identifying the language of short queries.
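In case it's useful, our field type is essentially just the stock ICU
analysis chain. A minimal sketch (the field type name and attribute values
here are only examples, not our exact schema):

```xml
<!-- Script-agnostic analysis: ICU tokenization + ICU folding, no stemming -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```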

If you have scripts that are used by only one of your languages, however,
you might be able to do much more.  I'm not sure if I'm remembering
correctly, but I believe some of the stemmers, such as the Greek stemmer,
will pass through any strings that don't contain characters in the Greek
script.  So it might be possible to at least do stemming on some of your
languages/scripts.
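To illustrate the pass-through idea (this is not Lucene's actual stemmer
code, just a Python sketch with a made-up toy stemmer), a script-aware
stemmer can leave any token untouched whose letters fall outside its script:

```python
def is_greek(token):
    # True only if every letter falls in the Greek and Coptic block
    # (U+0370-U+03FF) or Greek Extended block (U+1F00-U+1FFF).
    return all(0x0370 <= ord(c) <= 0x03FF or 0x1F00 <= ord(c) <= 0x1FFF
               for c in token if c.isalpha())

def stem_if_greek(token, stemmer):
    # Stem Greek-script tokens; pass everything else through unchanged,
    # mirroring the behavior described above.
    return stemmer(token) if is_greek(token) else token

# Toy stemmer for illustration only: strips one common plural ending.
def toy_stem(token):
    return token[:-2] if token.endswith("οι") else token

stem_if_greek("άνθρωποι", toy_stem)   # → "άνθρωπ" (stemmed)
stem_if_greek("aware", toy_stem)      # → "aware" (passed through)
```

With this kind of dispatch per token, a mixed token stream can flow through
several script-specific stemmers, each only touching its own script.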

I'll be very interested to learn what approach you end up using.

Tom

------

Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and
weighting of multilingual and mixed documents. In *Proceedings of the South
African Institute of Computer Scientists and Information Technologists
Conference on Knowledge, Innovation and Leadership in a Diverse,
Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA,
161-170. DOI=10.1145/2072221.2072240
http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here:
http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo
Rosso. 2014. Query expansion for mixed-script information retrieval.
In *Proceedings
of the 37th international ACM SIGIR conference on Research & development in
information retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686.
DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622

Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii <sreten...@multivi.ru>
wrote:

> Hello.
> We have documents with multilingual words whose parts come from different
> languages, and search queries of the same complexity, and it is a
> worldwide used online application, so users generate content in all the
> possible world languages.
>
> For example:
> 言語-aware
> Løgismose-alike
> ຄໍາຮ້ອງສະຫມັກ-dependent
>
> So I guess our schema requires a single field with universal analyzers.
>
> Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.
>
> But then it requires stemming and lemmatization.
>
> How to implement a schema with universal stemming/lemmatization which would
> probably utilize the ICU generated token script attribute?
>
> http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
>
> By the way, I have already examined the Basistech schema of their
> commercial plugins and it defines tokenizer/filter language per field type,
> which is not a universal solution for such complex multilingual texts.
>
> Please advise how to address this task.
>
> Sincerely, Ilia Sretenskii.
>
