Hi Ilia,

I see that Trey answered your question about how you might stack
language-specific filters in one field.  If I remember correctly, his
approach assumes you have identified the language of the query.  That
is not the same as detecting the script of the query, and it is much
harder.

Trying to do language-specific processing on multiple languages,
especially a large number such as the 200 you mention or the 400 in
HathiTrust, is a very difficult problem.  Detecting language (rather
than script) in short queries is an open problem in the research
literature.  As others have suggested, you might want to start with
something less ambitious that meets most of your business needs.

You also might want to consider whether the errors a stemmer will make
on some queries are worth the increase in recall you will get on
others.  Concern about getting results that can confuse users is one
of the main reasons we haven't seriously pursued stemming in
HathiTrust full-text search.

Regarding the papers listed in my previous e-mail: you can get the
first paper at the link I gave.  The second paper (although on
re-reading it, I don't think it will be very useful) is available if
you go to the link for the code and follow the link on that page for
the paper.

I suspect you might want to think about the differences between
scripts and languages.  Most of the Solr/Lucene stemmers either assume
you are only giving them the language they were designed for, or work
on the basis of script.  That works well when there is only one
language per script, but breaks down when many languages share the
same script, as the languages written in Latin-1 do.

(See this gist for all the URLs below with context:
https://gist.github.com/anonymous/2e1233d80f37354001a3)

That PolyGlotStemming filter uses the ICUTokenizer's script
identification, but there are at least 12 different languages that use
the Arabic script (www.omniglot.com/writing/arabic) and over 100 that
use Latin-1.  Please see the lists of languages and scripts at
aspell.net/man-html/Supported.html#Supported or
www.omniglot.com/writing/langalph.htm#latin
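
To illustrate the gap: the script of a query is easy to recover
mechanically (which is essentially what the ICUTokenizer's
ScriptAttribute gives you), but it tells you nothing about which of
the many languages sharing that script you have.  A toy Python sketch,
purely for illustration and not how ICU does it internally:

  import unicodedata
  from collections import Counter

  def dominant_script(text):
      # Toy script detector: take the leading word of each character's
      # Unicode name, e.g. "LATIN SMALL LETTER A" -> "LATIN".
      scripts = Counter()
      for ch in text:
          if not ch.isalpha():
              continue
          try:
              name = unicodedata.name(ch)
          except ValueError:
              continue
          scripts[name.split(" ")[0]] += 1
      return scripts.most_common(1)[0][0] if scripts else None

  print(dominant_script("Grundgesetz"))  # LATIN  -- but German? English? Dutch?
  print(dominant_script("كتاب"))         # ARABIC -- but Arabic? Persian? Urdu?

Knowing the script narrows the choice of analysis chains, but it still
leaves you guessing among dozens of languages.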

As a simple example, German and English both use the Latin-1 character
set.  Using an English stemmer for German, or a German stemmer for
English, is unlikely to work very well.  If you try to use stop words
for multiple languages, you will run into cases where a stop word in
one language is a content word in another.  For example, if you use
German stop words such as "die", you will eliminate the English
content word "die".
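
To make the "die" example concrete, here is a trivial sketch (the stop
word list is made up for illustration):

  # Illustrative only: applying a German stop word list to an English query.
  german_stopwords = {"der", "die", "das", "und", "nicht"}

  def remove_stopwords(tokens, stopwords):
      return [t for t in tokens if t.lower() not in stopwords]

  print(remove_stopwords("live and let die".split(), german_stopwords))
  # ['live', 'and', 'let']  -- the English content word "die" is gone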

Identifying languages in short texts such as queries is a hard
problem.  About half the papers looking at query language
identification cheat and look at signals such as the language of the
pages a user has clicked on.  If all you have to go on is the text of
the query, language identification is very difficult.  I suspect that
mixed-script queries are even harder (see
www.transacl.org/wp-content/uploads/2014/02/38.pdf).

See the papers by Marco Lui and Tim Baldwin on Marco's web page:
ww2.cs.mu.oz.au/~mlui/
In this paper they explain why short-text language identification is a
hard problem: "Language Identification: the Long and the Short of the
Matter" (www.aclweb.org/anthology/N10-1027)

Other papers available on Marco's page describe the design and
implementation of langid.py, which is a state-of-the-art language
identification program.

I've tried several language guessers designed for short texts, and at
least on queries from our query logs the results were unusable.  Both
langid.py with the defaults (noEnglish.langid.gz, pipe delimited) and
ldig with the most recent latin.model (NonEnglish.ldig.gz, tab
delimited) did not work well for our queries.

However, both of these have parameters that can be tweaked and also
facilities for training if you have labeled data.

ldig is specifically designed to run on short text like queries or tweets.
It can be configured to output the scores for each language instead of
only the highest score (the default).  Also, we didn't try limiting
the list of languages it looks for, and that might give better results.

github.com/shuyo/ldig
langdetect looks like it's by the same programmer and is in Java, but I
haven't tried it:
code.google.com/p/language-detection/

langid is designed by linguistic experts, but may need to be trained
on short queries.
github.com/saffsd/langid.py
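
For what it's worth, here is a minimal sketch of calling langid.py,
including restricting the candidate language set (which, as with ldig
above, might help on short queries).  Treat it as a sketch and check
the project README for the current API and for training on your own
labeled data:

  import langid

  # Restrict the candidate set to languages you actually expect;
  # on one- or two-word queries this alone can change the results.
  langid.set_languages(['en', 'de', 'fr', 'ru', 'ar'])

  print(langid.classify("der die das"))   # e.g. ('de', <score>)

  # rank() returns a score for every candidate language, which is more
  # informative than a single guess when queries are very short.
  for lang, score in langid.rank("die hard"):
      print(lang, score)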

There is also Mike McCandless' port of the Google CLD:

blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html
code.google.com/p/chromium-compact-language-detector/source/browse/README
However, note this comment:
"Indeed I see the same results as you; I think CLD2 is just not
designed for short text."
A similar comment was made in this talk:
videolectures.net/russir2012_grigoriev_language/


If you aren't worried about false drops, your documents are relatively
short, and your use case favors recall over precision, you might want
to look at McNamee and Mayfield's work on language-independent
stemming.  I don't know if their n-gram approach would be feasible for
your use case, but they also got good results on TREC/CLEF newswire
datasets with just truncating words.  We can't use their approach
because we already have a high-recall/low-precision situation, and
because our documents are several orders of magnitude larger than the
TREC/CLEF/FIRE newswire articles they tested with.
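
To give a feel for how simple their baselines are, here is a toy
sketch of the two ideas (truncating each word, or indexing overlapping
character n-grams); the cutoffs below are illustrative, not their
published settings:

  # Toy sketch of language-independent "stemming": truncate each token,
  # or index overlapping character n-grams instead of whole words.

  def truncate(tokens, n=5):
      return [t[:n] for t in tokens]

  def char_ngrams(token, n=4):
      if len(token) <= n:
          return [token]
      return [token[i:i + n] for i in range(len(token) - n + 1)]

  print(truncate("unconcerned concerning concerns".split()))
  # ['uncon', 'conce', 'conce']

  print(char_ngrams("concerning"))
  # ['conc', 'once', 'ncer', 'cern', 'erni', 'rnin', 'ning']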

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing
morphological variation in alphabetic languages. In Proceedings of the
32nd international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '09). ACM, New York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957


Paul McNamee, Charles Nicholas, and James Mayfield. 2008. Don't have a
stemmer?: be un+concern+ed. In Proceedings of the 31st annual
international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '08). ACM, New York, NY, USA, 813-814.
DOI=10.1145/1390334.1390518 http://doi.acm.org/10.1145/1390334.1390518

I hope this helps.

Tom

On Mon, Sep 8, 2014 at 1:33 AM, Ilia Sretenskii <sreten...@multivi.ru> wrote:
> Thank you for the replies, guys!
>
> Using field-per-language approach for multilingual content is the last
> thing I would try since my actual task is to implement a search
> functionality which would implement relatively the same possibilities for
> every known world language.
> The closest references are those popular web search engines, they seem to
> serve worldwide users with their different languages and even
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I really would like to keep a single field for cross-language searchable text
> content, without splitting it into specific language fields or specific
> language cores.
>
> So my current choice will be to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are without any language specific
> stemmers/lemmatizers yet at all.
>
> Probably I will put the most popular languages' stop word filters and
> stemmers into the same searchable text field to give it a try and see
> if it works correctly in a stack.
> Does stacking language-specific filters work correctly in one field?
>
> Further development will most likely involve some advanced custom analyzers
> like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
> ScriptAttribute.
>

> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them?
