In one of his talks, Trey Grainger (author of Solr in Action) touches on how CareerBuilder deals with multilingual content using payloads. It is a little more work, but I think it would pay off.

On Sep 8, 2014, at 7:58 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> You also need to take a stance as to whether you wish to auto-detect the
> language at query time vs. have a UI selection of language vs. attempt to
> perform the same query for each available language and then "determine" which
> has the best "relevancy". The latter two options are very sensitive to short
> queries. Keep in mind that auto-detection for indexing full documents is a
> different problem than auto-detection for very short queries.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Ilia Sretenskii
> Sent: Sunday, September 7, 2014 10:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to implement multilingual word components fields schema?
>
> Thank you for the replies, guys!
>
> Using field-per-language approach for multilingual content is the last
> thing I would try since my actual task is to implement a search
> functionality which would implement relatively the same possibilities for
> every known world language.
> The closest references are those popular web search engines, they seem to
> serve worldwide users with their different languages and even
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I really would like to keep a single field for cross-language searchable text
> content, without splitting it into specific language fields or specific
> language cores.
>
> So my current choice will be to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are without any language specific
> stemmers/lemmatizers yet at all.
>
> Probably I will put the most popular languages stop words filters and
> stemmers into the same one searchable text field to give it a try and see
> if it works correctly in a stack.
> Does stacking language-specific filters work correctly in one field?
>
> Further development will most likely involve some advanced custom analyzers
> like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
> ScriptAttribute.
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
> https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
>
> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them?
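
For illustration only, here is a rough Python sketch of the script-based dispatch idea mentioned in the quoted message: tokens in a single field are routed to a stemmer chosen by their Unicode script. This is not the Lucene filter linked above and does not use any Solr API. The names `token_script`, `strip_suffixes`, `STEMMERS` and `analyze` are hypothetical, the suffix lists are toy data, and the script guess (first word of the character's Unicode name) is only a stand-in for the ScriptAttribute set by ICUTokenizer.

```python
import unicodedata


def token_script(token):
    """Guess a token's script from the Unicode name of its first letter.

    Heuristic stand-in for Lucene's ScriptAttribute (an assumption of this
    sketch): "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
    """
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def strip_suffixes(suffixes):
    """Build a toy 'stemmer' that removes the first matching suffix."""
    def stem(token):
        for suffix in suffixes:
            # Keep at least three characters of stem.
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token
    return stem


# Hypothetical per-script stemmers; a real setup would plug in Snowball
# or similar stemmers here instead of these toy suffix lists.
STEMMERS = {
    "LATIN": strip_suffixes(["ing", "es", "s"]),
    "CYRILLIC": strip_suffixes(["ами", "ов", "ы"]),
}


def analyze(text):
    """Lowercase, split on whitespace, then stem each token by its script."""
    tokens = text.lower().split()
    return [STEMMERS.get(token_script(t), lambda t: t)(t) for t in tokens]


if __name__ == "__main__":
    print(analyze("Searching книгами 東京"))
```

Note that a script is not a language: Latin script alone covers English, French, German and many others, so one stemmer per script is crude. Scripts with no registered stemmer (CJK in this sketch) pass through unchanged, which is roughly what an ICUTokenizer plus ICUFoldingFilter chain without stemmers would give you.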