Re: How to implement multilingual word components fields schema?

Jorge Luis Betancourt Gonzalez Mon, 08 Sep 2014 07:35:51 -0700

In one of his talks, Trey Grainger (author of Solr in Action) touches on how 
CareerBuilder deals with multilingual content using payloads. It's a little 
more work, but I think it would pay off.


On Sep 8, 2014, at 7:58 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> You also need to take a stance as to whether you wish to auto-detect the 
> language at query time vs. have a UI selection of language vs. attempt to 
> perform the same query for each available language and then "determine" which 
> has the best "relevancy". The latter two options are very sensitive to short 
> queries. Keep in mind that auto-detection for indexing full documents is a 
> different problem than auto-detection for very short queries.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Ilia Sretenskii
> Sent: Sunday, September 7, 2014 10:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to implement multilingual word components fields schema?
> 
> Thank you for the replies, guys!
> 
> Using a field-per-language approach for multilingual content is the last
> thing I would try, since my actual task is to implement search
> functionality that offers roughly the same capabilities for
> every known world language.
> The closest references are the popular web search engines; they seem to
> serve worldwide users with their different languages and even
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I really would like to keep a single field for cross-language searchable text
> content, without splitting it into specific language fields or specific
> language cores.
> 
> So my current choice will be to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are, without any language-specific
> stemmers/lemmatizers at all for now.
> 
> I will probably put stop word filters and stemmers for the most popular
> languages into the same searchable text field to give it a try and see
> if they work correctly as a stack.
> Does stacking language-specific filters work correctly in one field?
> 
> Further development will most likely involve some advanced custom analyzers
> like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
> ScriptAttribute.
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
> https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
> 
> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them? 
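
To make the script-keyed stemming idea from the quoted message a bit more
concrete, here is a minimal sketch in Python. It is not the code of the linked
SimplePolyGlotStemmingTokenFilter and does not use Lucene or ICU; the function
names and the toy suffix lists are made up for illustration. It guesses each
token's script from Unicode character names and picks a stemmer by script,
passing the token through unchanged when no stemmer is registered.

```python
import unicodedata


def script_of(token):
    """Guess a token's script from the Unicode name of its first letter."""
    for ch in token:
        if ch.isalpha():
            # Character names begin with the script, e.g. "LATIN SMALL LETTER A".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def strip_suffixes(suffixes):
    """Build a toy stemmer that removes the first matching suffix.

    Hypothetical stand-in for a real stemmer; it keeps a stem of at
    least three characters.
    """
    def stem(token):
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token
    return stem


# Assumed mapping from script to stemmer; the suffix lists are illustrative only.
STEMMERS = {
    "LATIN": strip_suffixes(("ing", "es", "s")),
    "CYRILLIC": strip_suffixes(("ами", "ов", "ы")),
}


def polyglot_stem(tokens):
    """Stem each token with the stemmer registered for its script, if any."""
    stemmed = []
    for token in tokens:
        stemmer = STEMMERS.get(script_of(token))
        stemmed.append(stemmer(token) if stemmer else token)
    return stemmed


print(polyglot_stem(["searching", "книгами", "日本語", "2014"]))
```

With these toy rules, "searching" and "книгами" are stemmed while the CJK token
and the number pass through untouched. Note that script is not language: a
Latin-script token could be English, French, or Turkish, so choosing by script
alone only helps where one script maps to one stemmer you are willing to apply.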

