Hi,

We are using Google Translate to do something like what you (onlinespending) want to do, so maybe this will help.

During indexing, we store the searchable fields from each document in fields suffixed with _en, _fr, _es, etc. So assuming we capture title and body from each document, the fields are (title_en, body_en), (title_fr, body_fr), etc., with their own analyzer chains. These documents come from a controlled source (i.e. not the web), so we know the language they are authored in.

During searching, a custom component intercepts the client language and the query. The query is sent to Google Translate for language detection. Most of the docs in the corpus are in English, so if the detected language is either English or the client language, we call Google Translate again to translate the query into the other (English or client) language. Another custom component then constructs an OR query between the two languages, with one clause aimed at the _en field set and the other aimed at the _xx (client language) field set.

-sujit

On Oct 9, 2012, at 11:24 PM, Bernd Fehling wrote:

> As far as I know, there is no built-in functionality for language translation.
> I would propose to write one, but there are many many pitfalls.
> If you want to translate from one language to another you might have to
> know the "starting" language. Otherwise you get problems with translation.
>
> Not (German) -> distress (English), affliction (English)
>
> - you might have words in one language which are stopwords in another language
>   "not"
> - you don't have a one to one mapping, it's more like "1 to n+x"
>   toilette (French) -> bathroom, rest room / restroom, powder room
>
> These are just two points which jump into my mind but there are tons of
> pitfalls.
>
> We use the solution of a multilingual thesaurus as synonym dictionary.
> http://en.wikipedia.org/wiki/Eurovoc
> It holds translations of 22 official languages of the European Union.
>
> So a search for "europäischer währungsfonds" gives also results with
> "european monetary fund", "fonds monétaire européen", ...
>
> Regards
> Bernd
>
>
> On 10.10.2012 04:54, onlinespend...@gmail.com wrote:
>> Hi,
>>
>> English is going to be the predominant language used in my documents, but
>> there may be a spattering of words in other languages (such as Spanish or
>> French). What I'd like is to initiate a query for something like "bathroom"
>> for example and for Solr to return documents that not only contain
>> "bathroom" but also "baño" (Spanish). And the same goes when searching for
>> "baño". I'd like Solr to return documents that contain either "bathroom" or
>> "baño".
>>
>> One possibility is to pre-translate all indexed documents to a common
>> language, in this case English. And if someone were to search using a
>> foreign word, I'd need to translate that to English before issuing a query
>> to Solr. This appears to be problematic, since I'd have to know whether the
>> indexed words and the query are even in a foreign language, which is not
>> trivial.
>>
>> Another possibility is to pre-build a list of foreign word synonyms. So baño
>> would be listed as a synonym for bathroom. But I'd need to include other
>> languages (such as toilette in French) and other words. This requires that
>> I know in advance all possible words I'd need to include foreign language
>> versions of (not to mention needing to know which languages to include).
>> This isn't trivial either.
>>
>> I'm assuming there's no built-in functionality that supports the foreign
>> language translation on the fly, so what do people propose?
>>
>> Thanks!
>>
>
> --
> *************************************************************
> Bernd Fehling                Universitätsbibliothek Bielefeld
> Dipl.-Inform. (FH)           LibTec - Bibliothekstechnologie
> Universitätsstr. 25          und Wissensmanagement
> 33615 Bielefeld
> Tel. +49 521 106-4060        bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
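
P.S. A rough sketch of the query-time flow described at the top of this message (detect the query language, translate into the other language, OR the two field sets), in Python for brevity. It is not the actual Solr component. `detect` and `translate` are hypothetical callables standing in for the Google Translate calls, and `build_bilingual_query`, `field_clause` and `quote_phrase` are names made up for illustration. Two things here are assumptions rather than part of the description above: the query text is quoted as a phrase to keep the sketch simple, and when the detected language is neither English nor the client language the sketch falls back to searching the English fields with the untranslated query.

```python
from typing import Callable

FIELDS = ("title", "body")  # the two searchable fields named in the post


def quote_phrase(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def field_clause(text: str, lang: str) -> str:
    """OR the same text across every field of one language's field set."""
    parts = [f"{field}_{lang}:{quote_phrase(text)}" for field in FIELDS]
    return "(" + " OR ".join(parts) + ")"


def build_bilingual_query(
    query: str,
    client_lang: str,
    detect: Callable[[str], str],               # hypothetical: text -> language code
    translate: Callable[[str, str, str], str],  # hypothetical: (text, source, target) -> text
) -> str:
    detected = detect(query)
    if client_lang == "en" or detected not in ("en", client_lang):
        # Assumption (not stated in the post): with no second language to
        # pair with, search the English fields with the query as typed.
        return field_clause(query, "en")
    # Translate into whichever of the two languages the query is NOT in.
    other = client_lang if detected == "en" else "en"
    translated = translate(query, detected, other)
    # One clause per language, each aimed at its own field set.
    return field_clause(query, detected) + " OR " + field_clause(translated, other)
```

With stub callables that detect "es" and translate "baño" to "bathroom", a Spanish client searching for baño gets a query that hits title_es/body_es with "baño" and title_en/body_en with "bathroom".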