I've done this. There are four cases for the tokens in the search index:

1. Tokens that are unique after stemming (this is good).
2. Tokens that are common after stemming (usually trademarks, like LaserJet).
3. Tokens with collisions after stemming: German "mit" vs. "MIT" the university; German "Boot" (boat) vs. English "boot" (a heavy shoe).
4. Tokens with collisions in the surface form: Dutch "mobile" (plural of furniture) vs. English "mobile"; German "die" (stemmed to "das") vs. English "die".
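To make the collision cases concrete, here is a toy Python sketch of a per-field inverted index (the field names text_en, text_de, and text_raw follow the example below; the documents and helper function are invented for illustration). Because terms are keyed by field as well as token, German "Boot" and English "boot" never collide:

```python
from collections import defaultdict

# Toy inverted index: terms are stored per (field, token) pair, so the
# same surface token indexed under different language fields never collides.
index = defaultdict(set)

def add(doc_id, field, token):
    index[(field, token)].add(doc_id)

# German document containing "Boot" (boat)
add("doc_de", "text_de", "boot")   # German-stemmed form
add("doc_de", "text_raw", "Boot")  # raw surface form

# English document containing "boot" (footwear)
add("doc_en", "text_en", "boot")   # English-stemmed form
add("doc_en", "text_raw", "boot")  # raw surface form

# Searching the English stemmed field matches only the English document:
print(index[("text_en", "boot")])   # {'doc_en'}
# The raw field preserves case here, so "Boot" stays German:
print(index[("text_raw", "Boot")])  # {'doc_de'}
```

A real Solr schema does the same thing with multiple analyzed copies of one source field; the point of the sketch is only that per-language fields turn a token collision into two distinct index entries.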
You cannot fix every spurious match, but you can do OK with stemmed fields for each language plus a raw (unstemmed surface token) field. I won't recommend weights, but you could have fields for text_en, text_de, and text_raw, for example.

You really cannot automatically determine the language of a query, mostly because of proper nouns, especially trademarks. Try to identify the language of these queries:

* Google
* LaserJet
* Obama
* Las Vegas
* Paris

HTTP supports an Accept-Language header, but I have no idea how often it is sent. We honored it in Ultraseek, mostly because it was standard.

Finally, if you are working with localization, please take the time to understand the difference between ISO language codes and ISO country codes.

wunder

On 1/28/09 4:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> I'm not entirely sure about the fine points, but consider the
> filters that are available that fold all the diacritics into their
> low-ascii equivalents. Perhaps using that filter at *both* index
> and search time on the English index would do the trick.
>
> In your example, both would be 'munchen'. Straight English
> would be unaffected by the filter, but any German words with
> diacritics that crept in would be folded into their low-ascii
> "equivalents". This would also work at index time, in case
> you indexed English text that contained some German words.
>
> NOTE: My experience is more on the Lucene side than the SOLR
> side, but I'm sure the filters are available.
>
> Best
> Erick
>
> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev <j...@drun.net> wrote:
>
>> Hi,
>> I currently have two indexes with Solr: one for the English version
>> and one for the German version. They use the English and German2
>> snowball factories, respectively.
>> Right now, depending on which language the website is in, I query
>> the corresponding index.
>> There is a requirement, though, that content is found regardless of
>> which language it is in.
>> So, for example, searching for "muenchen" (which the German snowball
>> factory correctly treats as "münchen") in the English index should
>> find it. Right now it does not, as I suppose the English factory
>> doesn't really care about umlauts.
>>
>> Any pointers are more than welcome. I am considering synonyms, but
>> that would be rather heavy to create and maintain.
>> Cheers,
>> JD
>>
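Erick's diacritic-folding suggestion can be sketched in a few lines of stdlib Python (this only approximates what a folding filter such as Lucene's ASCIIFoldingFilter does; it is not the actual filter):

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    # Decompose each character (e.g. "ü" -> "u" + combining diaeresis),
    # then drop the combining marks, keeping the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("münchen"))  # munchen
print(fold_diacritics("café"))     # cafe
```

Note one caveat for the "muenchen" example: plain folding maps "münchen" to "munchen" but leaves "muenchen" unchanged, since "ue" contains no diacritic. The German2 snowball variant additionally treats "ae", "oe", and "ue" as the umlauted vowels, which is why the transliterated query works on the German index; the English index would need both the folding filter and some equivalent "ue" handling to match.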