I've done this. There are four cases for the tokens in the search
index:

1. Tokens that are unique after stemming (this is good).
2. Tokens that are shared across languages after stemming (usually
   trademarks, like LaserJet).
3. Tokens with collisions after stemming:
   German "mit", "MIT" the university
   German "Boot" (boat), English "boot" (a heavy shoe)
4. Tokens with collisions in the surface form:
   Dutch "mobile" (plural of furniture), English "mobile"
   German "die" (stemmed to "das"), English "die"

You cannot fix every spurious match, but you can do OK with
stemmed fields for each language and a raw (unstemmed surface
token) field.

I won't recommend weights, but you could have fields for
text_en, text_de, and text_raw, for example.
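
As a rough sketch of how a query might be spread across those three
fields, the snippet below builds request parameters for Solr's dismax
query parser, listing every field in qf. The field names are the
example names above, the helper name build_query_params is made up
for illustration, and no boosts are set, since the right weights
depend on your data.

```python
from urllib.parse import urlencode

# Example field names from the text; your schema may use others.
FIELDS = ["text_en", "text_de", "text_raw"]


def build_query_params(user_query, fields=FIELDS):
    """Return URL-encoded parameters for a dismax search over all fields."""
    params = {
        "defType": "dismax",     # dismax searches every field listed in qf
        "q": user_query,
        "qf": " ".join(fields),  # no boosts here; tune weights for your data
    }
    return urlencode(params)


print(build_query_params("muenchen"))
# defType=dismax&q=muenchen&qf=text_en+text_de+text_raw
```

Each field applies its own analysis chain to the query text, so the
same query is stemmed as English, stemmed as German, and matched raw.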

You really cannot automatically determine the language of a
query, mostly because of proper nouns, especially trademarks.
Identify the language of these queries:

* Google
* LaserJet
* Obama
* Las Vegas
* Paris

HTTP supports an Accept-Language header, but I have no idea
how often that is sent. We honored that in Ultraseek, mostly
because it was standard.
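
If you do want to honor the header, a minimal sketch of reading it
could look like the following. This is not the Ultraseek code; the
function name parse_accept_language is made up, and it only handles
the common "tag;q=weight" form, dropping entries with a weight of
zero or a malformed weight.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # malformed weight: treat as unwanted
        if quality > 0:
            weighted.append((-quality, position, tag))
    # Sort by descending quality, keeping header order for ties.
    return [tag for _, _, tag in sorted(weighted)]


print(parse_accept_language("de-DE,de;q=0.9,en;q=0.8"))
# ['de-DE', 'de', 'en']
```

Treat the result as a hint for ranking or for choosing the interface
language, not as proof of what language the query itself is in.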

Finally, if you are working with localization, please take the
time to understand the difference between ISO language codes
and ISO country codes.
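
To make the distinction concrete, here is a small sketch that splits
a locale tag into its language part (ISO 639) and its country part
(ISO 3166). The helper name split_locale is made up, and it assumes
simple tags such as "de-CH" or "pt_BR"; it does not validate the codes
against the ISO lists.

```python
def split_locale(tag):
    """Split a tag like 'de-CH' into (language, country or None)."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()      # ISO 639 language code, e.g. "de"
    country = None
    for part in parts[1:]:
        if len(part) == 2 and part.isalpha():
            country = part.upper()   # ISO 3166 country code, e.g. "CH"
            break
    return language, country


print(split_locale("de-CH"))  # ('de', 'CH')
print(split_locale("pt_BR"))  # ('pt', 'BR')
print(split_locale("en"))     # ('en', None)
```

The two codes often differ: Swedish is "sv" but Sweden is "SE", and
Japanese is "ja" but Japan is "JP". Pick the analyzer by language,
not by country.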

wunder

On 1/28/09 4:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> I'm not entirely sure about the fine points, but consider the
> filters that are available that fold all the diacritics into their
> low-ascii equivalents. Perhaps using that filter at *both* index
> and search time on the English index would do the trick.
> 
> In your example, both would be 'munchen'. Straight English
> would be unaffected by the filter, but any German words with
> diacritics that crept in would be folded into their low-ascii
> "equivalents". This would also work at index time, just in case
> you indexed English text that had some German words.
> 
> NOTE: My experience is more on the Lucene side than the SOLR
> side, but I'm sure the filters are available.
> 
> Best
> Erick
> 
> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev <j...@drun.net> wrote:
> 
>> Hi,
>> I currently have two indexes with Solr, one for the English version and
>> one for the German version. They use the English and German2 snowball
>> factories, respectively.
>> Right now I query the corresponding index depending on which language
>> the website is currently in.
>> There is a requirement, though, that content is found regardless of
>> which language it is in.
>> So, for example, a search for muenchen (which the German snowball
>> factory correctly matches as münchen) should also find results in the
>> English index. Right now it does not, as I suppose the English factory
>> doesn't really care about umlauts.
>> 
>> Any pointers are more than welcome. I am considering synonyms, but
>> those would be too heavy to create and maintain.
>> Cheers,
>> JD
>> 
