Thank you both for the pointers. For now I am handling it with fuzzy search.
Let's hope this will do for some time :)
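
For anyone curious, the fuzzy matching I mean is just the tilde syntax in the
standard query parser; a hypothetical example (the field name "text" and the
0.7 minimum-similarity threshold are placeholders, not my actual setup):

    q=text:muenchen~0.7

It catches small spelling variants by edit distance, which papers over the
umlaut issue for now, but it can also produce false matches, so it is a
stopgap rather than a real fix.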


Walter Underwood wrote:
> I've done this. There are four cases for the tokens in the search
> index:
>
> 1. Tokens that are unique after stemming (this is good).
> 2. Tokens that are common after stemming (usually trademarks,
>    like LaserJet).
> 3. Tokens with collisions after stemming:
>    German "mit", "MIT" the university
>    German "Boot" (boat), English "boot" (a heavy shoe)
> 4. Tokens with collisions in the surface form:
>    Dutch "mobile" (plural of furniture), English "mobile"
>    German "die" (stemmed to "das"), English "die"
>
> You cannot fix every spurious match, but you can do OK with
> stemmed fields for each language and a raw (unstemmed surface
> token) field.
>
> I won't recommend weights, but you could have fields for
> text_en, text_de, and text_raw, for example.
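
To make that concrete, a minimal schema.xml sketch of such a setup could look
like this (the field and type names, and the source field "text", are
placeholders, not something Walter specified):

    <!-- one stemmed field per language, plus an unstemmed surface field -->
    <field name="text_en"  type="text_en"  indexed="true" stored="false"/>
    <field name="text_de"  type="text_de"  indexed="true" stored="false"/>
    <field name="text_raw" type="text_raw" indexed="true" stored="false"/>

    <!-- copy the source text into all three at index time -->
    <copyField source="text" dest="text_en"/>
    <copyField source="text" dest="text_de"/>
    <copyField source="text" dest="text_raw"/>

    <!-- the raw field only tokenizes and lowercases, no stemming -->
    <fieldType name="text_raw" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Queries can then be run against all three fields (for example via DisMax), so
a surface-form match is still possible when stemming causes a collision.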
>
> You really cannot automatically determine the language of a
> query, mostly because of proper nouns, especially trademarks.
> Identify the language of these queries:
>
> * Google
> * LaserJet
> * Obama
> * Las Vegas
> * Paris
>
> HTTP supports an Accept-Language header, but I have no idea
> how often that is sent. We honored that in Ultraseek, mostly
> because it was standard.
>
> Finally, if you are working with localization, please take the
> time to understand the difference between ISO language codes
> and ISO country codes.
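
As a concrete illustration of those last two points, a hypothetical search
request might carry a header like the one below, where "de" is the ISO 639
language code for German and "DE" is the ISO 3166 country code for Germany
(same letters, different code sets):

    GET /search?q=muenchen HTTP/1.1
    Accept-Language: de-DE,de;q=0.8,en;q=0.5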
>
> wunder
>
> On 1/28/09 4:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>> I'm not entirely sure about the fine points, but consider the
>> filters that are available that fold all the diacritics into their
>> low-ASCII equivalents. Perhaps using that filter at *both* index
>> and search time on the English index would do the trick.
>>
>> In your example, both would be 'munchen'. Straight English
>> would be unaffected by the filter, but any German words with
>> diacritics that crept in would be folded into their low-ASCII
>> "equivalents". This would also work at index time, just in case
>> you indexed English text that had some German words.
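
A rough sketch of what Erick describes, assuming the accent-folding filter
that shipped with Solr at the time (the type name is a placeholder):

    <fieldType name="text_folded" class="solr.TextField">
      <!-- a single <analyzer> applies at both index and query time -->
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- folds Latin-1 diacritics, e.g. ü -> u, so münchen indexes as munchen -->
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldType>

Note this folds ü to u, not to ue, so it makes münchen match munchen; matching
the muenchen spelling as well would still need something like the German2
stemmer or a synonym/character-mapping step.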
>>
>> NOTE: My experience is more on the Lucene side than the Solr
>> side, but I'm sure the filters are available.
>>
>> Best
>> Erick
>>
>> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev <j...@drun.net> wrote:
>>
>>> Hi,
>>> I currently have two indexes with Solr: one for the English version and
>>> one for the German version. They use the English and German2 snowball
>>> factories respectively.
>>> Right now, depending on which language the website is in, I query the
>>> corresponding index.
>>> There is a requirement, though, that content is found regardless of the
>>> language it is in.
>>> So, for example, a search for muenchen (which the German snowball
>>> factory correctly treats as münchen) should find it in the English index
>>> too. Right now it does not, as I suppose the English factory doesn't
>>> really care about umlauts.
>>>
>>> Any pointers are more than welcome. I am considering synonyms, but that
>>> would be rather heavy to create and maintain.
>>> Cheers,
>>> JD
>>>
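
For context, the per-language analyzers Julian mentions would typically be
declared along these lines (a minimal sketch; the type name is a placeholder):

    <fieldType name="text_de" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- German2 treats ue/ü (and ae/ä, oe/ö) as equivalent before
             stemming, which is why muenchen matches münchen here -->
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
      </analyzer>
    </fieldType>

The English version would be the same with language="English", which is
exactly why umlaut spellings fall through on that side.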
