You might try a German stemmer. English gets a small benefit from stemming,
maybe 5%. German is more heavily inflected than English, so it may get a
bigger improvement.

German search usually needs word breaking (compound splitting), so that
"Orgelmusik" can be split into "Orgel" and "Musik". To get that, you will
probably need a commercial stemmer.
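
To make the word breaking idea concrete, here is a minimal sketch in plain
Python. It is not what a commercial product does and it is not Solr
configuration. The function name split_compound, the min_part parameter, and
the tiny lexicon are all made up for illustration. It only splits a word when
every piece is found in the lexicon, and it ignores German linking elements
such as the "s" in "Arbeitsamt", which a real decompounder has to handle.

```python
def split_compound(word, lexicon, min_part=3):
    """Split `word` into lexicon entries, or return it whole if no split fits.

    `lexicon` is a hypothetical set of lowercase known words.
    `min_part` is the shortest piece we are willing to accept.
    """
    lower = word.lower()

    def helper(rest):
        # A known word needs no further splitting.
        if rest in lexicon:
            return [rest]
        # Try every split point that leaves both sides long enough.
        for i in range(min_part, len(rest) - min_part + 1):
            head, tail = rest[:i], rest[i:]
            if head in lexicon:
                parts = helper(tail)
                if parts:
                    return [head] + parts
        return None

    parts = helper(lower)
    return parts if parts else [lower]


lexicon = {"orgel", "musik"}
print(split_compound("Orgelmusik", lexicon))
print(split_compound("Haus", lexicon))
```

Indexing both the whole word and its parts would let a query for "Musik"
find a record that only contains "Orgelmusik".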

wunder

On 7/2/09 8:42 AM, "Michael Lackhoff" <mich...@lackhoff.de> wrote:

> On 02.07.2009 16:34 Walter Underwood wrote:
> 
>> First, don't use an English stemmer on German text. It will give some odd
>> results.
> 
> I know, but at the moment I only have the choice between no stemmer at
> all and one stemmer, and since more than half of the records are English
> (about 60% English, 30% German, some Italian, French and others) the
> results are not too bad.
> 
>> Are you using the same conversions on the index and query side?
> 
> Yes, index and query look exactly the same. That is what I don't
> understand. I am not complaining about a misbehaving stemmer, unless it
> already does something odd with the umlauts.
> 
>> The German stemmer might already handle "typewriter umlauts". If it doesn't,
>> use the pattern replace factory. You will also need to convert "ß" to "ss".
> 
> That is what I tried. And yes I also have a filter for "ß" to "ss". It
> just doesn't work as expected.
> 
>> You really do need separate fields for each language.
> 
> Eventually. But right now I have to get a small application ready very
> soon, and people don't find what they expect.
> 
>> Handling these characters is language-specific. The typewriter umlaut
>> conversion is wrong for English. It is correct, but rare, to see a diaeresis
>> in English when vowels are pronounced separately, like "coöperate". In
>> Swedish, it is not OK to convert "ö" to another letter or combination
>> of letters.
> 
> It is just for German users and at the moment it would be totally ok to
> have "coöperate" indexed as "cooeperate". I know it is wrong and it will
> be fixed, but given the tight schedule all I want at the moment is the
> combination of some stemming (perhaps 70% right or more) and "typewriter
> umlauts" (perhaps 90% correct, you gave examples for the missing 10%).
> 
> Do I have any chance?
> 
> -Michael
> 
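
The "typewriter umlaut" conversion discussed in the quoted message can be
sketched in plain Python as below. This is only an illustration of the
mapping, not the Solr pattern replace configuration. The names TYPEWRITER and
fold_german are made up for this example. The point it shows is that the same
folding has to run on both the indexed text and the query text, otherwise
"Müller" and "Mueller" never meet.

```python
# Hypothetical mapping table: umlauts to their two-letter spellings, ß to ss.
TYPEWRITER = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def fold_german(text):
    """Apply the typewriter-umlaut mapping, then lowercase."""
    return text.translate(TYPEWRITER).lower()


# Use the identical function on the index side and the query side.
indexed = fold_german("Müller")
query = fold_german("Mueller")
print(indexed, query, indexed == query)
print(fold_german("Straße"))
```

As noted in the thread, this mapping is only appropriate for German text. It
turns "coöperate" into "cooeperate" and is wrong for Swedish.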
