On 02.07.2009 16:34 Walter Underwood wrote:

> First, don't use an English stemmer on German text. It will give some odd
> results.

I know, but at the moment my only choice is between no stemmer at all
and a single stemmer. Since more than half of the records are English
(about 60% English, 30% German, plus some Italian, French and others),
the results are not too bad.

> Are you using the same conversions on the index and query side?

Yes, the index-side and query-side conversions are exactly the same.
That is what I don't understand. I am not complaining about a
misbehaving stemmer, unless it already does something odd with the
umlauts.

> The German stemmer might already handle "typewriter umlauts". If it doesn't,
> use the pattern replace factory. You will also need to convert "ß" to "ss".

That is what I tried, and yes, I also have a filter that converts "ß"
to "ss". It just doesn't work as expected.
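
To make clear what I am after, here is a minimal sketch in Python. It is
not my Solr configuration; `fold_umlauts` and `normalize` are names made
up for illustration. It assumes the input may arrive either precomposed
or as a base letter plus a combining diaeresis, which is one way index
and query terms could differ even with identical filters. I have not
confirmed that this is the cause here.

```python
import unicodedata

# Illustrative mapping for "typewriter umlauts"; German-specific by design.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def fold_umlauts(text: str) -> str:
    """Replace each umlaut and "ß" with its typewriter spelling."""
    # Compose "u" + combining diaeresis into "ü" first, so both
    # encodings of the same letter hit the mapping below.
    composed = unicodedata.normalize("NFC", text)
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in composed)

def normalize(text: str) -> str:
    """Apply the same chain to indexed terms and to query terms."""
    return fold_umlauts(text).lower()

if __name__ == "__main__":
    for word in ["Müller", "Mueller", "Straße", "coöperate"]:
        print(word, "->", normalize(word))
```

With this, "Müller" and "Mueller" end up as the same term, which is the
behaviour I expect from the index and the query side alike.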

> You really do need separate fields for each language.

Eventually. But right now I have to finish a small application very
soon, and people don't find what they expect.

> Handling these characters is language-specific. The typewriter umlaut
> conversion is wrong for English. It is correct, but rare, to see a diaeresis
> in English when vowels are pronounced separately, like "coöperate". In
> Swedish, it is not OK to convert "ö" to another letter or combination
> of letters.

It is just for German users, and at the moment it would be totally OK
to have "coöperate" indexed as "cooeperate". I know it is wrong and it
will be fixed, but given the tight schedule all I want at the moment is
a combination of some stemming (perhaps 70% right or more) and
"typewriter umlauts" (perhaps 90% correct; you gave examples for the
missing 10%).

Do I have any chance?

-Michael
