You might try a German stemmer. English gets a small benefit from stemming, maybe 5%. German is more heavily inflected than English, so it may get a bigger improvement.
German search usually needs wordbreaking (compound splitting), so that Orgelmusik can be split into Orgel and Musik. To get that, you will probably need a commercial stemmer.

wunder

On 7/2/09 8:42 AM, "Michael Lackhoff" <mich...@lackhoff.de> wrote:

> On 02.07.2009 16:34 Walter Underwood wrote:
>
>> First, don't use an English stemmer on German text. It will give some odd
>> results.
>
> I know but at the moment I only have the choice between no stemmer at
> all and one stemmer and since more than half of the records are English
> (about 60% English, 30% German, some Italian, French and others) the
> results are not too bad.
>
>> Are you using the same conversions on the index and query side?
>
> Yes, index and query look exactly the same. That is what I don't
> understand. I am not complaining about a misbehaving stemmer, unless it
> does already something odd with the umlauts.
>
>> The German stemmer might already handle "typewriter umlauts". If it doesn't,
>> use the pattern replace factory. You will also need to convert "ß" to "ss".
>
> That is what I tried. And yes I also have a filter for "ß" to "ss". It
> just doesn't work as expected.
>
>> You really do need separate fields for each language.
>
> Eventually. But now I have to get ready really soon with a small
> application and people don't find what they expect.
>
>> Handling these characters is language-specific. The typewriter umlaut
>> conversion is wrong for English. It is correct, but rare, to see a diaeresis
>> in English when vowels are pronounced separately, like "coöperate". In
>> Swedish, it is not OK to convert "ö" to another letter or combination
>> of letters.
>
> It is just for German users and at the moment it would be totally ok to
> have "coöperate" indexed as "cooeperate", I know it is wrong and it will
> be fixed but given the tight schedule all I want at the moment is the
> combination of some stemming (perhaps 70% right or more) and "typewriter
> umlauts" (perhaps 90% correct, you gave examples for the missing 10%).
>
> Do I have any chance?
>
> -Michael
>
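
For reference, here is a rough sketch of the "typewriter umlaut" folding discussed in this thread, written as standalone Python rather than as Solr configuration. The names `TYPEWRITER_MAP`, `fold_typewriter_umlauts` and `normalize` are made up for illustration and are not part of Solr. It assumes the same function is applied to both the indexed text and the query, and it composes the input to NFC first, on the assumption that some records may store "ö" as "o" plus a combining diaeresis.

```python
import unicodedata

# Hypothetical mapping for German "typewriter umlauts" (not a Solr API).
TYPEWRITER_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_typewriter_umlauts(text: str) -> str:
    """Replace umlauts and sharp s with their ASCII typewriter spellings."""
    # Compose first so "o" + U+0308 becomes the single character "ö".
    composed = unicodedata.normalize("NFC", text)
    return "".join(TYPEWRITER_MAP.get(ch, ch) for ch in composed)


def normalize(text: str) -> str:
    """Fold, then lowercase. Use this for both index terms and query terms."""
    return fold_typewriter_umlauts(text).lower()


if __name__ == "__main__":
    for word in ["Müller", "Straße", "Orgelmusik", "coöperate"]:
        print(word, "->", normalize(word))
```

As noted in the thread, this folding is only appropriate for German text: it turns the English "coöperate" into "cooeperate" and would be wrong for Swedish. If index and query still disagree after a conversion like this, one thing worth checking is whether the two sides see the same Unicode form of the character before the replacement runs.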