On 02.07.2009 16:34 Walter Underwood wrote:
> First, don't use an English stemmer on German text. It will give some odd
> results.

I know, but at the moment my only choice is between no stemmer at all and a single stemmer. Since more than half of the records are English (about 60% English, 30% German, plus some Italian, French and others), the results are not too bad.

> Are you using the same conversions on the index and query side?

Yes, index and query look exactly the same. That is what I don't understand. I am not complaining about a misbehaving stemmer, unless it already does something odd with the umlauts.

> The German stemmer might already handle "typewriter umlauts". If it doesn't,
> use the pattern replace factory. You will also need to convert "ß" to "ss".

That is what I tried. And yes, I also have a filter for "ß" to "ss". It just doesn't work as expected.

> You really do need separate fields for each language.

Eventually. But right now I have to get a small application ready very soon, and people don't find what they expect.

> Handling these characters is language-specific. The typewriter umlaut
> conversion is wrong for English. It is correct, but rare, to see a diaeresis
> in English when vowels are pronounced separately, like "coöperate". In
> Swedish, it is not OK to convert "ö" to another letter or combination
> of letters.

It is just for German users, and at the moment it would be totally OK to have "coöperate" indexed as "cooeperate". I know it is wrong and it will be fixed, but given the tight schedule, all I want at the moment is the combination of some stemming (perhaps 70% right or more) and "typewriter umlauts" (perhaps 90% correct; you gave examples for the missing 10%). Do I have any chance?

-Michael
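
To make the intended behaviour concrete, here is a minimal Python sketch of the "typewriter umlaut" folding described above. It is not Solr code and not the actual filter chain: the names `TYPEWRITER_MAP`, `fold_typewriter_umlauts` and `analyze` are hypothetical, whitespace splitting stands in for the real tokenizer, and the stemmer is left out. The sketch assumes the folding runs after lowercasing and before any stemming, and that the same function is used for both indexing and querying.

```python
# Illustrative sketch only; all names here are hypothetical, not Solr APIs.

# "Typewriter" replacements for German umlauts and sharp s.
TYPEWRITER_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_typewriter_umlauts(text: str) -> str:
    """Replace each umlaut or sharp s with its two-letter ASCII spelling."""
    return "".join(TYPEWRITER_MAP.get(ch, ch) for ch in text)


def analyze(text: str) -> list[str]:
    """Lowercase, split on whitespace, then fold umlauts.

    A stemmer (omitted here) would run after this step, so that it
    sees the same folded form at index time and at query time.
    """
    return [fold_typewriter_umlauts(token) for token in text.lower().split()]


if __name__ == "__main__":
    # The same function serves as index-side and query-side analysis.
    print(analyze("Müller Straße coöperate"))
    print(analyze("Mueller Strasse"))
```

With this ordering, a query typed as "Mueller" and a record containing "Müller" both reduce to the same token before the stemmer runs, which is the behaviour I am after.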