Hi Tom,

On Wed, 2007-10-10 at 12:28 +0200, Thomas Traeger wrote:
> in short: use stemming
ok :)

> Try the SnowballPorterFilterFactory with German2 as language attribute
> first and use synonyms for combined words i.e. "Herrenhose" => "Herren",
> "Hose".
so you use a combined approach?

> By using stemming you will maybe have some "interesting" results, but it
> is much better living with them than having no or much less results ;o)
Do you have an example of what "interesting" results I can expect, just to
get an idea?

> Find more infos on the Snowball stemming algorithms here:
>
> http://snowball.tartarus.org/
Thanx! I also had a look at this site already, but what is missing is a
demo where one can see what's happening. I think I'll play a little with
stemming to get a feeling for this.

> Also have a look at the StopFilterFactory, here is a sample stopwordlist
> for the german language:
>
> http://snowball.tartarus.org/algorithms/german/stop.txt
Our application handles products, do you think such stopwords are useful
in this scenario also? I wouldn't expect a user to search for "keine hose"
or s.th. like this :)

Thanx && cheers,
Martin

> Good luck,
>
> Tom
>
> Martin Grotzke wrote:
> > Hello,
> >
> > with our application we have the issue that we get different
> > results for singular and plural searches (german language).
> >
> > E.g. for "hose" we get 1,000 documents back, but for "hosen"
> > we get 10,000 docs. The same applies to "t-shirt" or "t-shirts",
> > or e.g. "hut" and "hüte" - lots of cases :)
> >
> > This is absolutely correct according to the schema.xml, as right
> > now we do not have any stemming or synonyms included.
> >
> > Now we want to have similar search results for these singular/plural
> > searches. I'm thinking of a solution for this, and want to ask what
> > your experiences with this are.
> >
> > Basically I see two options: stemming and the usage of synonyms. Are
> > there others?
> >
> > My concern with stemming is that it might produce unexpected results,
> > so that docs are found that do not match the query from the user's point
> > of view. I assume that this needs a lot of testing with different data.
> >
> > The issue with synonyms is that we would have to create a file
> > containing all synonyms, so we would have to figure out all cases, in
> > contrast to a solution that is based on an algorithm.
> > The advantage of this approach is IMHO that it is very predictable
> > which results will be returned for a certain query.
> >
> > Some background information:
> > Our documents contain products (id, name, brand, category, producttype,
> > description, color etc). The singular/plural issue basically applies to
> > the fields name, category and producttype, so we would like to restrict
> > the solution to these fields.
> >
> > Do you have suggestions how to handle this?
> >
> > Thanx in advance for sharing your experiences,
> > cheers,
> > Martin

-- 
Martin Grotzke
http://www.javakaffee.de/blog/
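
For reference, the combined approach discussed above (synonyms for compound words, an optional stopword list, then Snowball stemming with German2) might look roughly like the following schema.xml fragment. This is only a sketch under assumptions, not the configuration anyone in this thread actually used: the field type name "text_de", the choice of tokenizer, the filter order and the file names "synonyms.txt" and "stopwords.txt" are all made up for illustration. Only the factory names and the German2 language attribute come from the thread.

```xml
<!-- Hypothetical field type; the name "text_de" is an assumption. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizer choice is an assumption; the thread does not name one. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt (assumed file name) would hold compound-word rules such as:
         Herrenhose => Herren, Hose -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- Optional: stopwords.txt (assumed file name) could be filled from the
         Snowball German stop list linked above. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Stemming as suggested by Tom. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
```

To restrict the behaviour to name, category and producttype, only those fields would be declared with this type, leaving the other fields on their existing types. The analysis page of the Solr admin interface can then be used to see what a given term is reduced to.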