Hi Tanguy,

I looked at the code, and I can see where the problem you describe is happening.
I think it's a bug: if numbers are search terms, "stemming" them by compressing repeated digits makes little sense.

Could you file a bug in JIRA? Please include the examples you gave in your earlier email when you describe the problem on the issue.

I checked all of the other *LightStemmer classes on trunk, and FrenchLightStemmer is the only one of them that does this arbitrary duplicate sequence compression. (FinnishLightStemmer does repetition compression too, but restricts the operation to chars 'k', 'p', and 't'.)

Thanks,
Steve

-----Original Message-----
From: Tanguy Moal [mailto:tanguy.m...@gmail.com]
Sent: Wednesday, May 16, 2012 8:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

Any ideas, anyone?

I think this is important, since it could produce weird results on collections with numbers mixed in text.

From my understanding, there are a few options to address the issue:

1) Make *LightStemmer token-type aware so that it doesn't try to stem things that are not text (alpha/alphanum, whatever :)). This pulls the fix into the faulty component, but makes it dependent on the StandardTokenizer, which may not be what people want...

2) Enable StandardTokenizer to mark NUM tokens with the keyword attribute so that NUM tokens are not stemmed, by contract. (This could be a configuration flag, markNumbersWithKeywordAttribute=true, false by default.)

3) Use a custom processor to mark NUM tokens as keywords. (This is the solution I chose, since it doesn't require modifying lucene/solr's code base; it's a very simple contrib module.)

I chose solution #3. Maybe #2 is the way to go, since most people using FrenchLightStemFilterFactory will also want to use StandardTokenizer...
Any advice is welcome

--
Tanguy
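
For readers following the thread, here is a minimal, self-contained sketch of the behaviour under discussion. It is NOT the Lucene source: the class and method names are hypothetical, the sample tokens are made up for illustration (the examples from the earlier email are not reproduced in this thread), and the real stemmer performs other steps as well. It assumes only what the thread states: tokens longer than 4 characters have repeated adjacent characters collapsed. It then shows two possible guards, one that collapses letters only, and one that skips tokens flagged as keywords, roughly in the spirit of options 2 and 3.

```java
// Simplified sketch, not the actual FrenchLightStemmer code.
// All names below are hypothetical and chosen for illustration only.
final class RepeatCompressionSketch {

    /**
     * Collapses runs of identical adjacent characters in tokens longer than
     * 4 characters. When lettersOnly is true, only runs of letters are
     * collapsed, so repeated digits are left untouched.
     */
    static String compressRepeats(String token, boolean lettersOnly) {
        if (token.length() <= 4) {
            return token; // short tokens are not touched
        }
        StringBuilder out = new StringBuilder();
        char previous = token.charAt(0);
        out.append(previous);
        for (int i = 1; i < token.length(); i++) {
            char current = token.charAt(i);
            boolean isRepeat = current == previous;
            boolean mayCollapse = !lettersOnly || Character.isLetter(current);
            if (!(isRepeat && mayCollapse)) {
                out.append(current);
            }
            previous = current;
        }
        return out.toString();
    }

    /** Rough stand-in for a NUM token type: non-empty and digits only. */
    static boolean looksNumeric(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isDigit);
    }

    /** Keyword-style guard: tokens flagged as keywords are returned unchanged. */
    static String stemUnlessKeyword(String token, boolean isKeyword) {
        return isKeyword ? token : compressRepeats(token, false);
    }

    public static void main(String[] args) {
        // Unrestricted compression mangles numbers.
        System.out.println(compressRepeats("22222", false));   // prints "2"
        System.out.println(compressRepeats("2012000", false)); // prints "20120"

        // Letters-only compression leaves digits alone.
        System.out.println(compressRepeats("2012000", true));  // prints "2012000"
        System.out.println(compressRepeats("appelle", true));  // prints "apele"

        // Keyword guard: numeric tokens bypass the stemmer entirely.
        String token = "2012000";
        System.out.println(stemUnlessKeyword(token, looksNumeric(token))); // prints "2012000"
    }
}
```

The letters-only variant keeps the fix inside the stemmer without depending on any particular tokenizer, while the keyword guard leaves the stemmer alone and moves the decision upstream. Which trade-off is preferable is exactly the question raised above.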