Hi Tanguy,

I looked at the code, and I can see where the problem you describe is happening.

I think it's a bug: if numbers are search terms, "stemming" them by compressing 
repeated digits makes little sense.

Could you file a bug in JIRA?  Please include the examples you gave in your 
earlier email when you describe the problem on the issue.

I checked all of the other *LightStemmer classes on trunk, and FrenchLightStemmer is 
the only one of them that does this arbitrary duplicate sequence compression.  
(FinnishLightStemmer does repetition compression too, but restricts the 
operation to chars 'k', 'p', and 't'.)
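
To make the behavior concrete, here is a minimal, self-contained sketch of 
what "compressing repeated characters" does to a numeric token.  It is not the 
FrenchLightStemmer source: the class and method names are made up for 
illustration, the length threshold (longer than 4 characters) is taken from the 
subject line of this thread, and the letters-only variant is just one possible 
guard, not necessarily the fix that should go into Lucene.

```java
// Simplified illustration only; this is NOT the Lucene FrenchLightStemmer code.
// All names here (RepeatCollapseSketch, collapseAll, ...) are hypothetical.
final class RepeatCollapseSketch {

    // Collapse every run of identical adjacent characters, digits included.
    static String collapseAll(String token) {
        return collapse(token, false);
    }

    // Collapse runs of identical adjacent letters only; digits are left alone.
    static String collapseLettersOnly(String token) {
        return collapse(token, true);
    }

    private static String collapse(String token, boolean lettersOnly) {
        // Tokens of 4 characters or fewer are returned unchanged.
        if (token.length() <= 4) {
            return token;
        }
        StringBuilder out = new StringBuilder();
        out.append(token.charAt(0));
        for (int i = 1; i < token.length(); i++) {
            char current = token.charAt(i);
            boolean duplicate = current == token.charAt(i - 1);
            boolean eligible = !lettersOnly || Character.isLetter(current);
            // Drop the character only when it repeats its predecessor
            // and this variant is allowed to compress it.
            if (!(duplicate && eligible)) {
                out.append(current);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseAll("10000"));         // digits get compressed
        System.out.println(collapseLettersOnly("10000")); // digits are preserved
    }
}
```

With unrestricted compression, distinct numbers such as 10000 and 100000 would 
end up as the same term, which is why this looks like a bug for numeric search 
terms.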

Thanks,
Steve

-----Original Message-----
From: Tanguy Moal [mailto:tanguy.m...@gmail.com] 
Sent: Wednesday, May 16, 2012 8:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 
characters and having repeated characters in it

Any ideas, anyone?

I think this is important since this could produce weird results on collections 
with numbers mixed in text.

From my understanding, there are a few options to address the issue:
1) Make *LightStemmer token-type aware so that it doesn't try to stem things 
that are not text (alpha/alphanum, whatever :)): this puts the fix in the faulty 
component but makes it dependent on StandardTokenizer, which may not be what 
people want...
2) Enable StandardTokenizer to mark NUM tokens with the keyword attribute so 
that NUM tokens are not stemmed by contract (this could be a configuration flag 
markNumbersWithKeywordAttribute=true, false by default)
3) Use a custom processor to mark NUM tokens as keywords (the solution I chose, 
since it doesn't require modifying lucene/solr's code base; it's a very simple 
contrib module)

I chose solution #3.
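
The idea behind option 3 can be sketched in plain Java, under stated 
assumptions.  This is not the contrib module itself and it does not use the 
Lucene classes: Token, markNumbersAsKeywords and stemUnlessKeyword are 
hypothetical names, and the "NUM" type string simply stands in for whatever 
type the tokenizer assigns to numbers.  It only shows the contract: a token 
flagged as a keyword is passed through the stemming step untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "mark NUM tokens as keywords so they are not stemmed".
// Hypothetical names throughout; this does not use or mirror the Lucene API.
final class KeywordMarkingSketch {

    static final class Token {
        final String text;
        final String type;   // e.g. "NUM" or "ALPHANUM" (assumed labels)
        boolean keyword;     // true means "do not stem this token"

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // The "custom processor" step: flag every numeric token as a keyword.
    static void markNumbersAsKeywords(List<Token> tokens) {
        for (Token token : tokens) {
            if ("NUM".equals(token.type)) {
                token.keyword = true;
            }
        }
    }

    // Stand-in for a stemmer that honors the keyword flag.
    static String stemUnlessKeyword(Token token) {
        return token.keyword ? token.text : collapseRepeats(token.text);
    }

    // Toy "stemming": collapse adjacent duplicate characters in long tokens.
    static String collapseRepeats(String text) {
        if (text.length() <= 4) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        out.append(text.charAt(0));
        for (int i = 1; i < text.length(); i++) {
            if (text.charAt(i) != text.charAt(i - 1)) {
                out.append(text.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("10000", "NUM"));
        tokens.add(new Token("passse", "ALPHANUM"));
        markNumbersAsKeywords(tokens);
        for (Token token : tokens) {
            System.out.println(token.text + " -> " + stemUnlessKeyword(token));
        }
    }
}
```

The marking step has to run before the stemming step; otherwise the flag is 
set too late to have any effect.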

Maybe #2 is the way to go since most people using FrenchLightStemFilterFactory 
will also want to use StandardTokenizer...

Any advice is welcome.

--
Tanguy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/FrenchLightStemFilterFactory-normalizing-tokens-longer-than-4-characters-and-having-repeated-charactt-tp3974148p3984080.html
Sent from the Solr - User mailing list archive at Nabble.com.
