Dear list,
I recently found out that the FrenchLightStemFilterFactory performs
some interesting, undocumented normalization on tokens.
There is a norm() helper, called for each token produced, that
performs, among other things, deletion of repeated characters, but only
for tokens with more than 4 characters. Examples:
aabb => aabb
aabbcc => abc
aaaaaa => a
aaaaab => ab
1 => 1
11 => 11
111 => 111
1111 => 1111
11111 => 1
111111 => 1
12355 => 1235
121221 => 12121
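
For illustration, the examples above can be reproduced with the
following minimal Python sketch. It is not the Lucene code:
collapse_repeats is a hypothetical name, and it models only the
repeated-character deletion seen in the examples, not whatever else
norm() does.

```python
from itertools import groupby


def collapse_repeats(token: str) -> str:
    """Collapse each run of identical characters to a single character,
    but only for tokens longer than 4 characters (the observed threshold)."""
    if len(token) <= 4:
        # Short tokens are left untouched, e.g. "aabb" or "1111".
        return token
    # groupby yields one (character, run) pair per run of equal characters.
    return "".join(ch for ch, _run in groupby(token))


if __name__ == "__main__":
    for sample in ["aabb", "aabbcc", "aaaaab", "1111", "11111", "12355", "121221"]:
        print(sample, "=>", collapse_repeats(sample))
```

Note that the sketch makes no distinction between letters and digits,
which is exactly the behaviour I am asking about.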
Although this might be useful for real words, where it may correct
common typographic errors, I am not sure it is correct to apply it to
numbers.
Can anyone confirm that this is the expected behaviour?
I use a StandardTokenizer, which marks number tokens as NUM. Should
FrenchLightStemmer use this information to avoid unnecessary stemming?
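
To make the question concrete, here is a minimal Python sketch of such
a guard. The names normalize and NUM_TYPE are hypothetical, the "<NUM>"
label is an assumption about how the tokenizer reports the type, and
collapse_repeats only mimics the repeated-character deletion observed
above. None of this is Lucene's actual code.

```python
from itertools import groupby

# Assumed type label for numeric tokens; adjust to whatever the tokenizer emits.
NUM_TYPE = "<NUM>"


def collapse_repeats(token: str) -> str:
    """Mimic the observed behaviour: collapse runs of identical characters
    in tokens longer than 4 characters."""
    if len(token) <= 4:
        return token
    return "".join(ch for ch, _run in groupby(token))


def normalize(token: str, token_type: str) -> str:
    """Skip normalization for tokens the tokenizer marked as numbers."""
    if token_type == NUM_TYPE:
        # Numbers pass through unchanged, so "11111" stays "11111".
        return token
    return collapse_repeats(token)
```

With this guard, normalize("11111", "<NUM>") leaves the number intact,
while a word-like token such as "aabbcc" is still reduced to "abc".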
Thanks in advance for your help.
--
Tanguy