FrenchLightStemFilterFactory : normalizing tokens made of a single character repeated more than 5 times

Tanguy Moal Wed, 09 May 2012 07:23:13 -0700

Dear list,

I recently found out that the FrenchLightStemFilterFactory performs some interesting, undocumented normalization on tokens.

There's a norm() helper called for each produced token that performs, amongst other things, deletions on repeated characters, but only for tokens with more than 4 characters. Examples:


aabb   => aabb
aabbcc => abc
aaaaaa => a
aaaaab => ab
1      => 1
11     => 11
111    => 111
1111   => 1111
11111  => 1
111111 => 1
12355  => 1235
121221 => 12121
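
To make the rule concrete, here is a minimal Python sketch of the behaviour I am observing. It is not the Lucene source: collapse_repeats and min_len are hypothetical names of my own, and the sketch only models the deletion of adjacent repeated characters for tokens longer than 4 characters, ignoring whatever else norm() does.

```python
def collapse_repeats(token: str, min_len: int = 5) -> str:
    """Collapse runs of identical adjacent characters.

    Tokens shorter than min_len are returned untouched, which models
    the "more than 4 characters" condition described above.
    """
    if len(token) < min_len:
        return token
    out = [token[0]]
    for ch in token[1:]:
        # Keep a character only if it differs from the one just kept.
        if ch != out[-1]:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    examples = ["aabb", "aabbcc", "aaaaaa", "aaaaab", "1", "11", "111",
                "1111", "11111", "111111", "12355", "121221"]
    for tok in examples:
        print(f"{tok:<6} => {collapse_repeats(tok)}")
```

Run on the tokens listed above, this reproduces the same mapping, digits included, since nothing in the rule distinguishes letters from numbers.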

Although it might be interesting for real words in order to hopefully correct common typographic errors, I'm not so sure of the correctness of doing so on numbers.


Can anyone confirm this is the intended behaviour?

I use a StandardTokenizer, which marks number tokens as NUM. Should FrenchLightStemmer use this information to avoid unnecessary stemming?
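
As a rough illustration of what I have in mind, here is a small Python sketch, not Lucene code. Tokens are modelled as (term, type) pairs, stem_unless_numeric and toy_stem are hypothetical names, and I am assuming the tokenizer's type string for numbers is "<NUM>". The toy stemmer only mimics the repeated-character deletion described above.

```python
from itertools import groupby
from typing import Callable, FrozenSet, Iterable, Iterator, Tuple

# (term text, token type): a stand-in for the term and type of a token.
Token = Tuple[str, str]


def toy_stem(term: str) -> str:
    """Stand-in stemmer: collapse adjacent repeats in terms longer than 4 chars."""
    if len(term) <= 4:
        return term
    return "".join(ch for ch, _ in groupby(term))


def stem_unless_numeric(
    tokens: Iterable[Token],
    stem: Callable[[str], str],
    skip_types: FrozenSet[str] = frozenset({"<NUM>"}),  # assumed type string
) -> Iterator[Token]:
    """Apply stem() to every token except those whose type is in skip_types."""
    for term, token_type in tokens:
        if token_type in skip_types:
            yield term, token_type          # numbers pass through unchanged
        else:
            yield stem(term), token_type    # everything else is stemmed


if __name__ == "__main__":
    stream = [("11111", "<NUM>"), ("aabbcc", "<ALPHANUM>")]
    for term, token_type in stem_unless_numeric(stream, toy_stem):
        print(term, token_type)
```

With such a check, 11111 would be left intact while aabbcc would still be normalized.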


Thanks in advance for your help.

--
Tanguy
