StandardTokenizer behaviour with apostrophe and colon

Vincenzo D'Amore Wed, 01 Jun 2016 16:14:37 -0700

Hi all,

StandardTokenizer doesn't split text at an apostrophe (punctuation mark
' ) or at a colon (punctuation mark : ).


Just to be clear: according to the documentation, all punctuation marks are
delimiters, with an exception for periods (dots), so I would expect an
Italian expression like "nell'aria" to be split into two words, "nell" and
"aria".

For now I have worked around the problem using a WordDelimiterFilterFactory.
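
To make the difference concrete, here is a minimal sketch in plain Python
(not Lucene or Solr code). The regular expressions are only a rough model of
the behaviour described in this message, and the function names
observed_tokens, expected_tokens and split_inner_punctuation are hypothetical
helpers invented for illustration. The last one merely stands in for the
word-delimiter workaround; it does not reproduce what the real filter does.

```python
import re

# Rough model of the reported behaviour: word characters joined by a single
# apostrophe or colon stay together in one token.
OBSERVED = re.compile(r"\w+(?:[':]\w+)*")

# Rough model of the expected behaviour: every punctuation mark delimits.
EXPECTED = re.compile(r"\w+")


def observed_tokens(text):
    """Tokens as reported: apostrophe/colon between letters do not split."""
    return OBSERVED.findall(text)


def expected_tokens(text):
    """Tokens as expected from the documentation's wording."""
    return EXPECTED.findall(text)


def split_inner_punctuation(tokens):
    """Hypothetical post-processing step: split tokens on ' and :."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[':]", token) if p)
    return parts


if __name__ == "__main__":
    text = "nell'aria"
    print(observed_tokens(text))
    print(expected_tokens(text))
    print(split_inner_punctuation(observed_tokens(text)))
```

With the input "nell'aria", the first model keeps a single token, while the
second model and the post-processing step both yield "nell" and "aria".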

Is this a bug or undocumented behaviour? In any case, what should I do next?

Best regards,
Vincenzo


-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251
