Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?

Alexandre Rafalovitch Wed, 08 Oct 2014 07:13:00 -0700

Is this a suggestion for a JIRA ticket? Or a question on how to solve
it? If the latter, you could probably stick a RegEx replacement in the
UpdateRequestProcessor chain and be done with it.
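
To make the RegEx idea concrete, here is a rough sketch in plain Java of the
substitution itself. It is not wired into Solr; the class and method names
are made up for illustration, and it assumes the bad input shows up either
as a bare U+00A0 or as U+00C2 followed by U+00A0 (the bytes C2 A0 decoded as
ISO-8859-1). In a real setup the same pattern could probably go into a
RegexReplaceProcessorFactory entry in the update chain instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr or Lucene.
final class NbspNormalizer {
    // Matches U+00A0 (NO-BREAK SPACE), optionally preceded by U+00C2,
    // which is how the UTF-8 bytes C2 A0 appear when mis-decoded as
    // ISO-8859-1. Assumption: a genuine U+00C2 directly before a
    // no-break space does not occur in the data, since it would be
    // swallowed as well.
    private static final Pattern NBSP = Pattern.compile("\\u00C2?\\u00A0");

    private NbspNormalizer() {}

    // Replaces every match with a plain ASCII space so that a
    // whitespace-based tokenizer can split on it.
    static String normalize(String input) {
        return NBSP.matcher(input).replaceAll(" ");
    }
}
```

With this, both "een", U+00A0, "abonnement" and "een", U+00C2, U+00A0,
"abonnement" come out as "een abonnement".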


As to why? I would look for the rest of the MSWord-generated
artifacts, such as "smart" quotes, extra-long dashes, etc.
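
If those other artifacts do turn up, the same kind of cleanup can cover them.
A minimal sketch, again in plain Java with made-up names; the set of
characters and their ASCII replacements is my assumption about what typically
comes out of MSWord, not an exhaustive list.

```java
// Hypothetical helper, for illustration only; not part of Solr or Lucene.
final class WordArtifactCleaner {
    private WordArtifactCleaner() {}

    // Maps a handful of typographic characters to ASCII equivalents and
    // copies everything else through unchanged.
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u2018': // left single quotation mark
                case '\u2019': // right single quotation mark
                    out.append('\'');
                    break;
                case '\u201C': // left double quotation mark
                case '\u201D': // right double quotation mark
                    out.append('"');
                    break;
                case '\u2013': // en dash
                case '\u2014': // em dash
                    out.append('-');
                    break;
                case '\u2026': // horizontal ellipsis
                    out.append("...");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```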

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 8 October 2014 09:59, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> For some crazy reason, some users manage to replace a perfectly normal
> space with a badly encoded non-breaking space. Properly URL-encoded, this
> becomes %c2a0, and depending on the encoding you use to view it, you
> probably see Â followed by a space. For example:
>
> Because c2a0 is not considered whitespace by the Java Character class
> (indeed, it is not real whitespace; that would be 00a0), the
> WhitespaceTokenizer won't split on it. The WordDelimiterFilter still does,
> which somewhat mitigates the problem, as it becomes:
>
> HTMLSCF een abonnement
> WT een abonnement
> WDF een eenabonnement abonnement
>
> Should the WhitespaceTokenizer not include this weird edge case?
>
> Cheers,
> Markus
