The source code uses the Java Character.isWhitespace method, which
specifically excludes the non-breaking space characters.
The Javadoc contract for WhitespaceTokenizer is too vague, especially since
Unicode has so many... subtleties.
Personally, I'd go along with treating non-breaking white space as white
space here.
And update the Lucene Javadoc contract to be more explicit.
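To make the distinction concrete, here is a minimal standalone demo (not the Lucene source itself) of how java.lang.Character classifies U+00A0 — the Javadoc for isWhitespace explicitly excludes non-breaking spaces, while isSpaceChar follows the Unicode space category:

```java
public class NbspDemo {
    public static void main(String[] args) {
        char nbsp = '\u00A0'; // NO-BREAK SPACE; UTF-8 byte sequence C2 A0

        // isWhitespace deliberately excludes non-breaking spaces
        // (U+00A0, U+2007, U+202F), so WhitespaceTokenizer won't break on them.
        System.out.println(Character.isWhitespace(nbsp)); // false
        System.out.println(Character.isWhitespace(' '));  // true

        // isSpaceChar follows the Unicode Zs category, so it does count U+00A0.
        System.out.println(Character.isSpaceChar(nbsp));  // true
    }
}
```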
-- Jack Krupansky
-----Original Message-----
From: Markus Jelsma
Sent: Wednesday, October 8, 2014 10:16 AM
To: solr-user@lucene.apache.org ; solr-user
Subject: RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?
Alexandre - I am sorry if I was not clear: this is about queries, this all
happens at query time. Yes, we can do the substitution with the regex
replace filter, but I would propose this weird exception be added to
WhitespaceTokenizer so Lucene deals with this by itself.
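To illustrate the behavior in question, here is a hypothetical sketch (not the actual WhitespaceTokenizer code) of splitting on Character.isWhitespace, which is effectively what the tokenizer does - a string containing U+00A0 instead of a space survives as a single token:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Split on anything Character.isWhitespace accepts; U+00A0 is not among them.
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                if (cur.length() > 0) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("een abonnement").size());      // 2 tokens
        System.out.println(split("een\u00A0abonnement").size()); // 1 token
    }
}
```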
Markus
-----Original message-----
From:Alexandre Rafalovitch <arafa...@gmail.com>
Sent: Wednesday 8th October 2014 16:12
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
Is this a suggestion for a JIRA ticket? Or a question on how to solve
it? If the latter, you could probably stick a RegEx replacement in the
UpdateRequestProcessor chain and be done with it.
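As an alternative that covers query time as well, here is a hypothetical fieldType sketch (the name "text_nbsp" is made up for illustration) that normalizes U+00A0 to a regular space with a char filter before the whitespace tokenizer runs:

```xml
<!-- Sketch only: replace non-breaking spaces before tokenization -->
<fieldType name="text_nbsp" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\u00A0" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Since the analyzer applies at both index and query time unless split into separate `<analyzer type="index">`/`<analyzer type="query">` sections, this also handles the query-time case.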
As to why? I would look for the rest of the MSWord-generated
artifacts, such as "smart" quotes, extra-long dashes, etc.
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
On 8 October 2014 09:59, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> For some crazy reason, some users somehow manage to substitute a
> perfectly normal space with a badly encoded non-breaking space; properly
> URL encoded this then becomes %C2%A0, and depending on the encoding you
> use to view it you probably see Â followed by a space. For example:
>
> Because c2 a0 is just the UTF-8 encoding of U+00A0 (the actual
> non-breaking space character), and U+00A0 is not considered whitespace
> by the Java Character class, the WhitespaceTokenizer won't split on it,
> but the WordDelimiterFilter still does, somewhat mitigating the problem
> as it becomes:
>
> HTMLSCF een abonnement
> WT een abonnement
> WDF een eenabonnement abonnement
>
> Should the WhitespaceTokenizer not handle this weird edge case?
>
> Cheers,
> Markus