Is this a suggestion for a JIRA ticket? Or a question on how to solve it? If the latter, you could probably stick a RegEx replacement in the UpdateRequestProcessor chain and be done with it.
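As a rough illustration of the regex replacement mentioned above, here is a minimal standalone Java sketch. It is not Solr code: the class name NbspNormalizer and the method normalize are made up for this example. In Solr itself the same pattern and replacement would be configured on a regex-replace processor in the update chain (I believe RegexReplaceProcessorFactory is the one to look at, but check the documentation for your version).

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr.
public class NbspNormalizer {
    // U+00A0 NO-BREAK SPACE. Its UTF-8 encoding is the byte pair C2 A0.
    private static final Pattern NBSP = Pattern.compile("\u00A0");

    /** Replaces every no-break space with a plain ASCII space. */
    public static String normalize(String input) {
        return NBSP.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The two words are joined by a no-break space, not a normal space.
        String raw = "een\u00A0abonnement";
        System.out.println(normalize(raw));
    }
}
```

Doing the replacement at update time means the stored value is cleaned as well, not only the indexed tokens.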
As to why? I would look for the rest of the MSWord-generated artifacts, such as "smart" quotes, extra-long dashes, etc.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 8 October 2014 09:59, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> For some crazy reason, some users somehow manage to substitute a perfectly
> normal space with a badly encoded non-breaking space, properly URL encoded
> this then becomes %c2a0 and depending on the encoding you use to view you
> probably see Â followed by a space. For example:
>
> Because c2a0 is not considered whitespace (indeed, it is not real whitespace,
> that is 00a0) by the Java Character class, the WhitespaceTokenizer won't
> split on it, but the WordDelimiterFilter still does, somehow mitigating the
> problem as it becomes:
>
> HTMLSCF een abonnement
> WT een abonnement
> WDF een eenabonnement abonnement
>
> Should the WhitespaceTokenizer not include this weird edge case?
>
> Cheers,
> Markus
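
For reference, the behaviour described in the quoted message can be checked with plain Java. This is only a sketch with made-up names (NbspWhitespaceCheck and its methods), and it assumes the tokenizer decides what to split on via Character.isWhitespace, which is my understanding of WhitespaceTokenizer but worth verifying against your Lucene version.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class NbspWhitespaceCheck {
    /** True if Java's Character class treats the character as whitespace. */
    public static boolean isJavaWhitespace(char c) {
        return Character.isWhitespace(c);
    }

    /** Decodes raw bytes as UTF-8. */
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, as a misconfigured viewer might. */
    public static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC2, (byte) 0xA0};

        // C2 A0 read as UTF-8 is the single character U+00A0.
        String utf8 = decodeUtf8(raw);
        System.out.println(utf8.length());
        System.out.println(isJavaWhitespace(utf8.charAt(0)));

        // isSpaceChar follows the Unicode space-separator category instead.
        System.out.println(Character.isSpaceChar(utf8.charAt(0)));

        // Read as ISO-8859-1, the same bytes are two characters: U+00C2, U+00A0.
        System.out.println(decodeLatin1(raw).length());
    }
}
```

Character.isWhitespace explicitly excludes the non-breaking space characters, while Character.isSpaceChar accepts them, so a filter keyed on the latter would split where the former does not.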