On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
> Given example is
>
>   <analyzer>
>     <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
>   </analyzer>
>
> but Lucene's RegExp constructor consumes raw unicode characters instead of the \t\r\n form, so the correct configuration is
>
>   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>
Looks like you're right about that example not working.  I also tried it with double backslashes -- something that would be required if the string were in actual Java code -- and that didn't work either.  Your suggested replacement DOES work -- the whitespace characters are encoded with XML character references and passed to the tokenizer's constructor as the actual ASCII/Unicode characters.
I cannot make any sense out of the Lucene RegExp javadoc. I think it needs some full string examples to illustrate what it is trying to say.
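To illustrate what I think the escape handling actually does, here's a small standalone sketch against the org.apache.lucene.util.automaton classes.  The class name and the expected results in the comments are mine, based on my reading of the grammar, so treat it as a guess rather than a reference:

import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class RegExpEscapeCheck {
  public static void main(String[] args) {
    // The doc example's pattern, typed as a Java string literal.
    CharacterRunAutomaton escaped =
        new CharacterRunAutomaton(new RegExp("[ \\t\\r\\n]+").toAutomaton());
    // The same pattern built from the real whitespace characters.
    CharacterRunAutomaton literal =
        new CharacterRunAutomaton(new RegExp("[ \t\r\n]+").toAutomaton());

    // If \t inside the RegExp is just an escaped letter 't' rather than a tab,
    // these should print false, true, true.
    System.out.println(escaped.run("\t"));
    System.out.println(escaped.run("trn"));
    System.out.println(literal.run("\t"));
  }
}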
I don't think this is a good example for this particular tokenizer, even if it's changed to your replacement that does work.  For what the example is TRYING to do, WhitespaceTokenizerFactory is a better choice -- it matches a wider set of whitespace characters than just spaces, tabs, and newlines.
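Something like this minimal sketch would cover the whitespace case (untested as written, and you'd add whatever filters the field actually needs):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>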
Here's an example using SimplePatternSplitTokenizerFactory in a way that makes sense for it -- splitting on semicolons and trimming leading/trailing whitespace from each token:
<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>

Thanks,
Shawn