On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
Given example is <analyzer> <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/></analyzer> but Lucene's RegExp constructor consumes raw unicode characters instead of \t\r\n form, so correct configuration is <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>

Looks like you're right about that example not working.  I also tried it with double backslashes, which would be required if the string were in actual Java code, and that didn't help either.  Your suggested replacement DOES work: the XML parser decodes the character references, so the tokenizer's constructor receives the actual tab, newline, and carriage return characters.

I cannot make any sense out of the Lucene RegExp javadoc.  I think it needs some full string examples to illustrate what it is trying to say.

I don't think this is a good example for this particular tokenizer, even if it's changed to your replacement that does work.  For what the example is TRYING to do, WhitespaceTokenizerFactory is a better choice.  It will match more whitespace characters than spaces, tabs, and newlines.
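
For the whitespace case, something like this minimal sketch should be enough (it assumes the factory's default settings are acceptable for the field):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>

SimplePatternSplitTokenizerFactory seems better suited to splitting on a specific delimiter.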

Here's an example using SimplePatternSplitTokenizerFactory that will split on semicolons, with TrimFilterFactory eliminating leading/trailing whitespace from each token:

<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>

Thanks,
Shawn
