Re: Solr Reference Guide issue for simplified tokenizers

Shawn Heisey Sun, 15 Apr 2018 11:09:06 -0700

On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:

Given example is <analyzer> <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[\t\r\n]+"/></analyzer> but Lucene's RegExp constructor consumes raw unicode characters instead of \t\r\n form, so correct configuration is <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[	&#xA;]+"/>

Looks like you're right about that example not working. I also tried it with double backslashes -- something that would be required if the string were found in actual Java code. Your suggested replacement DOES work -- the characters are encoded with XML syntax and passed as ASCII/Unicode to the constructor for the tokenizer.
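
To illustrate the difference, here is a minimal sketch in Python (not Solr code) of what an XML parser hands to an application for each form of the attribute. It assumes Solr's config parsing behaves like a standard XML parser in this respect; the helper name pattern_attribute and the specific character references (&#x9;, &#xD;, &#xA;) are chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def pattern_attribute(xml_text):
    """Return the 'pattern' attribute as the XML parser delivers it."""
    return ET.fromstring(xml_text).get("pattern")


# Backslash escapes mean nothing to XML, so they arrive untouched
# as a literal backslash followed by a letter.
escaped = pattern_attribute(r'<tokenizer pattern="[ \t\r\n]+"/>')

# Character references are decoded by the parser into the real
# tab, carriage return, and line feed characters.
# (These particular references are an illustrative assumption.)
decoded = pattern_attribute('<tokenizer pattern="[ &#x9;&#xD;&#xA;]+"/>')

print(repr(escaped))
print(repr(decoded))
```

In the first case the regular expression engine receives backslash-letter pairs and has to interpret them itself; in the second it receives the raw characters, which matches the behavior described above.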

I cannot make any sense out of the Lucene RegExp javadoc. I think it needs some full string examples to illustrate what it is trying to say.

I don't think this is a good example for this particular tokenizer, even if it's changed to your replacement that does work. For what the example is TRYING to do, WhitespaceTokenizerFactory is a better choice. It will match more whitespace characters than spaces, tabs, and newlines.
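
As a rough illustration of that last point, the Python sketch below compares splitting on the narrow character class with splitting on any whitespace. It only approximates the tokenizers: Python's idea of whitespace is assumed to be close to, but not identical with, the one Lucene uses, and the sample string is made up.

```python
import re

# Sample text separated by a tab, a vertical tab, and an em space.
text = "alpha\tbeta\x0bgamma\u2003delta"

# Narrow class from the example: space, tab, CR, LF only.
narrow = re.split(r"[ \t\r\n]+", text)

# Any run of whitespace, roughly what a whitespace tokenizer does.
broad = text.split()

print(narrow)
print(broad)
```

The narrow class splits only at the tab and leaves the vertical tab and em space inside a token, while the broader split separates all four words.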

Here's an example using SimplePatternSplitTokenizerFactory that will split on semicolons and eliminate leading/trailing whitespace from each token:


<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>
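
For a sense of what that analyzer produces, here is a small Python emulation. It is a sketch of the intended effect, not of Solr's implementation; the function name split_and_trim and the sample input are hypothetical, and edge cases such as empty tokens may behave differently in Solr.

```python
def split_and_trim(text, separator=";"):
    """Split on the separator, then strip surrounding whitespace from each piece."""
    return [piece.strip() for piece in text.split(separator)]


tokens = split_and_trim("red ;  green;blue ")
print(tokens)
```

The split step corresponds to the tokenizer with pattern=";" and the strip step to TrimFilterFactory.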

Thanks,
Shawn
