Re: Solr Reference Guide issue for simplified tokenizers

2018-04-16 Thread Nikolay Khitrin
Yes, the Lucene RegExp javadoc seems a bit complicated, and even the tests do not cover all syntax variants. But the whole point is that the parser doesn't mangle any characters and uses backslashes only to distinguish syntax symbols from raw characters. The example might not be the best possible, but I think refe

Re: Solr Reference Guide issue for simplified tokenizers

2018-04-15 Thread Shawn Heisey
On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
> Given example is <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/> but Lucene's RegExp constructor consumes raw Unicode characters instead of the \t\r\n form, so correct configuration is
Looks like you're right about that exampl
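
For readers following the thread, here is a minimal Python sketch (not from the original messages) of what an XML parser passes on as the pattern attribute. XML has no backslash escapes, so the \t\r\n form arrives as literal backslash-letter pairs. The variant using character references is only a hypothetical alternative for getting raw tab, CR and LF characters into the attribute, and is not necessarily the configuration proposed in the thread. The sketch does not call Lucene; how RegExp treats the backslashes is as described by the posters.

```python
import xml.etree.ElementTree as ET

# The element as quoted in the thread: backslash sequences typed into the attribute.
ESCAPED = r'<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>'

# Hypothetical alternative (an assumption, not taken from the thread):
# XML character references for tab (9), carriage return (13) and line feed (10).
CHAR_REFS = '<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#9;&#13;&#10;]+"/>'


def pattern_chars(xml_text):
    """Return the characters of the pattern attribute after XML parsing."""
    return list(ET.fromstring(xml_text).get("pattern"))


escaped = pattern_chars(ESCAPED)
char_refs = pattern_chars(CHAR_REFS)

# repr() makes the difference visible: backslash plus letter versus real control characters.
print([repr(c) for c in escaped])
print([repr(c) for c in char_refs])
```

In the first case the string contains a backslash followed by t, r and n; only the second contains actual whitespace control characters.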

Solr Reference Guide issue for simplified tokenizers

2018-04-15 Thread Nikolay Khitrin
I think I found an issue in the Solr Reference Guide entry for the Simplified Regular Expression Pattern [Splitting] Tokenizer (https://lucene.apache.org/solr/guide/7_3/tokenizers.html#simplified-regular-expression-pattern-splitting-tokenizer). Given example is <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/> but Lucene's RegExp constructor consu