Yes, the Lucene RegExp javadoc does seem a bit complicated, and even the tests do not cover all syntax variants. But the point is that the parser doesn't mangle any characters: backslashes are used only to distinguish syntax symbols from raw characters.
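To illustrate where the difference comes from, here is a minimal Python sketch. It is not Solr or Lucene code: it uses the standard library XML parser as a stand-in for the config loader, on the assumption that any conforming XML parser treats attribute values the same way. The helper name pattern_attr and the stripped-down tokenizer elements are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def pattern_attr(xml_snippet: str) -> str:
    """Return the pattern attribute as the XML parser hands it to the application."""
    return ET.fromstring(xml_snippet).get("pattern")


# Backslash form: XML knows nothing about \t, so the attribute value still
# contains a literal backslash followed by a letter.
escaped = pattern_attr(r'<tokenizer pattern="[ \t\r\n]+"/>')

# Character-reference form: the XML parser decodes each reference into the
# raw character before the value reaches any factory or constructor.
entities = pattern_attr('<tokenizer pattern="[ &#x9;&#xA;&#xD;]+"/>')

print(repr(escaped))
print(repr(entities))
```

The first value still contains backslash-letter pairs, while the second contains the actual tab, line feed and carriage return characters, which is the form a parser that takes raw characters needs.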
The example may not be the best possible, but I think the reference guide should be corrected (perhaps with an additional note about character escaping), because it is difficult for end users who are not familiar with the Lucene codebase to find the correct solution.

Unfortunately, fine-grained tokenizing control is sometimes the only workaround for weird issues like LUCENE-7766. For example, I have to strip quotes at the tokenizer stage to obtain WDGF offsets on parts (for strings like "Foo-Bar", with HTMLStripCharFilter before the tokenizer) as a temporary solution.

2018-04-15 21:08 GMT+03:00 Shawn Heisey <apa...@elyograg.org>:

> On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
>
>> Given example is <analyzer> <tokenizer
>> class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ \t\r\n]+"/></analyzer> but Lucene's RegExp constructor consumes
>> raw unicode characters instead of \t\r\n form, so correct configuration is
>> <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ &#x9;&#xA;&#xD;]+"/>
>
> Looks like you're right about that example not working. I also tried it
> with double backslashes -- something that would be required if the string
> were found in actual java code. Your suggested replacement DOES work --
> the characters are encoded with XML syntax and passed as ascii/unicode to
> the constructor for the tokenizer.
>
> I cannot make any sense out of the Lucene RegExp javadoc. I think it
> needs some full string examples to illustrate what it is trying to say.
>
> I don't think this is a good example for this particular tokenizer, even
> if it's changed to your replacement that does work. For what the example
> is TRYING to do, WhitespaceTokenizerFactory is a better choice. It will
> match more whitespace characters than spaces, tabs, and newlines.
>
> Here's an example using that tokenizer that will split on semicolon and
> eliminate leading/trailing whitespace from each token:
>
> <analyzer>
>   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
>   <filter class="solr.TrimFilterFactory"/>
> </analyzer>
>
> Thanks,
> Shawn

--
Николай Хитрин