RE: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Webster Homer Fri, 27 Sep 2019 07:42:49 -0700

I forgot to mention that I'm using Solr 7.2. I also found that if I use the 
long form \p{Letter} instead of \p{L}, Solr will not load the collection when 
I reload it after updating the schema. I think that Solr's regex support is 
not standard Java 8.
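
One way to separate "Solr" from "Java" here is to compile the property names 
with plain java.util.regex on the same JVM. This is only a rough sketch: the 
class and method names are made up for illustration, and it says nothing 
about how Solr loads the schema.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper: reports whether java.util.regex accepts a pattern.
final class PropertyNameCheck {

    // Returns true if the regex compiles, false on a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Short form, long form, and the "Is"-prefixed form of the property.
        String[] candidates = {"\\p{L}", "\\p{Letter}", "\\p{IsLetter}"};
        for (String regex : candidates) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
    }
}
```

If a name fails to compile here as well, the collection reload failure would 
come from the JVM's regex syntax rather than from anything Solr-specific.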

-----Original Message-----
From: Webster Homer <webster.ho...@milliporesigma.com>
Sent: Friday, September 27, 2019 9:09 AM
To: solr-user@lucene.apache.org
Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory

I am developing a new version of a fieldtype that we’ve been using for several 
years. This fieldtype is used as part of our autocomplete code. The original 
version handled standard ASCII characters well, but I wanted it to be able to 
handle any Unicode letter: not just A-Za-z, but Greek and Chinese as well. The 
analysis chain is supposed to remove any character that is not a letter, digit 
or space.
I settled on this fieldType. The main change from the old version is that I 
moved the character removal from a PatternReplaceFilterFactory call to a 
PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
filter factories handle this regex:
([^\p{L}\p{M}\p{Digit} ])
Here is the fieldtype:
   <fieldType name="autocomplete_edge_v2" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
      </analyzer>
   </fieldType>

The problem I’m seeing is that the call:
         <charFilter class="solr.PatternReplaceCharFilterFactory" 
pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />

strips out letters that match A-Z. It leaves digits, lowercase letters and 
Chinese characters. I tested my regex with a couple of online regex testers and 
it works there. It seems that only the solr.PatternReplaceCharFilterFactory has 
this behavior. Here is what I see in the Analyzer using this test term: 
12水3-23-ER1:abc
After the PRCF I see this: 12水323 1 abc
The “ER” is removed. I think this is a bug, or am I doing something wrong?
I used this link as the source for my regex: 
https://www.regular-expressions.info/unicode.html
It seems that Solr is treating \p{L} as matching only lowercase ASCII 
characters, while handling other Unicode characters correctly. For letters in 
the A-Z range it behaves as if the regex were \p{Ll}. I tried explicitly adding 
\p{Lu} and it made no difference; capital letters were still stripped.
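
One way to narrow this down outside Solr is to run both charFilter patterns 
from the fieldType through plain java.util.regex, one at a time. The sketch 
below is only an approximation: the class and method names are invented for 
illustration, and it does not go through Solr's analysis chain. It is worth 
checking the first pattern on its own, because inside a character class the 
sequence :-_ is parsed as a range from ':' (U+003A) to '_' (U+005F), and that 
range includes A-Z.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the two PatternReplaceCharFilterFactory steps.
final class CharClassRangeDemo {

    // First charFilter pattern from the fieldType, copied verbatim.
    static final Pattern PUNCTUATION = Pattern.compile("([\\.,;:-_])");

    // Second charFilter pattern from the fieldType, copied verbatim.
    static final Pattern NON_LETTER =
            Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    // Mirrors replacement=" " on the first charFilter.
    static String applyFirst(String s) {
        return PUNCTUATION.matcher(s).replaceAll(" ");
    }

    // Mirrors replacement="" on the second charFilter.
    static String applySecond(String s) {
        return NON_LETTER.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // The test term from the message; \u6c34 is the Chinese character.
        String term = "12\u6c343-23-ER1:abc";
        String afterFirst = applyFirst(term);
        System.out.println("after first pattern:  [" + afterFirst + "]");
        System.out.println("after both patterns:  [" + applySecond(afterFirst) + "]");
        System.out.println("second pattern alone: [" + applySecond(term) + "]");
    }
}
```

In this sketch the second pattern alone keeps "ER", while the first pattern 
alone already turns "ER" into two spaces, so the uppercase letters are gone 
before the \p{L} pattern ever sees them. If that matches what Solr does, 
escaping the hyphen or moving it to the end of the class (for example 
[.,;:_-]) would make it a literal hyphen. The two-step result here has two 
spaces where "ER" was; the single space in the Analyzer output quoted above 
may simply be whitespace collapsed for display.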

This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith. Click http://www.merckgroup.com/disclaimer 
to access the German, French, Spanish and Portuguese versions of this 
disclaimer.
