I forgot to mention that I'm using Solr 7.2. I also found that if I use the long form \p{Letter} instead of \p{L}, Solr fails to load the collection when I reload it after updating the schema. I think that Solr's regex support is not standard Java 8.
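One way to check whether this behavior comes from Solr or from the regexes themselves is to run the two patterns from the fieldType quoted below through plain java.util.regex, outside Solr. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and it only approximates the char filter chain by applying the two replacements in order. Note that in java.util.regex the sequence `:-_` inside a character class is read as a range from `:` (U+003A) to `_` (U+005F), and that range includes A-Z, so it is worth looking at each filter separately.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CharFilterRegexCheck {
    // Patterns copied from the two PatternReplaceCharFilterFactory entries.
    static final String PUNCT_PATTERN = "([\\.,;:-_])";
    static final String NON_LETTER_PATTERN = "([^\\p{L}\\p{M}\\p{Digit} ])";

    // First char filter on its own: every match becomes a single space.
    static String applyPunctFilter(String input) {
        return Pattern.compile(PUNCT_PATTERN).matcher(input).replaceAll(" ");
    }

    // Second char filter on its own: every match is deleted.
    static String applyNonLetterFilter(String input) {
        return Pattern.compile(NON_LETTER_PATTERN).matcher(input).replaceAll("");
    }

    // Both replacements in the order they appear in the fieldType.
    static String applyBoth(String input) {
        return applyNonLetterFilter(applyPunctFilter(input));
    }

    // Reports whether plain Java accepts a regex at all, e.g. "\\p{Letter}".
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Test term from the Analyzer screen; \u6c34 is the CJK character in it.
        String term = "12\u6c343-23-ER1:abc";

        // \p{L} pattern alone: E and R survive, only '-' and ':' are removed.
        System.out.println("[" + applyNonLetterFilter(term) + "]");

        // Punctuation pattern alone: E, R and ':' each become a space,
        // because they all fall inside the ':' to '_' range.
        System.out.println("[" + applyPunctFilter(term) + "]");

        // Both in sequence.
        System.out.println("[" + applyBoth(term) + "]");

        // Whether the long property name is accepted by this JVM.
        System.out.println(compiles("\\p{Letter}"));
    }
}
```

If the plain-Java run also loses "ER" at the first step, the capitals are being replaced by the range in the first pattern rather than by \p{L}. Putting the hyphen last in the class, for example [\.,;:_-], or escaping it, is the usual way to make it a literal hyphen.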
-----Original Message-----
From: Webster Homer <webster.ho...@milliporesigma.com>
Sent: Friday, September 27, 2019 9:09 AM
To: solr-user@lucene.apache.org
Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory

I am developing a new version of a fieldtype that we’ve been using for several years. This fieldtype is used as part of our autocomplete code. The original version handled standard ASCII characters well, but I wanted it to handle any Unicode letter: not just A-Za-z, but Greek and Chinese as well. The analysis chain is supposed to remove any character that is not a letter, digit or space.

I settled on this fieldType. The main change from the old version is that I moved the character removal from a PatternReplaceFilterFactory call to a PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two filter factories handle this regex:

([^\p{L}\p{M}\p{Digit} ])

Here is the fieldtype:

<fieldType name="autocomplete_edge_v2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

The problem I’m seeing is that the call:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />

strips out letters that match A-Z. It leaves digits, lowercase letters and Chinese characters. I tested my regex with a couple of online regex testers and it works there. It seems that only solr.PatternReplaceCharFilterFactory has this behavior.

Here is what I see in the Analyzer using this test term:

12水3-23-ER1:abc

After the PRCF I see this:

12水323 1 abc

The “ER” is removed. I think this is a bug, or am I doing something wrong?

I used this link as the source for my regex:
https://www.regular-expressions.info/unicode.html

It seems that Solr is treating \p{L} as matching only lowercase ASCII characters, while handling other Unicode characters correctly. For letters in the A-Z range it behaves as if the regex were \p{Ll}. I tried explicitly adding \p{Lu} and it made no difference: capital letters were still stripped.

This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you must not copy this message or attachment or disclose the contents to any other person. If you have received this transmission in error, please notify the sender immediately and delete the message and any attachment from your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability for any omissions or errors in this message which may arise as a result of E-Mail-transmission or for damages resulting from any unauthorized changes of the content of this message and any attachment thereto.
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not guarantee that this message is free of viruses and does not accept liability for any damages caused by any virus transmitted therewith. Click http://www.merckgroup.com/disclaimer to access the German, French, Spanish and Portuguese versions of this disclaimer.