Solr’s pattern replace _is_ Java’s. See PatternReplaceCharFilter. You’ll see:

private final Pattern pattern;

and later:

final Matcher m = pattern.matcher(input);

That said, there’s some manipulation after that, so there’s always room for issues. But I’d try just a standard Java program with your regex to verify rather than online sources.

Best,
Erick

> On Sep 27, 2019, at 2:24 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Check the log files on the collection reload.
> About your regex: check a web page that checks Java regexes - there can be
> subtle differences between Java, JavaScript, PHP etc.
> Then it could be that your original text is not UTF-8 encoded, but Windows or
> similar.
> Check also if you have special characters in the text (line breaks, tabs
> etc.).
>
>> On 27.09.2019 at 16:42, Webster Homer
>> <webster.ho...@milliporesigma.com> wrote:
>>
>> I forgot to mention that I'm using Solr 7.2. I also found that if instead
>> of \p{L} I use the long form \p{Letter} then when I reload the collection
>> after updating the schema, Solr will not load the collection. I think that
>> Solr's regex support is not standard Java 8.
>>
>> -----Original Message-----
>> From: Webster Homer <webster.ho...@milliporesigma.com>
>> Sent: Friday, September 27, 2019 9:09 AM
>> To: solr-user@lucene.apache.org
>> Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory
>>
>> I am developing a new version of a fieldtype that we’ve been using for
>> several years. This fieldtype is to be used as part of our autocomplete
>> code. The original version handled standard ASCII characters well, but I
>> wanted it to be able to handle any Unicode letter, not just A-Za-z but Greek
>> and Chinese as well. The analysis chain is supposed to remove any character
>> that is not a letter, digit or space.
>> I settled on this fieldType. The main change from the old version is that I
>> moved the character removal from a PatternReplaceFilterFactory call to a
>> PatternReplaceCharFilterFactory.
>> The problem I’m seeing is in how the two filter factories handle this regex:
>> ([^\p{L}\p{M}\p{Digit} ])
>> Here is the fieldtype:
>> <fieldType name="autocomplete_edge_v2" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\.,;:-_])" replacement=" "/>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>>   </analyzer>
>> </fieldType>
>>
>> The problem I’m seeing is that the call:
>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>
>> strips out letters that match A-Z. It will leave digits, lowercase letters
>> and Chinese characters. I tested my regex with a couple of online regex
>> testers and it works. It seems that only the
>> solr.PatternReplaceCharFilterFactory has this behavior.
>> Here is what I see in the Analyzer.
>> Using this test term: 12水3-23-ER1:abc
>> After the PRCF I see this: 12水323 1 abc
>> The “ER” is removed. I think this is a bug, or am I doing something wrong?
>> I used this link as the source for my regex:
>> https://www.regular-expressions.info/unicode.html
>> It seems that Solr is treating \p{L} as matching lower case ASCII
>> characters, but is correct for other Unicode characters. For letters in the
>> A-Z range it is behaving as if the regex was \p{Ll}. I tried explicitly
>> adding \p{Lu} in and it made no difference; capital letters were still
>> stripped.
>>
>> This message and any attachment are confidential and may be privileged or
>> otherwise protected from disclosure. If you are not the intended recipient,
>> you must not copy this message or attachment or disclose the contents to any
>> other person. If you have received this transmission in error, please notify
>> the sender immediately and delete the message and any attachment from your
>> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
>> accept liability for any omissions or errors in this message which may arise
>> as a result of E-Mail-transmission or for damages resulting from any
>> unauthorized changes of the content of this message and any attachment
>> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
>> guarantee that this message is free of viruses and does not accept liability
>> for any damages caused by any virus transmitted therewith. Click
>> http://www.merckgroup.com/disclaimer to access the German, French, Spanish
>> and Portuguese versions of this disclaimer.
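
For anyone wanting to try the "standard Java program" check suggested at the top of this thread, here is a minimal sketch. It only runs the two regexes from the quoted fieldType through java.util.regex in schema order; it does not reproduce Solr's MappingCharFilterFactory, tokenizer, or token filters, and it has not been run against Solr 7.2. The class name RegexCheck, the helper method names, and the PUNCT_LITERAL_HYPHEN variant are made up for illustration. One thing it suggests: inside a character class, `:-_` is a range from ':' (U+003A) to '_' (U+005F), and that range covers A-Z, so the first pattern may be turning uppercase ASCII letters into spaces before the \p{L} pattern ever sees them. It also checks whether the long form \p{Letter} compiles, since as far as I know java.util.regex accepts \p{L} and \p{IsL} but not \p{Letter}.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexCheck {
    // The two patterns exactly as they appear in the charFilter definitions.
    static final Pattern PUNCT = Pattern.compile("([\\.,;:-_])");
    static final Pattern NON_LETTER = Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    // Assumption: a literal hyphen was intended, so it is moved to the end of
    // the class, where it can no longer form a range. Not from the thread.
    static final Pattern PUNCT_LITERAL_HYPHEN = Pattern.compile("([.,;:_-])");

    // First charFilter: replace each match with a single space.
    static String applyPunct(String s) {
        return PUNCT.matcher(s).replaceAll(" ");
    }

    // Second charFilter: delete anything that is not a letter, mark, digit or space.
    static String applyNonLetter(String s) {
        return NON_LETTER.matcher(s).replaceAll("");
    }

    // Hypothetical variant of the first charFilter with a literal hyphen.
    static String applyPunctLiteralHyphen(String s) {
        return PUNCT_LITERAL_HYPHEN.matcher(s).replaceAll(" ");
    }

    // True if java.util.regex accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Test term from the thread; the escape is the Chinese character U+6C34.
        String term = "12\u6c343-23-ER1:abc";

        // Second pattern on its own: "ER" survives, only '-' and ':' are removed.
        System.out.println("[" + applyNonLetter(term) + "]");

        // Both patterns in schema order: 'E', 'R' and ':' become spaces first,
        // then the hyphens are removed.
        System.out.println("[" + applyNonLetter(applyPunct(term)) + "]");

        // With the hyphen treated as a literal, "ER" is kept.
        System.out.println("[" + applyNonLetter(applyPunctLiteralHyphen(term)) + "]");

        // Short versus long form of the Unicode letter category.
        System.out.println("\\p{L} compiles: " + compiles("\\p{L}"));
        System.out.println("\\p{Letter} compiles: " + compiles("\\p{Letter}"));
    }
}
```

If the second line of output shows "ER" replaced by spaces, the behavior in the Analysis screen comes from the first PatternReplaceCharFilterFactory rather than from \p{L}. Escaping the hyphen or moving it to the end of the class, as in the hypothetical variant, would be one way to test that in the schema.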