Check the log files from the collection reload. About your regex: test it with an online tester that uses the Java regex engine, since there can be subtle differences between Java, JavaScript, PHP etc. It could also be that your original text is not UTF-8 encoded but uses a Windows code page or similar. Also check whether the text contains special characters (line breaks, tabs etc.).
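If you would rather test outside a web page, here is a minimal sketch that runs the two patterns from your fieldType through plain java.util.regex. It assumes that Solr hands the pattern attribute to java.util.regex unchanged, which I have not verified for 7.2. The class and method names (RegexCheck, punctToSpace, punctToSpaceFixed, stripNonLetters, codePoints) are made up for this example, and the "fixed" pattern is only my suggestion, not something from your schema.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; none of these names come from Solr.
class RegexCheck {
    // Copied from the fieldType: inside [...] the sequence ":-_" is a range
    // from ':' (U+003A) to '_' (U+005F), and that range contains A-Z.
    static final Pattern PUNCT = Pattern.compile("([\\.,;:-_])");

    // Suggested variant (an assumption about the intent): a hyphen placed
    // last in the class is a literal '-', so no range is formed.
    static final Pattern PUNCT_FIXED = Pattern.compile("([\\.,;:_-])");

    // Copied from the fieldType: anything that is not a letter, mark,
    // ASCII digit or space.
    static final Pattern NON_LETTER =
            Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    static String punctToSpace(String s) {
        return PUNCT.matcher(s).replaceAll(" ");
    }

    static String punctToSpaceFixed(String s) {
        return PUNCT_FIXED.matcher(s).replaceAll(" ");
    }

    static String stripNonLetters(String s) {
        return NON_LETTER.matcher(s).replaceAll("");
    }

    // Lists the code points of a string, to spot tabs, line breaks or
    // characters that were mangled by a wrong encoding.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // \u6C34 is the Chinese character from the test term in the mail.
        String term = "12\u6C343-23-ER1:abc";

        // Second char filter alone: uppercase letters survive.
        System.out.println("[" + stripNonLetters(term) + "]");
        // Both char filters in schema order: "ER" becomes two spaces.
        System.out.println("[" + stripNonLetters(punctToSpace(term)) + "]");
        // Same chain with the hyphen moved to the end of the class.
        System.out.println("[" + stripNonLetters(punctToSpaceFixed(term)) + "]");

        System.out.println(codePoints("a\tb")); // U+0061 U+0009 U+0062
    }
}
```

In plain Java the \p{L} pattern on its own keeps "ER" (the first line printed is [12水323ER1abc]). The uppercase letters only disappear once the first pattern runs, because it replaces them with spaces before the second filter sees them (second line: [12水323  1 abc], with two spaces where "ER" was). With the hyphen moved to the end, the chain prints [12水3 23 ER1 abc]. So it may be worth looking at pattern="([\.,;:-_])" before suspecting \p{L}.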
> On 27.09.2019 at 16:42, Webster Homer
> <webster.ho...@milliporesigma.com> wrote:
>
> I forgot to mention that I'm using Solr 7.2. I also found that if instead of
> \p{L} I use the long form \p{Letter} then when I reload the collection after
> updating the schema, Solr will not load the collection. I think that Solr's
> regex support is not standard Java 8.
>
> -----Original Message-----
> From: Webster Homer <webster.ho...@milliporesigma.com>
> Sent: Friday, September 27, 2019 9:09 AM
> To: solr-user@lucene.apache.org
> Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory
>
> I am developing a new version of a fieldtype that we’ve been using for
> several years. This fieldtype is to be used as a part of an autocomplete
> code. The original version handled standard ASCII characters well, but I
> wanted it to be able to handle any Unicode letter, not just A-Za-z but Greek
> and Chinese as well. The analysis chain is supposed to remove any character
> that is not a letter, digit or space.
>
> I settled on this fieldType. The main change from the old version is that I
> moved the character removal from a PatternReplaceFilterFactory call to a
> PatternReplaceCharFilterFactory.
> The problem I’m seeing is in how the two filter factories handle this regex:
> ([^\p{L}\p{M}\p{Digit} ])
>
> Here is the fieldtype:
>
> <fieldType name="autocomplete_edge_v2" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([\.,;:-_])" replacement=" "/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_en.txt"/>
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30"
>             minGramSize="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([\.,;:-_])" replacement=" "/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_en.txt"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
>
> The problem I’m seeing is that the call:
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>             pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>
> strips out letters that match A-Z. It will leave digits, lowercase letters
> and Chinese characters. I tested my regex with a couple of online regex
> testers and it works. It seems that only the
> solr.PatternReplaceCharFilterFactory has this behavior.
> Here is what I see in the Analyzer.
> Using this test term: 12水3-23-ER1:abc
> After the PRCF I see this: 12水323 1 abc
> The “ER” is removed. I think this is a bug, or am I doing something wrong?
>
> I used this link as the source for my regex:
> https://www.regular-expressions.info/unicode.html
>
> It seems that Solr is treating \p{L} as matching lowercase ASCII characters,
> but is correct for other Unicode characters. For letters in the A-Z range it
> is behaving as if the regex was \p{Ll}. I tried explicitly adding \p{Lu} in
> and it made no difference; capital letters were still stripped.
>
> This message and any attachment are confidential and may be privileged or
> otherwise protected from disclosure. If you are not the intended recipient,
> you must not copy this message or attachment or disclose the contents to any
> other person. If you have received this transmission in error, please notify
> the sender immediately and delete the message and any attachment from your
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
> accept liability for any omissions or errors in this message which may arise
> as a result of E-Mail-transmission or for damages resulting from any
> unauthorized changes of the content of this message and any attachment
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not
> guarantee that this message is free of viruses and does not accept liability
> for any damages caused by any virus transmitted therewith. Click
> http://www.merckgroup.com/disclaimer to access the German, French, Spanish
> and Portuguese versions of this disclaimer.