Re: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Erick Erickson Fri, 27 Sep 2019 12:47:28 -0700

Solr’s pattern replace _is_ Java’s. See PatternReplaceCharFilter. You’ll see:


private final Pattern pattern;

and later:
final Matcher m = pattern.matcher(input);

That said, there’s some manipulation after that, so there’s always room for 
issues. But I’d try just a standard Java program with your regex to verify 
rather than online sources.
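
A minimal sketch of such a standalone check, assuming the two charFilter patterns and the test term from the quoted message below. The class name CharFilterRegexCheck, the helper run, and the escaped-hyphen variant PUNCT_ESCAPED are illustrative only; they are not part of Solr or of the original schema. The sketch applies java.util.regex directly and does not reproduce whatever PatternReplaceCharFilter does after matching.

```java
import java.util.regex.Pattern;

final class CharFilterRegexCheck {
    // First charFilter pattern from the fieldType, copied as written (Java string escaping added).
    public static final Pattern PUNCT = Pattern.compile("([\\.,;:-_])");
    // Second charFilter pattern: anything that is not a letter, mark, ASCII digit or space.
    public static final Pattern NON_LETTER = Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");
    // Hypothetical variant for comparison: the hyphen is escaped so it is a literal character.
    public static final Pattern PUNCT_ESCAPED = Pattern.compile("([\\.,;:\\-_])");

    // Applies the replacements in the same order as the analyzer's charFilters:
    // first pattern -> " ", then NON_LETTER -> "".
    public static String run(Pattern first, String input) {
        String afterFirst = first.matcher(input).replaceAll(" ");
        return NON_LETTER.matcher(afterFirst).replaceAll("");
    }

    public static void main(String[] args) {
        // Test term from the thread; \u6c34 is the Chinese character in it.
        String term = "12\u6c343-23-ER1:abc";
        System.out.println("as written:     [" + run(PUNCT, term) + "]");
        System.out.println("escaped hyphen: [" + run(PUNCT_ESCAPED, term) + "]");
        // Which of the two patterns actually matches an uppercase ASCII letter?
        System.out.println("PUNCT matches E:      " + PUNCT.matcher("E").matches());
        System.out.println("NON_LETTER matches E: " + NON_LETTER.matcher("E").matches());
    }
}
```

In plain Java, `:-_` inside a character class is parsed as a range from ':' (U+003A) to '_' (U+005F), and that range includes A-Z. The first pattern would therefore replace E and R with spaces before \p{L} is ever consulted, which may be worth ruling out before suspecting Solr.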

Best,
Erick

> On Sep 27, 2019, at 2:24 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Check the log files from the collection reload.
> About your regex: test it on a web page that checks Java regexes; there can 
> be subtle differences between Java, JavaScript, PHP, etc.
> It could also be that your original text is not UTF-8 encoded but uses a 
> Windows code page or similar. 
> Also check whether the text contains special characters (line breaks, tabs, 
> etc.).
> 
>> On 27.09.2019 at 16:42, Webster Homer 
>> <webster.ho...@milliporesigma.com> wrote:
>> 
>> I forgot to mention that I'm using Solr 7.2. I also found that if I use the 
>> long form \p{Letter} instead of \p{L}, Solr will not load the collection 
>> when I reload it after updating the schema. I think that Solr's regex 
>> support is not standard Java 8.
>> 
>> -----Original Message-----
>> From: Webster Homer <webster.ho...@milliporesigma.com>
>> Sent: Friday, September 27, 2019 9:09 AM
>> To: solr-user@lucene.apache.org
>> Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory
>> 
>> I am developing a new version of a fieldtype that we’ve been using for 
>> several years. This fieldtype is used as part of our autocomplete code. 
>> The original version handled standard ASCII characters well, but I wanted 
>> it to handle any Unicode letter, not just A-Za-z but Greek and Chinese as 
>> well. The analysis chain is supposed to remove any character that is not a 
>> letter, digit or space.
>> I settled on this fieldType. The main change from the old version is that I 
>> moved the character removal from a PatternReplaceFilterFactory call to a 
>> PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
>> filter factories handle this regex:
>> ([^\p{L}\p{M}\p{Digit} ])
>> Here is the fieldtype:
>>  <fieldType name="autocomplete_edge_v2" class="solr.TextField" 
>> positionIncrementGap="100">
>>     <analyzer type="index">
>>        <charFilter class="solr.MappingCharFilterFactory" 
>> mapping="mapping-ISOLatin1Accent.txt"/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([\.,;:-_])" replacement=" "/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
>> words="lang/stopwords_en.txt"/>
>>         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" 
>> minGramSize="1"/>
>>      </analyzer>
>>     <analyzer type="query">
>>        <charFilter class="solr.MappingCharFilterFactory" 
>> mapping="mapping-ISOLatin1Accent.txt"/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([\.,;:-_])" replacement=" "/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
>> words="lang/stopwords_en.txt"/>
>>        <filter class="solr.PatternReplaceFilterFactory" 
>> pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>>    </analyzer>
>>   </fieldType>
>> 
>> The problem I’m seeing is that the call:
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>> 
>> strips out letters that match A-Z. It will leave digits, lowercase letters 
>> and Chinese characters. I tested my regex with a couple of online regex 
>> testers and it works. It seems that only the 
>> solr.PatternReplaceCharFilterFactory has this behavior. Here is what I see 
>> in the Analyzer using this test term: 12水3-23-ER1:abc
>> After the PRCF I see this: 12水323 1 abc
>> The “ER” is removed. I think this is a bug, or am I doing something wrong?
>> I used this link as the source for my regex: 
>> https://www.regular-expressions.info/unicode.html
>> It seems that Solr is treating \p{L} as matching only lowercase ASCII 
>> letters, while handling other Unicode characters correctly. For letters in 
>> the A-Z range it behaves as if the regex were \p{Ll}. I tried explicitly 
>> adding \p{Lu} and it made no difference; capital letters were still 
>> stripped.
>> 
>> This message and any attachment are confidential and may be privileged or 
>> otherwise protected from disclosure. If you are not the intended recipient, 
>> you must not copy this message or attachment or disclose the contents to any 
>> other person. If you have received this transmission in error, please notify 
>> the sender immediately and delete the message and any attachment from your 
>> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
>> accept liability for any omissions or errors in this message which may arise 
>> as a result of E-Mail-transmission or for damages resulting from any 
>> unauthorized changes of the content of this message and any attachment 
>> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
>> guarantee that this message is free of viruses and does not accept liability 
>> for any damages caused by any virus transmitted therewith. Click 
>> http://www.merckgroup.com/disclaimer to access the German, French, Spanish 
>> and Portuguese versions of this disclaimer.
