RE: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Webster Homer Fri, 27 Sep 2019 14:03:50 -0700

I already had some examples, but I also wrote a unit test against plain Java. 
It passes, so Solr is not handling \p{L} the way Java does.


Also saw some vague discussion in Oracle's documentation around \p{L}
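
As a side check on the property spellings, here is a minimal sketch that asks java.util.regex which forms it accepts. The class and helper names (PropertyNameCheck, compiles) are made up for illustration. The Java 8 Pattern javadoc lists \p{L} and \p{IsL} for the letter category, plus the gc= keyword form. It does not list a bare long name such as \p{Letter}, so that spelling is probably rejected with a PatternSyntaxException, which would be consistent with Solr refusing to load a schema that uses it. That last point is an assumption to verify on the JDK in use, not a confirmed result.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PropertyNameCheck {
    // Hypothetical helper: true if java.util.regex accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Spellings of "any Unicode letter" to try; the last one is the
        // long form that is suspected to be unsupported.
        String[] spellings = {"\\p{L}", "\\p{IsL}", "\\p{gc=L}", "\\p{Letter}"};
        for (String s : spellings) {
            System.out.println(s + " compiles: " + compiles(s));
        }
    }
}
```

Running main on the same JDK that Solr uses shows directly which spellings are safe to put in the schema.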

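    import static org.junit.Assert.assertTrue;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.junit.Test;

    public class UnicodeLetterTest { // class name is arbitrary
        @Test
        public void testUnicodeLetter() {
            // Same classes as the schema pattern, minus the negation
            Pattern pattern = Pattern.compile("[\\p{L}\\p{M}\\p{Digit}]+");
            String matchChina = "乙醇";
            Matcher match = pattern.matcher(matchChina);
            assertTrue(match.matches());
            String matchLower = "aaa";
            match = pattern.matcher(matchLower);
            assertTrue(match.matches());
            String matchUpper = "AAA";
            match = pattern.matcher(matchUpper);
            assertTrue(match.matches()); // should fail if Solr is correct
        }
    }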

My dev Java is jdk1.8.0_162, which isn't very current.
So this could be an issue with the Java version, or Solr is doing something 
more with the pattern.
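
One thing worth ruling out before blaming \p{L}: the first PatternReplaceCharFilterFactory pattern in the schema, ([\.,;:-_]), contains :-_ inside the character class. Java reads that as a range from ':' (58) to '_' (95), which covers A-Z (65-90); the hyphen itself is not in the class. Below is a minimal sketch that replays the two replacements with plain java.util.regex, assuming the char filters behave like Matcher.replaceAll. The class and helper names are hypothetical, and the test term is the one from the original report (水 is written as \u6c34 to avoid source-encoding issues).

```java
import java.util.regex.Pattern;

public class CharFilterChainSketch {
    // Patterns copied from the two PatternReplaceCharFilterFactory entries.
    static final Pattern PUNCT = Pattern.compile("([\\.,;:-_])");
    static final Pattern NON_LETTER = Pattern.compile("([^\\p{L}\\p{M}\\p{Digit} ])");

    // Hypothetical helper: output of the first char filter only.
    static String afterFirst(String input) {
        return PUNCT.matcher(input).replaceAll(" ");
    }

    // Hypothetical helper: both replacements, in schema order.
    static String applyChain(String input) {
        return NON_LETTER.matcher(afterFirst(input)).replaceAll("");
    }

    public static void main(String[] args) {
        String term = "12\u6c343-23-ER1:abc";
        System.out.println("after first filter: [" + afterFirst(term) + "]");
        System.out.println("after both filters: [" + applyChain(term) + "]");
        // Code points showing that A-Z sits inside the ':' to '_' range.
        System.out.println((int) ':' + " <= " + (int) 'A' + ".." + (int) 'Z' + " <= " + (int) '_');
    }
}
```

If this is what is happening, the uppercase letters are turned into spaces by the first filter, before the \p{L} pattern ever sees them. Moving the hyphen to the end of the class, as in ([\.,;:_-]), or escaping it would avoid the range.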

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
Sent: Friday, September 27, 2019 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Solr’s pattern replace _is_ Java’s. See PatternReplaceCharFilter. You’ll see:

private final Pattern pattern;

and later:
final Matcher m = pattern.matcher(input);

That said, there’s some manipulation after that, so there’s always room for 
issues. But I’d try just a standard Java program with your regex to verify 
rather than online sources.

Best,
Erick

> On Sep 27, 2019, at 2:24 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Check the log files on the collection reload.
> About your regex: check a web page that checks Java regexes - there can be 
> subtle differences between Java, JavaScript, php etc.
> Then it could be that your original text is not UTF-8 encoded, but Windows or 
> similar.
> Check also if you have special characters in the text (line breaks, tabs 
> etc.).
>
>> Am 27.09.2019 um 16:42 schrieb Webster Homer 
>> <webster.ho...@milliporesigma.com>:
>>
>> I forgot to mention that I'm using Solr 7.2. I also found that if I
>> use the long form \p{Letter} instead of \p{L}, Solr will not load the
>> collection when I reload it after updating the schema. I think that
>> Solr's regex support is not standard Java 8.
>>
>> -----Original Message-----
>> From: Webster Homer <webster.ho...@milliporesigma.com>
>> Sent: Friday, September 27, 2019 9:09 AM
>> To: solr-user@lucene.apache.org
>> Subject: Strange regex behavior in
>> solr.PatternReplaceCharFilterFactory
>>
>> I am developing a new version of a fieldtype that we’ve been using for 
>> several years. This fieldtype is to be used for autocomplete. The original 
>> version handled standard ASCII characters well, but I wanted it to handle 
>> any Unicode letter: not just A-Za-z but Greek and Chinese as well. The 
>> analysis chain is supposed to remove any character that is not a letter, 
>> digit or space.
>> I settled on this fieldType. The main change from the old version is that I 
>> moved the character removal from a PatternReplaceFilterFactory call to a 
>> PatternReplaceCharFilterFactory. The problem I’m seeing is in how the two 
>> filter factories handle this regex:
>> ([^\p{L}\p{M}\p{Digit} ])
>> Here is the fieldtype
>>  <fieldType name="autocomplete_edge_v2" class="solr.TextField" 
>> positionIncrementGap="100">
>>     <analyzer type="index">
>>        <charFilter class="solr.MappingCharFilterFactory" 
>> mapping="mapping-ISOLatin1Accent.txt"/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([\.,;:-_])" replacement=" "/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
>> words="lang/stopwords_en.txt"/>
>>         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" 
>> minGramSize="1"/>
>>      </analyzer>
>>     <analyzer type="query">
>>        <charFilter class="solr.MappingCharFilterFactory" 
>> mapping="mapping-ISOLatin1Accent.txt"/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([\.,;:-_])" replacement=" "/>
>>        <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
>> words="lang/stopwords_en.txt"/>
>>        <filter class="solr.PatternReplaceFilterFactory" 
>> pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>>    </analyzer>
>>   </fieldType>
>>
>> The problem I’m seeing is that the call:
>>        <charFilter class="solr.PatternReplaceCharFilterFactory"
>> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>>
>> strips out letters that match A-Z. It leaves digits, lowercase
>> letters and Chinese characters. I tested my regex with a couple of
>> online regex testers and it works there. It seems that only the
>> solr.PatternReplaceCharFilterFactory has this behavior. Here is what
>> I see in the Analyzer using this test term: 12水3-23-ER1:abc
>> After the PatternReplaceCharFilterFactory I see this: 12水323 1 abc
>> The “ER” is removed. I think this is a bug, or am I doing something
>> wrong?
>> I used this link as the source for my regex:
>> https://www.regular-expressions.info/unicode.html
>> It seems that Solr treats \p{L} as matching only lowercase ASCII 
>> letters, while handling other Unicode characters correctly. For letters in 
>> the A-Z range it behaves as if the regex were \p{Ll}. I tried explicitly 
>> adding \p{Lu} and it made no difference: capital letters were still 
>> stripped.
>>
>> This message and any attachment are confidential and may be privileged or 
>> otherwise protected from disclosure. If you are not the intended recipient, 
>> you must not copy this message or attachment or disclose the contents to any 
>> other person. If you have received this transmission in error, please notify 
>> the sender immediately and delete the message and any attachment from your 
>> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
>> accept liability for any omissions or errors in this message which may arise 
>> as a result of E-Mail-transmission or for damages resulting from any 
>> unauthorized changes of the content of this message and any attachment 
>> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
>> guarantee that this message is free of viruses and does not accept liability 
>> for any damages caused by any virus transmitted therewith. Click 
>> http://www.merckgroup.com/disclaimer to access the German, French, Spanish 
>> and Portuguese versions of this disclaimer.

