Re: Strange regex behavior in solr.PatternReplaceCharFilterFactory

Jörn Franke Fri, 27 Sep 2019 11:26:07 -0700

Check the log files written during the collection reload.
About your regex: test it with an online checker that uses the Java regex 
engine, since there can be subtle differences between Java, JavaScript, PHP, etc.
It could also be that your original text is not UTF-8 encoded, but uses a 
Windows encoding or similar.
Also check whether the text contains special characters (line breaks, tabs, etc.).
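
For a quick offline check, the patterns can also be run through plain 
java.util.regex, which, as far as I know, is what the Solr pattern filters 
use. The sketch below is only an illustration, not Solr's own code: the class 
and method names are made up, and it assumes the char filters behave like 
Matcher.replaceAll. It applies the two patterns from the quoted fieldType to 
the test term. It also tries a hypothetical variant of the first pattern with 
the hyphen moved to the end of the character class, because inside [...] the 
sequence :-_ is read as a range from ':' to '_', and that range contains A-Z.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CharFilterRegexCheck {
    // Patterns copied from the fieldType in the quoted message.
    static final String PUNCT_TO_SPACE = "([\\.,;:-_])";
    static final String STRIP_NON_LETTERS = "([^\\p{L}\\p{M}\\p{Digit} ])";
    // Hypothetical variant (not from the original): a trailing hyphen is a literal.
    static final String PUNCT_TO_SPACE_LITERAL_HYPHEN = "([\\.,;:_-])";

    // Applies the punctuation pattern (replace with a space), then the
    // strip pattern (replace with nothing), in the same order as the charFilters.
    static String run(String input, String punctPattern) {
        String spaced = Pattern.compile(punctPattern).matcher(input).replaceAll(" ");
        return Pattern.compile(STRIP_NON_LETTERS).matcher(spaced).replaceAll("");
    }

    // Returns true if the Java regex engine accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \u6c34 is the Chinese character from the test term, escaped so the
        // source file does not depend on the compiler's default encoding.
        String term = "12\u6c343-23-ER1:abc";
        System.out.println("[" + run(term, PUNCT_TO_SPACE) + "]");
        System.out.println("[" + run(term, PUNCT_TO_SPACE_LITERAL_HYPHEN) + "]");
        System.out.println("\\p{L} compiles: " + compiles("\\p{L}"));
        System.out.println("\\p{Letter} compiles: " + compiles("\\p{Letter}"));
    }
}
```

If the first result shows spaces where "ER" used to be, the uppercase letters 
are being consumed by the first char filter rather than by \p{L}. In that case 
escaping the hyphen, or putting it last in the class, would be worth trying.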


> On 27.09.2019 at 16:42, Webster Homer 
> <[email protected]> wrote:
> 
> I forgot to mention that I'm using Solr 7.2. I also found that if I use the 
> long form \p{Letter} instead of \p{L}, Solr will not load the collection 
> when I reload it after updating the schema. I think that Solr's regex 
> support is not standard Java 8.
> 
> -----Original Message-----
> From: Webster Homer <[email protected]>
> Sent: Friday, September 27, 2019 9:09 AM
> To: [email protected]
> Subject: Strange regex behavior in solr.PatternReplaceCharFilterFactory
> 
> I am developing a new version of a fieldType that we’ve been using for 
> several years. This fieldType is used as part of our autocomplete code. The 
> original version handled standard ASCII characters well, but I wanted it to 
> handle any Unicode letter: not just A-Za-z, but Greek and Chinese as well. 
> The analysis chain is supposed to remove any character that is not a letter, 
> digit, or space.
> I settled on the fieldType below. The main change from the old version is 
> that I moved the character removal from a PatternReplaceFilterFactory call 
> to a PatternReplaceCharFilterFactory. The problem I’m seeing is in how the 
> two filter factories handle this regex:
> ([^\p{L}\p{M}\p{Digit} ])
> Here is the fieldtype
>   <fieldType name="autocomplete_edge_v2" class="solr.TextField" 
> positionIncrementGap="100">
>      <analyzer type="index">
>         <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping-ISOLatin1Accent.txt"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([\.,;:-_])" replacement=" "/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
> words="lang/stopwords_en.txt"/>
>          <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" 
> minGramSize="1"/>
>       </analyzer>
>      <analyzer type="query">
>         <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping-ISOLatin1Accent.txt"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([\.,;:-_])" replacement=" "/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" 
> words="lang/stopwords_en.txt"/>
>         <filter class="solr.PatternReplaceFilterFactory" 
> pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
>     </analyzer>
>    </fieldType>
> 
> The problem I’m seeing is that the call:
>         <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([^\p{L}\p{M}\p{Digit} ])" replacement="" />
> 
> strips out letters that match A-Z. It leaves digits, lowercase letters, 
> and Chinese characters. I tested my regex with a couple of online regex 
> testers and it works there. It seems that only 
> solr.PatternReplaceCharFilterFactory has this behavior. Here is what I see in 
> the Analyzer using this test term: 12水3-23-ER1:abc
> After the PRCF I see this: 12水323 1 abc
> The “ER” is removed. I think this is a bug, or am I doing something wrong?
> I used this link as the source for my regex: 
> https://www.regular-expressions.info/unicode.html
> It seems that Solr treats \p{L} as matching only lowercase ASCII letters, 
> while handling other Unicode characters correctly. For letters in the A-Z 
> range it behaves as if the regex were \p{Ll}. I tried explicitly adding 
> \p{Lu} and it made no difference; capital letters were still stripped.
> 
> This message and any attachment are confidential and may be privileged or 
> otherwise protected from disclosure. If you are not the intended recipient, 
> you must not copy this message or attachment or disclose the contents to any 
> other person. If you have received this transmission in error, please notify 
> the sender immediately and delete the message and any attachment from your 
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> accept liability for any omissions or errors in this message which may arise 
> as a result of E-Mail-transmission or for damages resulting from any 
> unauthorized changes of the content of this message and any attachment 
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> guarantee that this message is free of viruses and does not accept liability 
> for any damages caused by any virus transmitted therewith. Click 
> http://www.merckgroup.com/disclaimer to access the German, French, Spanish 
> and Portuguese versions of this disclaimer.
