Solr uses Java regex matching, so I doubt there is a bug in Solr; if there were one, it would be in the JDK. Try your pattern in an online regex tool that supports the Java regex flavor.
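If no suitable online tool is at hand, the pattern can also be checked against java.util.regex directly. The following is only a minimal sketch: the class and method names are made up for illustration, it runs outside Solr, and it assumes the processor applies the pattern in the same way as Matcher.replaceAll. The second input, which contains a non-breaking space (U+00A0), is a hypothetical string and not data from this thread; it is included only because neither [ \t] nor Java's default \s matches that character.

```java
import java.util.regex.Pattern;

class NewlineCollapseCheck {
    // Pattern quoted in the thread: two or more line ends, each optionally
    // preceded by spaces or tabs.
    static final Pattern BLANK_RUN = Pattern.compile("([ \\t]*\\r?\\n){2,}");

    // Replaces every run matched by BLANK_RUN with exactly two <br> tags.
    // "<br><br>" contains no '$' or '\', so replaceAll inserts it as-is.
    static String collapse(String text) {
        return BLANK_RUN.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // "Original content" of Example 2, with real newline characters.
        String example2 = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4";
        System.out.println(collapse(example2));

        // Hypothetical input (an assumption, not taken from the thread):
        // a non-breaking space sits between two runs of newlines.
        String withNbsp = "a\n\n\u00A0\n\nb";
        System.out.println(collapse(withNbsp));
    }
}
```

With plain spaces, each run of blank lines in Example 2 collapses to a single <br><br>. In the hypothetical second input the run is split in two, because the non-breaking space is not matched, and the output then has two <br><br> pairs with a character between them.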
I believe you want two regex processor factories: one that handles a single \n and one that handles more than one \n.

> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> Hi,
>
> We have tried the following pattern, ([ \t]*\r?\n){2,}, and configuration:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the issue is still occurring.
>
> Is anyone else able to help?
>
> Regards,
> Edwin
>
> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi,
>>
>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>
>> Regards,
>> Edwin
>>
>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Should we report this as a bug in Solr?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on https://regex101.com/, it gives the correct result for all the examples (i.e. all of them have only <br><br>, and not more than that, unlike what we are getting in Solr in our earlier examples).
>>>>
>>>> Could there be a bug in Solr?
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>>
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">(\s*\n){2,}</str>
>>>>>   <str name="replacement"><br><br></str>
>>>>> </processor>
>>>>>
>>>>> However, we are getting exactly the same results as in the earlier Examples 1, 2 and 3.
>>>>>
>>>>> As for your point 2, that the data may contain other (non-printing) characters besides \n: we have found no non-printing characters. It is just a new line with a space. You can refer to the original content in the same examples below.
>>>>>
>>>>>
>>>>> Example 1: The sentence for which the above regex pattern works correctly
>>>>> *Original content in EML file:*
>>>>> Dear Sir,
>>>>>
>>>>>
>>>>> I am terminating
>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>>
>>>>> Example 2: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>>
>>>>> *exalted*
>>>>>
>>>>> *Psalm 89:17*
>>>>>
>>>>>
>>>>> 3 Choa Chu Kang Avenue 4
>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>
>>>>> Example 3: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>>
>>>>> http://www.concordpri.moe.edu.sg/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>
>>>>>
>>>>> We would appreciate any other ideas or suggestions that you may have.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Edwin
>>>>>
>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>>>>>>
>>>>>> Hi Edwin
>>>>>>
>>>>>> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>> 2. Perhaps the data contains other (non-printing) characters besides \n?
>>>>>>
>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>>>>>
>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> We have tried the suggested regex pattern as follows:
>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>   <str name="fieldName">content</str>
>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>>>   <str name="replacement"><br><br></str>
>>>>>> </processor>
>>>>>>
>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>>>>>
>>>>>> Example 1: The sentence for which the above regex pattern works correctly
>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>>>
>>>>>> Example 2: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>>
>>>>>> Example 3: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>
>>>>>> Any further suggestions?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards,
>>>>>> Edwin
>>>>>>
>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>>>>>>
>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part, you could try
>>>>>>>
>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>>
>>>>>>> If you also want to match CRLF, then
>>>>>>>
>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>>
>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>>>>>>
>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> When I use this pattern:
>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>   <str name="fieldName">content</str>
>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>   <str name="replacement"><br><br></str>
>>>>>>> </processor>
>>>>>>>
>>>>>>> It works for some sentences within the same content and not for others. Please see below for one that works and others that work only partially:
>>>>>>>
>>>>>>> Example 1: The sentence for which the above regex pattern works correctly
>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>>>>
>>>>>>> Example 2: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>
>>>>>>> Example 3: The sentence for which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>
>>>>>>> We would appreciate your help in finding out what is wrong.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>>
>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>>>>>>>
>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing is replaced? Perhaps the pattern should be
>>>>>>>>
>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>>
>>>>>>>> ??
>>>>>>>>
>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>>>>>>>
>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to replace two or more \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n) with two <br>.
>>>>>>>>
>>>>>>>> I use the following regex pattern, and it works when I test it on regex101.com. But it does not work when I put it inside the RegexReplaceProcessorFactory as below:
>>>>>>>>
>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>     <str name="fieldName">content</str>
>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>     <str name="replacement"><br><br></str>
>>>>>>>>   </processor>
>>>>>>>> </updateRequestProcessorChain>
>>>>>>>>
>>>>>>>> To explain my regex pattern further: \s* tells the regex to match any \n that is followed by whitespace, and {2,} tells it to match 2 or more occurrences of that pattern (\n).
>>>>>>>>
>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>>>>>>>
>>>>>>>> I am using Solr 7.6.0.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Edwin
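
A note on the very first pattern in the thread, "(\\n\s*){2,}", which was written with surrounding double quotes and a doubled backslash. Assuming the text between the <str> tags reaches java.util.regex.Pattern unchanged, the quotes become literal characters to match and \\n means a literal backslash followed by the letter n, not a newline. The sketch below illustrates this outside Solr; the class and method names are hypothetical, and the Java string literals carry an extra level of escaping that the config file does not need.

```java
import java.util.regex.Pattern;

class ConfigEscapingCheck {
    // True if the regex, given exactly as typed between the <str> tags,
    // finds a match anywhere in the text.
    static boolean finds(String regexAsTyped, String text) {
        return Pattern.compile(regexAsTyped).matcher(text).find();
    }

    public static void main(String[] args) {
        // Java string literals need their own escaping; the trailing
        // comments show the raw regex text as it would appear in the config.
        String quoted = "\"(\\\\n\\s*){2,}\"";   // raw: "(\\n\s*){2,}"
        String plain = "(\\n\\s*){2,}";          // raw: (\n\s*){2,}

        String realNewlines = "Dear Sir,\n \nI am terminating";
        String literalText = "\"\\n\\n\"";       // raw: "\n\n" as six visible characters

        System.out.println(finds(quoted, realNewlines)); // false
        System.out.println(finds(plain, realNewlines));  // true
        System.out.println(finds(quoted, literalText));  // true
    }
}
```

This would be consistent with the first attempt replacing nothing, while the later unquoted patterns replaced at least some runs.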