Hi,

Thanks for the reply.

Do you know of any online regex tool that works correctly for Java regex? I tried to find some, but they do not work properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.

Regards,
Edwin

On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:

> Solr uses Java regex matching, so I doubt there is a bug - it would then
> be in the JDK. Try your solution out in an online regex tool that
> supports Java regex.
>
> I believe you want to have 2 regex processor factories:
> one that deals with a single \n and one that deals with more than one \n.
>
> > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > Hi,
> >
> > We have tried the following pattern ([ \t]*\r?\n){2,} and
> > configuration:
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">([ \t]*\r?\n){2,}</str>
> >   <str name="replacement"><br><br></str>
> >   <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the issue is still occurring.
> >
> > Is anyone else able to help?
> >
> > Regards,
> > Edwin
> >
> > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Should we report this as a bug in Solr?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Paul,
> >>>>
> >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>> https://regex101.com/, it gives us the correct result for all the
> >>>> examples (i.e. all of them have only <br><br>, and not more than
> >>>> that, unlike what we are getting in Solr in our earlier examples).
> >>>>
> >>>> Could there be a bug in Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> >>>>> We have tried it with the space preceding the \n, i.e. <str
> >>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>
> >>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>   <str name="fieldName">content</str>
> >>>>>   <str name="pattern">(\s*\n){2,}</str>
> >>>>>   <str name="replacement"><br><br></str>
> >>>>> </processor>
> >>>>>
> >>>>> However, we are getting exactly the same results as in the earlier
> >>>>> Examples 1, 2 and 3.
> >>>>>
> >>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>> characters than \n, we have found that there are no non-printing
> >>>>> characters. It is just a new line with a space. You can refer to the
> >>>>> original content in the same examples below.
> >>>>>
> >>>>> Example 1: A sentence for which the above regex pattern works
> >>>>> correctly
> >>>>> *Original content in EML file:*
> >>>>> Dear Sir,
> >>>>>
> >>>>>
> >>>>> I am terminating
> >>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> >>>>> *Index content:* Dear Sir, <br><br>I am terminating
> >>>>>
> >>>>> Example 2: A sentence for which the above regex pattern works only
> >>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> *exalted*
> >>>>>
> >>>>> *Psalm 89:17*
> >>>>>
> >>>>>
> >>>>> 3 Choa Chu Kang Avenue 4
> >>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>
> >>>>> Example 3: A sentence for which the above regex pattern works only
> >>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> http://www.concordpri.moe.edu.sg/
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>
> >>>>>
> >>>>> I would appreciate any other ideas or suggestions that you may have.
> >>>>>
> >>>>> Thank you.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>
> >>>>>> Hi Edwin
> >>>>>>
> >>>>>> 1. Sorry, the pattern was wrong, the space should precede the \n,
> >>>>>>    i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>> 2. Perhaps in the data you have other (non-printing) characters
> >>>>>>    than \n?
> >>>>>>
> >>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>> Windows 10
> >>>>>>
> >>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> We have tried the suggested regex pattern as follows:
> >>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>   <str name="fieldName">content</str>
> >>>>>>   <str name="pattern">(\n\s*){2,}</str>
> >>>>>>   <str name="replacement"><br><br></str>
> >>>>>> </processor>
> >>>>>>
> >>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> >>>>>> below.
> >>>>>>
> >>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>> correctly
> >>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> >>>>>> *Index content:* Dear Sir, <br><br>I am terminating
> >>>>>>
> >>>>>> Example 2: A sentence for which the above regex pattern works only
> >>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>
> >>>>>> Example 3: A sentence for which the above regex pattern works only
> >>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>
> >>>>>> Any further suggestions?
> >>>>>>
> >>>>>> Thank you.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>>
> >>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>>>>>> {2,} part, you could try
> >>>>>>>
> >>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>
> >>>>>>> If you also want to match CRLF, then
> >>>>>>>
> >>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>
> >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>> Windows 10
> >>>>>>>
> >>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> Thanks for your reply.
> >>>>>>>
> >>>>>>> When I use this pattern:
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>   <str name="fieldName">content</str>
> >>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>   <str name="replacement"><br><br></str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> It works for some sentences within the same content and not for
> >>>>>>> others.
> >>>>>>> Please see below for one that works and others that work only
> >>>>>>> partially:
> >>>>>>>
> >>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>> correctly
> >>>>>>> *Original content:* Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: A sentence for which the above regex pattern works only
> >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: A sentence for which the above regex pattern works only
> >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>> We would appreciate your help in finding out what is wrong.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> You don't say what happens, just that it is not working. I assume
> >>>>>>>> nothing is replaced? Perhaps the pattern should be
> >>>>>>>>
> >>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>
> >>>>>>>> ??
> >>>>>>>>
> >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>> Windows 10
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two
> >>>>>>>> or more \n with any number of spaces between them (e.g. \n\n,
> >>>>>>>> \n \n, \n \n \n \n), and replace them with two <br>.
> >>>>>>>>
> >>>>>>>> I use the following regex pattern, and it works when I test it on
> >>>>>>>> regex101.com. But it does not work when I put it inside the
> >>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>
> >>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>     <str name="fieldName">content</str>
> >>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>     <str name="replacement"><br><br></str>
> >>>>>>>>   </processor>
> >>>>>>>> </updateRequestProcessorChain>
> >>>>>>>>
> >>>>>>>> To explain my regex pattern further: \s* tells the regex to match
> >>>>>>>> any \n that is followed by whitespace, and {2,} tells the regex to
> >>>>>>>> match 2 or more occurrences of that pattern (\n).
> >>>>>>>>
> >>>>>>>> Please kindly let me know what is wrong and how I should do it.
> >>>>>>>>
> >>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
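
As an alternative to an online tester, the patterns discussed in this thread can be tried directly against java.util.regex, which is what Solr uses for matching. The sketch below is only an illustration and not Solr code: the class name NewlineToBr and the methods convert and describe are hypothetical, and the sample strings are hand-typed approximations of the examples above, so their whitespace may differ from the real EML content. It applies the ([ \t]*\r?\n){2,} pattern first and a single-line-break pattern second, mirroring the suggestion of two regex processors, and it dumps code points as one possible way to look for non-printing characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting locally; not part of Solr.
final class NewlineToBr {
    // Two or more line breaks, each optionally preceded by spaces or tabs
    // (the pattern quoted in the thread).
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Exactly one remaining line break, optionally preceded by spaces or tabs
    // (an assumed pattern for the "single \n" case).
    static final Pattern SINGLE = Pattern.compile("[ \\t]*\\r?\\n");

    // Pass 1 collapses runs of line breaks into <br><br>;
    // pass 2 turns each remaining single line break into <br>.
    static String convert(String text) {
        String collapsed = MULTI.matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
        return SINGLE.matcher(collapsed)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    // Lists every code point as U+XXXX so that characters which look like
    // a space but are not U+0020 become visible.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hand-typed approximations of Examples 1 and 2 from the thread.
        String example1 = "Dear Sir, \n\n \n \n\n I am terminating";
        String example2 = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4";

        System.out.println(convert(example1)); // Dear Sir,<br><br> I am terminating
        System.out.println(convert(example2)); // exalted<br><br> Psalm 89:17<br><br> 3 Choa Chu Kang Avenue 4
        System.out.println(convert("line one \nline two")); // line one<br>line two

        // Diagnostic: a no-break space between two line breaks.
        System.out.println(describe("\n\u00A0\n")); // U+000A U+00A0 U+000A
    }
}
```

If the dump of the real field content shows anything other than U+0020, U+0009, U+000D and U+000A between the words, the character class in the pattern would need to cover it as well. Note that Java's \s does not match U+00A0 (no-break space) by default. Whether that applies to the data in this thread is not established here.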