Hi,

We have tried the pattern ([ \t]*\r?\n){2,} with the following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \t]*\r?\n){2,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

However, the issue is still occurring. Is anyone else able to help?

Regards,
Edwin

On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi,
>
> For your info, this issue is occurring in Solr 7.7.0 as well.
>
> Regards,
> Edwin
>
> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi,
>>
>> Should we report this as a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on
>>> https://regex101.com/, it gives the correct result for all the examples
>>> (i.e. each of them ends up with only one <br><br>, and not more than
>>> that, unlike what we are getting in Solr in our earlier examples).
>>>
>>> Could there be a bug in Solr?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> We have tried it with the space preceding the \n, i.e.
>>>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>   <str name="fieldName">content</str>
>>>>   <str name="pattern">(\s*\n){2,}</str>
>>>>   <str name="replacement"><br><br></str>
>>>> </processor>
>>>>
>>>> However, we are getting exactly the same results as in the earlier
>>>> Examples 1, 2 and 3.
>>>>
>>>> As for your point 2, that the data may contain other (non-printing)
>>>> characters than \n: we have found no non-printing characters. It is
>>>> just a new line with a space. You can refer to the original content
>>>> in the same examples below.
>>>>
>>>> Example 1: A sentence for which the above regex pattern is working
>>>> correctly
>>>> *Original content in EML file:*
>>>> Dear Sir,
>>>>
>>>>
>>>> I am terminating
>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>
>>>> Example 2: A sentence for which the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> *exalted*
>>>>
>>>> *Psalm 89:17*
>>>>
>>>>
>>>> 3 Choa Chu Kang Avenue 4
>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: A sentence for which the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> http://www.concordpri.moe.edu.sg/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18,
>>>> 2018 at 10:07 AM
>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>> We would appreciate any other ideas or suggestions that you may have.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>>>>
>>>>> Hi Edwin
>>>>>
>>>>> 1. Sorry, the pattern was wrong, the space should precede the \n,
>>>>>    i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>> 2. Perhaps in the data you have other (non-printing) characters
>>>>>    than \n?
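One way to double-check point 2 is to dump the code points of the raw field value before it reaches the regex processor. The sketch below is only an illustration: `WhitespaceInspector` and `findNonAsciiBlanks` are made-up names, not part of Solr. It lists every character that looks blank or invisible but falls outside the set that Java's default `\s` matches (space, \t, \n, \x0B, \f, \r), for example a non-breaking space U+00A0.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceInspector {

    /**
     * Returns "U+XXXX" labels for every code point that is blank or invisible
     * but is NOT matched by java.util.regex's default \s class.
     */
    static List<String> findNonAsciiBlanks(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            // The six characters that the default \s matches.
            boolean defaultRegexWhitespace = cp == ' ' || cp == '\t' || cp == '\n'
                    || cp == 0x0B || cp == '\f' || cp == '\r';
            // Anything else that renders as blank or nothing at all.
            boolean looksBlank = Character.isWhitespace(cp)
                    || Character.isSpaceChar(cp)
                    || Character.isISOControl(cp)
                    || Character.getType(cp) == Character.FORMAT;
            if (looksBlank && !defaultRegexWhitespace) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a non-breaking space sits between the newline groups.
        String sample = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa Chu Kang Avenue 4";
        System.out.println(findNonAsciiBlanks(sample)); // prints [U+00A0]
    }
}
```

If this returns an empty list for the real field value, the non-printing-character theory can be ruled out with more confidence than by inspecting the text by eye.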
>>>>>
>>>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> We have tried the suggested regex pattern as follows:
>>>>>
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>>   <str name="replacement"><br><br></str>
>>>>> </processor>
>>>>>
>>>>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>>>>>
>>>>> Example 1: A sentence for which the above regex pattern is working
>>>>> correctly
>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>>>
>>>>> Example 2: A sentence for which the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>
>>>>> Example 3: A sentence for which the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n
>>>>> \n \n\n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18,
>>>>> 2018 at 10:07 AM
>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>
>>>>> Any further suggestions?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Edwin
>>>>>
>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>>>>
>>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>> > {2,} part, you could try
>>>>> >
>>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>>> >
>>>>> > If you also want to match CRLF, then
>>>>> >
>>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>>> >
>>>>> > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>>> > Sent: Thursday, 7 February 2019 15:10
>>>>> > To: solr-user@lucene.apache.org
>>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> >
>>>>> > Hi Paul,
>>>>> >
>>>>> > Thanks for your reply.
>>>>> >
>>>>> > When I use this pattern:
>>>>> >
>>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>>> >   <str name="fieldName">content</str>
>>>>> >   <str name="pattern">(\n+\s*){2,}</str>
>>>>> >   <str name="replacement"><br><br></str>
>>>>> > </processor>
>>>>> >
>>>>> > It works for some sentences within the same content and not for
>>>>> > others.
>>>>> > Please see below for one that is working and another that is
>>>>> > not working (partially working):
>>>>> >
>>>>> > Example 1: A sentence for which the above regex pattern is working
>>>>> > correctly
>>>>> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>>>> > *Index content:* Dear Sir, <br><br>I am terminating
>>>>> >
>>>>> > Example 2: A sentence for which the above regex pattern is partially
>>>>> > working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3
>>>>> > Choa Chu Kang Avenue 4, Singapore
>>>>> > *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3
>>>>> > Choa Chu Kang Avenue 4, Singapore
>>>>> >
>>>>> > Example 3: A sentence for which the above regex pattern is partially
>>>>> > working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n
>>>>> > \n
>>>>> > \n\n
>>>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec
>>>>> > 18, 2018 at 10:07 AM
>>>>> > *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>>>>> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>> >
>>>>> > We would appreciate your help in seeing what is wrong.
>>>>> >
>>>>> > Thank you.
>>>>> >
>>>>> > Regards,
>>>>> > Edwin
>>>>> >
>>>>> > On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>>>> >
>>>>> > > You don’t say what happens, just that it is not working. I assume
>>>>> > > nothing is replaced? Perhaps the pattern should be
>>>>> > >
>>>>> > > <str name="pattern">"(\n\s*){2,}"</str>
>>>>> > >
>>>>> > > ??
>>>>> > >
>>>>> > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>>> > > To: solr-user@lucene.apache.org
>>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> > >
>>>>> > > Hi,
>>>>> > >
>>>>> > > I am trying to use the RegexReplaceProcessorFactory to match runs
>>>>> > > of two or more \n with any number of spaces between them
>>>>> > > (e.g. \n\n, \n \n, \n \n \n \n), and replace each run with two <br>.
>>>>> > >
>>>>> > > I use the following regex pattern, and it works when I test it on
>>>>> > > regex101.com. But it does not work when I put it inside the
>>>>> > > RegexReplaceProcessorFactory as below:
>>>>> > >
>>>>> > > <updateRequestProcessorChain name="removeCode">
>>>>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>>>>> > >     <str name="fieldName">content</str>
>>>>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>>>>> > >     <str name="replacement"><br><br></str>
>>>>> > >   </processor>
>>>>> > > </updateRequestProcessorChain>
>>>>> > >
>>>>> > > To explain my regex pattern further: \s* tells the regex to match
>>>>> > > any whitespace that follows a \n, and {2,} tells it to match 2 or
>>>>> > > more occurrences of that group (\n plus trailing whitespace).
>>>>> > >
>>>>> > > Please kindly let me know what is wrong and how I should do it.
>>>>> > >
>>>>> > > I am using Solr 7.6.0.
>>>>> > >
>>>>> > > Regards,
>>>>> > > Edwin
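
The patterns in this thread can also be exercised outside Solr with plain java.util.regex, which may help separate a regex problem from a data problem. The sketch below is not Solr code: `BlankLineCollapser` and `collapse` are names made up for illustration, and the non-breaking space (U+00A0) in the second sample is purely a hypothesis, not something confirmed in the thread. It shows that `(\n\s*){2,}` collapses Example 1 as expected, and that a single character outside Java's default `\s` would produce the doubled `<br><br> <br><br>` output reported for Examples 2 and 3. The embedded `(?U)` flag (UNICODE_CHARACTER_CLASS) is standard java.util.regex; whether adding it to the Solr pattern resolves the issue here is untested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlankLineCollapser {

    // Pattern from the thread: by default \s matches only [ \t\n\x0B\f\r].
    static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");

    // Same pattern with (?U), which makes \s match Unicode white space
    // such as U+00A0 (non-breaking space).
    static final Pattern UNICODE_WS = Pattern.compile("(?U)(\\n\\s*){2,}");

    /** Replaces every match with a literal <br><br>. */
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 1 from the thread, plain spaces and newlines only.
        String example1 = "Dear Sir, \n\n \n \n\n I am terminating";
        // Hypothetical variant of Example 2 with a non-breaking space
        // between the two newline groups.
        String withNbsp = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa Chu Kang";

        System.out.println(collapse(ASCII_WS, example1));   // one <br><br>
        System.out.println(collapse(ASCII_WS, withNbsp));   // two <br><br>, NBSP left between them
        System.out.println(collapse(UNICODE_WS, withNbsp)); // one <br><br>
    }
}
```

If the real field value behaves like the second sample, comparing it against the output of a code-point dump would show which character is splitting the match.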