Hi Paul,

We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, using the following configuration:
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\s*\n){2,}</str>
  <str name="replacement"><br><br></str>
</processor>

However, we are getting exactly the same results as in the earlier Examples 1, 2 and 3.

As for your point 2, that the data may contain other (non-printing) characters than \n: we have found no non-printing characters. It is just a new line with a space. You can refer to the original content in the same examples below.

Example 1: A passage where the above regex pattern works correctly
*Original content in EML file:* Dear Sir, I am terminating
*Original content:* Dear Sir, \n\n \n \n\n I am terminating
*Index content:* Dear Sir, <br><br>I am terminating

Example 2: A passage where the above regex pattern only partially works (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:* *exalted* *Psalm 89:17* 3 Choa Chu Kang Avenue 4
*Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
*Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore

Example 3: A passage where the above regex pattern only partially works (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:* http://www.concordpri.moe.edu.sg/ On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
*Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM

We would appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:

> Hi Edwin
>
> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e.
>    <str name="pattern">(\s*\n){2,}</str>
> 2. Perhaps in the data you have other (non-printing) characters than \n?
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> Sent: Thursday, 7 February 2019 15:23
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
> Hi Paul,
>
> We have tried the suggested regex pattern as follows:
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\n\s*){2,}</str>
>   <str name="replacement"><br><br></str>
> </processor>
>
> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>
> Example 1: A passage where the above regex pattern works correctly
> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> *Index content:* Dear Sir, <br><br>I am terminating
>
> Example 2: A passage where the above regex pattern only partially works
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
> Chu Kang Avenue 4, Singapore
> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
> Chu Kang Avenue 4, Singapore
>
> Example 3: A passage where the above regex pattern only partially works
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
> \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018
> at 10:07 AM
> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On
> Tue, Dec 18, 2018 at 10:07 AM
>
> Any further suggestions?
>
> Thank you.
>
> Regards,
> Edwin
>
> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>
> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
> > part, you could try
> >
> > <str name="pattern">(\n\s*){2,}</str>
> >
> > If you also want to match CRLF, then
> >
> > <str name="pattern">(\r?\n\s*){2,}</str>
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >
> > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> > Sent: Thursday, 7 February 2019 15:10
> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >
> > Hi Paul,
> >
> > Thanks for your reply.
> >
> > When I use this pattern:
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">(\n+\s*){2,}</str>
> >   <str name="replacement"><br><br></str>
> > </processor>
> >
> > It works for some passages within the same content and not for others.
> > Please see below for one that works and another that only partially
> > works:
> >
> > Example 1: A passage where the above regex pattern works correctly
> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> > *Index content:* Dear Sir, <br><br>I am terminating
> >
> > Example 2: A passage where the above regex pattern only partially works
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
> > Chu Kang Avenue 4, Singapore
> >
> > Example 3: A passage where the above regex pattern only partially works
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
> > \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018
> > at 10:07 AM
> > *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On
> > Tue, Dec 18, 2018 at 10:07 AM
> >
> > We would appreciate your help in seeing what is wrong.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> > On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
> >
> > > You don’t say what happens, just that it is not working. I assume
> > > nothing is replaced? Perhaps the pattern should be
> > >
> > > <str name="pattern">"(\n\s*){2,}"</str>
> > >
> > > ??
> > >
> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> > >
> > > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> > > Sent: Thursday, 7 February 2019 14:08
> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >
> > > Hi,
> > >
> > > I am trying to use the RegexReplaceProcessorFactory to find two or
> > > more \n with any number of spaces between them (e.g. \n\n, \n \n,
> > > \n \n \n \n) and replace them with two <br>.
> > >
> > > I use the following regex pattern, and it works when I test it on
> > > regex101.com. But it does not work when I put it inside the
> > > RegexReplaceProcessorFactory as below:
> > >
> > > <updateRequestProcessorChain name="removeCode">
> > >   <processor class="solr.RegexReplaceProcessorFactory">
> > >     <str name="fieldName">content</str>
> > >     <str name="pattern">"(\\n\s*){2,}"</str>
> > >     <str name="replacement"><br><br></str>
> > >   </processor>
> > > </updateRequestProcessorChain>
> > >
> > > To explain my regex pattern: \s* matches any whitespace that follows
> > > a \n, and {2,} matches two or more occurrences of that group.
> > >
> > > Please let me know what is wrong and how I should do it.
> > >
> > > I am using Solr 7.6.0.
> > >
> > > Regards,
> > > Edwin
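
One way to test the non-printing-character idea from point 2 outside Solr is a small standalone Java program, since Solr compiles these patterns with java.util.regex. The sketch below is only an illustration under assumptions: the class name BlankLineDemo, the helpers collapse and suspects, and the sample strings are all made up, and the U+00A0 (no-break space) in the second sample is a hypothetical stand-in for any character that Java's default \s does not match. It applies the (\s*\n){2,} pattern with and without the (?U) flag, and lists the code points of characters outside printable ASCII.

```java
import java.util.regex.Pattern;

class BlankLineDemo {
    // Pattern from the thread: two or more "optional whitespace, then newline" units.
    static final Pattern PLAIN = Pattern.compile("(\\s*\\n){2,}");
    // Same pattern with (?U), i.e. UNICODE_CHARACTER_CLASS, so that \s also
    // matches Unicode white space such as U+00A0.
    static final Pattern UNICODE = Pattern.compile("(?U)(\\s*\\n){2,}");

    // Replaces every match of the given pattern with two <br> tags.
    static String collapse(Pattern p, String text) {
        return p.matcher(text).replaceAll("<br><br>");
    }

    // Lists the code points of all characters other than newline that fall
    // outside the printable ASCII range 0x20..0x7E.
    static String suspects(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp != '\n' && (cp < 0x20 || cp > 0x7E))
            .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical inputs modelled on Example 2; not the real field value.
        String ordinary = "Psalm 89:17 \n\n \n\n 3 Choa";
        String withNbsp = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa";

        System.out.println(collapse(PLAIN, ordinary));
        System.out.println(collapse(PLAIN, withNbsp));
        System.out.println(collapse(UNICODE, withNbsp));
        System.out.println(suspects(withNbsp));
    }
}
```

If suspects reports anything for the real field value, that character would split one whitespace run into two separate matches, which would give output of the shape <br><br> <br><br>. Whether that is what happens with the data in this thread is not established here.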