Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on https://regex101.com/, it gives the correct result for all the examples (i.e. each of them ends up with only <br><br>, and not more than that, unlike what we are getting in Solr in our earlier examples).
Could there be a possibility of a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Paul,
>
> We have tried it with the space preceding the \n, i.e.
> <str name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\s*\n){2,}</str>
>   <str name="replacement"><br><br></str>
> </processor>
>
> However, we are getting exactly the same results as in the earlier
> Examples 1, 2 and 3.
>
> As for your point 2, that the data may contain other (non-printing)
> characters than \n: we have found that there are no non-printing characters.
> It is just a new line with a space. You can refer to the original content in
> the same examples below.
>
>
> Example 1: A sentence on which the above regex pattern works correctly
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> *Index content: * Dear Sir, <br><br>I am terminating
>
> Example 2: A sentence on which the above regex pattern works only partially
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>
> Example 3: A sentence on which the above regex pattern works only partially
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> http://www.concordpri.moe.edu.sg/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>
>
> We would appreciate any other ideas or suggestions that you may have.
>
> Thank you.
>
> Regards,
> Edwin
>
> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e.
>>    <str name="pattern">(\s*\n){2,}</str>
>> 2. Perhaps in the data you have other (non-printing) characters than \n?
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> Sent: Thursday, 7 February 2019 15:23
>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi Paul,
>>
>> We have tried this suggested regex pattern as follows:
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(\n\s*){2,}</str>
>>   <str name="replacement"><br><br></str>
>> </processor>
>>
>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>>
>> Example 1: A sentence on which the above regex pattern works correctly
>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> *Index content: * Dear Sir, <br><br>I am terminating
>>
>> Example 2: A sentence on which the above regex pattern works only partially
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>
>> Example 3: A sentence on which the above regex pattern works only partially
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>
>> Any further suggestions?
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>
>> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>> > part you could try
>> >
>> > <str name="pattern">(\n\s*){2,}</str>
>> >
>> > If you also want to match CRLF then
>> >
>> > <str name="pattern">(\r?\n\s*){2,}</str>
>> >
>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > Windows 10
>> >
>> > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> > Sent: Thursday, 7 February 2019 15:10
>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >
>> > Hi Paul,
>> >
>> > Thanks for your reply.
>> >
>> > When I use this pattern:
>> > <processor class="solr.RegexReplaceProcessorFactory">
>> >   <str name="fieldName">content</str>
>> >   <str name="pattern">(\n+\s*){2,}</str>
>> >   <str name="replacement"><br><br></str>
>> > </processor>
>> >
>> > It is working for some sentences within the same content and not working
>> > for others. Please see below for one that is working and others that are
>> > not working (partially working):
>> >
>> > Example 1: A sentence on which the above regex pattern works correctly
>> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> > *Index content: * Dear Sir, <br><br>I am terminating
>> >
>> > Example 2: A sentence on which the above regex pattern works only partially
>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> > *Index content: *exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >
>> > Example 3: A sentence on which the above regex pattern works only partially
>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>> > *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> > *Index content: *http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >
>> > We would appreciate your help in seeing what is wrong.
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>> >
>> > > You don’t say what happens, just that it is not working. I assume nothing
>> > > is replaced? Perhaps the pattern should be
>> > >
>> > > <str name="pattern">"(\n\s*){2,}"</str>
>> > >
>> > > ??
>> > >
>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > > Windows 10
>> > >
>> > > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> > > Sent: Thursday, 7 February 2019 14:08
>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >
>> > > Hi,
>> > >
>> > > I am trying to use the RegexReplaceProcessorFactory to match two or more
>> > > \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n)
>> > > and replace them with two <br>.
>> > >
>> > > I use the following regex pattern and it is working when I test it in
>> > > regex101.com. But it is not working when I put it inside the
>> > > RegexReplaceProcessorFactory as below:
>> > >
>> > > <updateRequestProcessorChain name="removeCode">
>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>> > >     <str name="fieldName">content</str>
>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>> > >     <str name="replacement"><br><br></str>
>> > >   </processor>
>> > > </updateRequestProcessorChain>
>> > >
>> > > To explain my regex pattern further: \s* instructs the regex to match any
>> > > whitespace that follows a \n, and {2,} instructs it to match 2 or more
>> > > occurrences of that pattern (\n followed by optional whitespace).
>> > >
>> > > Please kindly let me know what is wrong and how I should do it.
>> > >
>> > > I am using Solr 7.6.0.
>> > >
>> > > Regards,
>> > > Edwin
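
One way to separate a Solr problem from a data problem is to run the pattern from this thread through java.util.regex directly. The sketch below assumes Solr compiles the pattern with Java's regex engine and default flags. The class name CollapseBlankLines and the method collapse are hypothetical helpers, not Solr code. The sample strings are adapted from Example 2, and the U+00A0 (no-break space) in the second one is purely an assumption for illustration: nothing in this thread confirms that such a character is present in the data. It is used because Java's \s does not match U+00A0 unless UNICODE_CHARACTER_CLASS is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone check of the thread's pattern with java.util.regex (hypothetical helper, not Solr code). */
final class CollapseBlankLines {

    // Pattern from the thread: a newline followed by optional whitespace, two or more times.
    static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");

    // Same pattern, but \s is widened to Unicode whitespace (which includes U+00A0).
    static final Pattern UNICODE_WS =
            Pattern.compile("(\\n\\s*){2,}", Pattern.UNICODE_CHARACTER_CLASS);

    static String collapse(Pattern pattern, String text) {
        // quoteReplacement keeps the replacement literal (no $1-style group references).
        return pattern.matcher(text).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 with real newlines and ordinary spaces only.
        String plain = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa";
        // Assumed variant: the space between the last two newline pairs is U+00A0.
        String withNbsp = "exalted \n \n\n Psalm 89:17 \n\n\u00A0\n\n 3 Choa";

        // Print with U+00A0 made visible so the outputs can be compared by eye.
        System.out.println(collapse(ASCII_WS, plain).replace("\u00A0", "[NBSP]"));
        System.out.println(collapse(ASCII_WS, withNbsp).replace("\u00A0", "[NBSP]"));
        System.out.println(collapse(UNICODE_WS, withNbsp).replace("\u00A0", "[NBSP]"));
    }
}
```

With ordinary spaces the whole whitespace run collapses into a single <br><br>. With the assumed U+00A0 in the middle, the default pattern splits the run into two matches and leaves the character between them, which has the same shape as the "<br><br> <br><br>" seen in the index content. Dumping the code points of the stored field value would show whether anything like this is actually in the data.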