Hi,

Should we report this as a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <[email protected]> wrote:

> Hi Paul,
>
> Regarding the regex (\n\s*){2,} that we are using: when we try it on
> https://regex101.com/, it gives the correct result for all the examples
> (i.e. each of them ends up with only <br><br>, and not more than that as
> we are getting from Solr in our earlier examples).
>
> Could there be a possibility of a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <[email protected]>
> wrote:
>
>> Hi Paul,
>>
>> We have tried it with the space preceding the \n, i.e.
>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(\s*\n){2,}</str>
>>   <str name="replacement"><br><br></str>
>> </processor>
>>
>> However, we are still getting exactly the same results as in the earlier
>> Examples 1, 2 and 3.
>>
>> As for your point 2, that the data may contain other (non-printing)
>> characters than \n: we have found no non-printing characters. It is
>> just a new line with a space. You can refer to the original content in
>> the same examples below.
>>
>> Example 1: A sentence for which the above regex pattern works correctly
>> *Original content in EML file:*
>> Dear Sir,
>>
>>
>> I am terminating
>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> *Index content:* Dear Sir, <br><br>I am terminating
>>
>> Example 2: A sentence for which the above regex pattern only partially
>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content in EML file:*
>>
>> *exalted*
>>
>> *Psalm 89:17*
>>
>>
>> 3 Choa Chu Kang Avenue 4
>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
>> Chu Kang Avenue 4, Singapore
>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
>> Chu Kang Avenue 4, Singapore
>>
>> Example 3: A sentence for which the above regex pattern only partially
>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content in EML file:*
>>
>> http://www.concordpri.moe.edu.sg/
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 18, 2018 at 10:07 AM
>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18,
>> 2018 at 10:07 AM
>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>
>> We would appreciate any other ideas or suggestions that you may have.
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>> On Thu, 7 Feb 2019 at 22:49, <[email protected]> wrote:
>>
>>> Hi Edwin
>>>
>>> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e.
>>> <str name="pattern">(\s*\n){2,}</str>
>>> 2. Perhaps in the data you have other (non-printing) characters than
>>> \n?
>>>
>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>> From: Zheng Lin Edwin Yeo<mailto:[email protected]>
>>> Sent: Thursday, 7 February 2019 15:23
>>> To: [email protected]<mailto:[email protected]>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>> Hi Paul,
>>>
>>> We have tried the suggested regex pattern as follows:
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>   <str name="fieldName">content</str>
>>>   <str name="pattern">(\n\s*){2,}</str>
>>>   <str name="replacement"><br><br></str>
>>> </processor>
>>>
>>> But we still have exactly the same problem with Examples 1, 2 and 3
>>> below.
>>>
>>> Example 1: A sentence for which the above regex pattern works correctly
>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> *Index content:* Dear Sir, <br><br>I am terminating
>>>
>>> Example 2: A sentence for which the above regex pattern only partially
>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
>>> Chu Kang Avenue 4, Singapore
>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
>>> Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: A sentence for which the above regex pattern only partially
>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
>>> \n\n
>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18,
>>> 2018 at 10:07 AM
>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>> Any further suggestions?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:20, <[email protected]> wrote:
>>>
>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>> > {2,} part, you could try
>>> >
>>> > <str name="pattern">(\n\s*){2,}</str>
>>> >
>>> > If you also want to match CRLF, then
>>> >
>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>> >
>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > Windows 10
>>> >
>>> > From: Zheng Lin Edwin Yeo<mailto:[email protected]>
>>> > Sent: Thursday, 7 February 2019 15:10
>>> > To: [email protected]<mailto:[email protected]>
>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >
>>> > Hi Paul,
>>> >
>>> > Thanks for your reply.
>>> >
>>> > When I use this pattern:
>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>> >   <str name="fieldName">content</str>
>>> >   <str name="pattern">(\n+\s*){2,}</str>
>>> >   <str name="replacement"><br><br></str>
>>> > </processor>
>>> >
>>> > It works for some sentences within the same content and not for
>>> > others.
>>> > Please see below for one that works and another that does not
>>> > (it only partially works):
>>> >
>>> > Example 1: A sentence for which the above regex pattern works
>>> > correctly
>>> > *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>>> > *Index content:* Dear Sir, <br><br>I am terminating
>>> >
>>> > Example 2: A sentence for which the above regex pattern only
>>> > partially works (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa
>>> > Chu Kang Avenue 4, Singapore
>>> > *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa
>>> > Chu Kang Avenue 4, Singapore
>>> >
>>> > Example 3: A sentence for which the above regex pattern only
>>> > partially works (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n
>>> > \n\n
>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18,
>>> > 2018 at 10:07 AM
>>> > *Index content:* http://www.concordpri.moe.edu.sg/ <br><br>
>>> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >
>>> > We would appreciate your help in seeing what is wrong.
>>> >
>>> > Thank you.
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> > On Thu, 7 Feb 2019 at 21:24, <[email protected]> wrote:
>>> >
>>> > > You don't say what happens, just that it is not working. I assume
>>> > > nothing is replaced? Perhaps the pattern should be
>>> > >
>>> > > <str name="pattern">"(\n\s*){2,}"</str>
>>> > >
>>> > > ??
>>> > >
>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > > Windows 10
>>> > >
>>> > > From: Zheng Lin Edwin Yeo<mailto:[email protected]>
>>> > > Sent: Thursday, 7 February 2019 14:08
>>> > > To: [email protected]<mailto:[email protected]>
>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >
>>> > > Hi,
>>> > >
>>> > > I am trying to use the RegexReplaceProcessorFactory to detect two
>>> > > or more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>> > > \n \n \n \n) and replace them with two <br>.
>>> > >
>>> > > I use the following regex pattern, and it works when I test it on
>>> > > regex101.com. But it does not work when I put it inside the
>>> > > RegexReplaceProcessorFactory as below:
>>> > >
>>> > > <updateRequestProcessorChain name="removeCode">
>>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>>> > >     <str name="fieldName">content</str>
>>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >     <str name="replacement"><br><br></str>
>>> > >   </processor>
>>> > > </updateRequestProcessorChain>
>>> > >
>>> > > To explain my regex pattern further: \s* tells the regex to match
>>> > > any \n that is followed by whitespace, and {2,} tells it to match
>>> > > 2 or more occurrences of that group (\n plus trailing whitespace).
>>> > >
>>> > > Please kindly let me know what is wrong and how I should do it.
>>> > >
>>> > > I am using Solr 7.6.0.
>>> > >
>>> > > Regards,
>>> > > Edwin
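
One way to test Paul's point 2 outside Solr is to run the same pattern through java.util.regex, which is the engine RegexReplaceProcessorFactory relies on. The sketch below is only an illustration: the class name NewlineCollapse, the sample string containing a no-break space (U+00A0), and the (?U) variant of the pattern are assumptions, not something taken from the actual EML data. By default Java's \s covers only ASCII whitespace, so a U+00A0 that looks like an ordinary space would split one run of newlines into two matches, which would resemble the <br><br> <br><br> output in Examples 2 and 3.

```java
import java.util.regex.Pattern;

public class NewlineCollapse {
    // Same pattern as in the thread: a newline plus trailing whitespace, two or more times.
    public static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");

    // Hypothetical variant: (?U) turns on UNICODE_CHARACTER_CLASS, so \s also matches U+00A0.
    public static final Pattern UNICODE_WS = Pattern.compile("(?U)(\\n\\s*){2,}");

    // Replace every match of the given pattern with two <br> tags.
    public static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, assuming only ordinary spaces and \n.
        String plain = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa";
        // Assumed variant: a no-break space sits between the two newline pairs.
        String nbsp = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa";

        // exalted <br><br>Psalm 89:17 <br><br>3 Choa
        System.out.println(collapse(ASCII_WS, plain));
        // Psalm 89:17 <br><br>[NBSP]<br><br>3 Choa
        System.out.println(collapse(ASCII_WS, nbsp).replace("\u00A0", "[NBSP]"));
        // Psalm 89:17 <br><br>3 Choa
        System.out.println(collapse(UNICODE_WS, nbsp));
    }
}
```

If the plain-space input already collapses correctly in this test, as it does here, then dumping the code points of the stored content field would be a reasonable next step before filing a bug.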
