Hi,

Does anyone else have other suggestions, or has anyone faced the same problem?
Regards,
Edwin

On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Paul,
>
> If I execute the second step first, I get only a single <br> for the cases that had 2 <br>.
> For the cases where we originally got 4 <br>, there will be 2 <br> with a space in between.
>
> This just changes the 2 <br> into a single <br>, since the second step replaces with a single <br>.
> But it has not solved the underlying problem yet.
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:
>
>> If the second step is executed first, then you will get the unwanted 4 <br>
>>
>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> Sent: Wednesday, 20 February 2019 09:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi Jörn,
>>
>> Do you mean the regex is not correct?
>>
>> We are already using two RegexReplaceProcessorFactory steps, like the ones shown below. The output that we get is still the same.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>>   <str name="replacement"><br><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">([ \t]*\r?\n){1,}</str>
>>   <str name="replacement"><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> > Then you need two RegexReplaceProcessorFactory steps.
>> >
>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > Thanks for the reply.
>> > >
>> > > Do you know of any online regex tool that works correctly for Java regex?
>> > > I tried to find some, but they are not working properly.
>> > >
>> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
>> > >>
>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then be in the JDK. Try your solution out in an online regex tool that supports Java regex.
>> > >>
>> > >> I believe you want two RegexReplaceProcessorFactory steps:
>> > >> one that deals with a single \n and one that deals with more than one \n.
>> > >>
>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, with this configuration:
>> > >>>
>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>   <str name="fieldName">content</str>
>> > >>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>> > >>>   <str name="replacement"><br><br></str>
>> > >>>   <bool name="literalReplacement">true</bool>
>> > >>> </processor>
>> > >>>
>> > >>> However, the issue is still occurring.
>> > >>>
>> > >>> Is anyone else able to help?
>> > >>>
>> > >>> Regards,
>> > >>> Edwin
>> > >>>
>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> > >>>>
>> > >>>> Regards,
>> > >>>> Edwin
>> > >>>>
>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >>>>
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> Should we report this as a bug in Solr?
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>> Edwin
>> > >>>>>
>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >>>>>
>> > >>>>>> Hi Paul,
>> > >>>>>>
>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on https://regex101.com/, it gives the correct result for all the examples (i.e. all of them have only <br><br>, and not more than that, unlike what we are getting in Solr in our earlier examples).
>> > >>>>>>
>> > >>>>>> Could there be a possibility of a bug in Solr?
>> > >>>>>>
>> > >>>>>> Regards,
>> > >>>>>> Edwin
>> > >>>>>>
>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> Hi Paul,
>> > >>>>>>>
>> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, in the following configuration:
>> > >>>>>>>
>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>   <str name="fieldName">content</str>
>> > >>>>>>>   <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>   <str name="replacement"><br><br></str>
>> > >>>>>>> </processor>
>> > >>>>>>>
>> > >>>>>>> However, we are getting exactly the same results as in the earlier Examples 1, 2 and 3.
>> > >>>>>>>
>> > >>>>>>> As for your point 2, that the data may contain other (non-printing) characters besides \n: we have found no non-printing characters. It is just a line break with a space. You can refer to the original content in the same examples below.
>> > >>>>>>>
>> > >>>>>>> Example 1: The sentence for which the above regex pattern works correctly
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>> Dear Sir,
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> I am terminating
>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> > >>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>> > >>>>>>>
>> > >>>>>>> Example 2: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> *exalted*
>> > >>>>>>>
>> > >>>>>>> *Psalm 89:17*
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>
>> > >>>>>>> Example 3: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>
>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>> > >>>>>>>
>> > >>>>>>> Thank you.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Edwin
>> > >>>>>>>
>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Edwin
>> > >>>>>>>>
>> > >>>>>>>> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>> 2. Perhaps the data contains other (non-printing) characters besides \n?
>> > >>>>>>>>
>> > >>>>>>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> > >>>>>>>> To: solr-user@lucene.apache.org
>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>
>> > >>>>>>>> Hi Paul,
>> > >>>>>>>>
>> > >>>>>>>> We have tried the suggested regex pattern as follows:
>> > >>>>>>>>
>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>   <str name="fieldName">content</str>
>> > >>>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>   <str name="replacement"><br><br></str>
>> > >>>>>>>> </processor>
>> > >>>>>>>>
>> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>> > >>>>>>>>
>> > >>>>>>>> Example 1: The sentence for which the above regex pattern works correctly
>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> > >>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>> > >>>>>>>>
>> > >>>>>>>> Example 2: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>
>> > >>>>>>>> Example 3: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>
>> > >>>>>>>> Any further suggestions?
>> > >>>>>>>>
>> > >>>>>>>> Thank you.
>> > >>>>>>>>
>> > >>>>>>>> Regards,
>> > >>>>>>>> Edwin
>> > >>>>>>>>
>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part, you could try
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>> If you also want to match CRLF, then
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> > >>>>>>>>> To: solr-user@lucene.apache.org
>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>
>> > >>>>>>>>> Hi Paul,
>> > >>>>>>>>>
>> > >>>>>>>>> Thanks for your reply.
>> > >>>>>>>>>
>> > >>>>>>>>> When I use this pattern:
>> > >>>>>>>>>
>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>   <str name="fieldName">content</str>
>> > >>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>> > >>>>>>>>>   <str name="replacement"><br><br></str>
>> > >>>>>>>>> </processor>
>> > >>>>>>>>>
>> > >>>>>>>>> it works for some sentences within the same content and not for others.
>> > >>>>>>>>> Please see below for one that is working and others that are not working (partially working):
>> > >>>>>>>>>
>> > >>>>>>>>> Example 1: The sentence for which the above regex pattern works correctly
>> > >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> > >>>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
>> > >>>>>>>>>
>> > >>>>>>>>> Example 2: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>>
>> > >>>>>>>>> Example 3: The sentence for which the above regex pattern works partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>>
>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>> > >>>>>>>>>
>> > >>>>>>>>> Thank you.
>> > >>>>>>>>>
>> > >>>>>>>>> Regards,
>> > >>>>>>>>> Edwin
>> > >>>>>>>>>
>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>> You don't say what happens, just that it is not working. I assume nothing is replaced?
>> > >>>>>>>>>> Perhaps the pattern should be
>> > >>>>>>>>>>
>> > >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
>> > >>>>>>>>>>
>> > >>>>>>>>>> ??
>> > >>>>>>>>>>
>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> > >>>>>>>>>> To: solr-user@lucene.apache.org
>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>>
>> > >>>>>>>>>> Hi,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two or more \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n), and replace them with two <br>.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on regex101.com.
>> > >>>>>>>>>> But it does not work when I put it inside the RegexReplaceProcessorFactory as below:
>> > >>>>>>>>>>
>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> > >>>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>>     <str name="fieldName">content</str>
>> > >>>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
>> > >>>>>>>>>>     <str name="replacement"><br><br></str>
>> > >>>>>>>>>>   </processor>
>> > >>>>>>>>>> </updateRequestProcessorChain>
>> > >>>>>>>>>>
>> > >>>>>>>>>> To explain my regex pattern further: \s* instructs the regex to match any \n that is followed by spaces, and {2,} instructs the regex to match 2 or more occurrences of that pattern (\n).
>> > >>>>>>>>>>
>> > >>>>>>>>>> Please let me know what is wrong and how I should do it.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am using Solr 7.6.0.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Regards,
>> > >>>>>>>>>> Edwin
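
One way to separate a regex problem from a data problem is to run the same pattern through java.util.regex directly, outside Solr. The sketch below is only an illustration and is not taken from the thread: the class name NewlineCollapseDemo, the helper collapse and the sample strings are made up, and the idea that the stray character between the newlines is a no-break space (U+00A0) is an assumption that the thread never confirms. It relies on the fact that Java's \s does not match U+00A0 by default, so if such a character were present, (\n\s*){2,} would stop at it and yield two separate <br><br> pairs, much like Example 2.

```java
import java.util.regex.Pattern;

class NewlineCollapseDemo {
    // Pattern from the thread: two or more newlines, each followed by optional whitespace.
    static final Pattern THREAD_PATTERN = Pattern.compile("(\\n\\s*){2,}");

    // Hypothetical variant that also accepts a no-break space (U+00A0) between newlines.
    static final Pattern NBSP_AWARE = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    // Replace every run matched by the pattern with exactly two <br> tags.
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Same shape as Example 2, once with an ordinary space and once with U+00A0.
        String plainSpace = "Psalm 89:17\n\n \n\n3 Choa Chu Kang";
        String noBreakSpace = "Psalm 89:17\n\n\u00A0\n\n3 Choa Chu Kang";

        // Ordinary space: the whole run collapses into one <br><br>.
        System.out.println(collapse(THREAD_PATTERN, plainSpace));

        // U+00A0: the run is split in two, giving four <br> in total.
        // The no-break space is made visible as [NBSP] for printing.
        System.out.println(collapse(THREAD_PATTERN, noBreakSpace).replace("\u00A0", "[NBSP]"));

        // The variant pattern collapses the U+00A0 case into one <br><br> as well.
        System.out.println(collapse(NBSP_AWARE, noBreakSpace));
    }
}
```

If the first call already produces four <br> on the real field value, the data contains something other than plain spaces and newlines; dumping the code points of the text between the two <br><br> pairs would show which character it is.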