Hi Paul,

Is there any difference in performance between the two patterns below?
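One way to find out is to time both patterns directly on representative content. As noted elsewhere in this thread, Solr uses Java regex matching, so the sketch below uses java.util.regex. The class name PatternTiming, the sample text, and the round counts are hypothetical choices for illustration. The use of Matcher.quoteReplacement to mimic literalReplacement=true is likewise an assumption, not a copy of Solr's code, so treat the numbers as a rough guide only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: times the two patterns from the question.
class PatternTiming {
    // The two patterns, written as Java string literals (backslashes doubled).
    static final Pattern COLLAPSE = Pattern.compile("(\\n\\W*){2,}");
    static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");

    // Replace every match of the first pattern with a literal "<br><br>".
    static String collapse(String s) {
        return COLLAPSE.matcher(s).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Replace every match of the second pattern with a literal "<br>".
    static String lineEnd(String s) {
        return LINE_END.matcher(s).replaceAll(Matcher.quoteReplacement("<br>"));
    }

    // Run the replacement `rounds` times and return the elapsed nanoseconds.
    static long timeNanos(Pattern p, String replacement, String input, int rounds) {
        String quoted = Matcher.quoteReplacement(replacement);
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            p.matcher(input).replaceAll(quoted);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Assumed sample: the "Dear Sir" example from this thread, repeated.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("Dear Sir, \n\n \n \n\n I am terminating ");
        }
        String sample = sb.toString();

        // Warm-up rounds so the JIT compiler has a chance to optimise first.
        timeNanos(COLLAPSE, "<br><br>", sample, 200);
        timeNanos(LINE_END, "<br>", sample, 200);

        long a = timeNanos(COLLAPSE, "<br><br>", sample, 1000);
        long b = timeNanos(LINE_END, "<br>", sample, 1000);
        System.out.println("(\\n\\W*){2,}         : " + a / 1_000_000 + " ms");
        System.out.println("[ \\t\\x0b\\f]*\\r?\\n : " + b / 1_000_000 + " ms");
    }
}
```

The two patterns do not do the same job, so timing is only part of the comparison. The first collapses a run of two or more line breaks in one step, and it also swallows punctuation between them: collapse("a\n.\nb") gives "a<br><br>b". The second turns every single line ending into a <br>, and in this thread it was paired with a second processor to reduce longer runs, so a fair measurement should include that second step.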
<str name="pattern">(\n\W*){2,}</str>
<str name="pattern">[ \t\x0b\f]*\r?\n</str>

Regards,
Edwin

On Thu, 14 Mar 2019 at 09:36, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Paul,
>
> Thanks for your reply.
>
> So far we have not found any cases of punctuation being removed.
>
> Our aim is to collapse runs of line breaks (\n) into 2 <br>, and they are
> not likely to have any punctuation in between.
>
> Do you know if the pattern <str name="pattern">(\n\W*){2,}</str> that we
> are using is OK?
> Or would the other pattern, <str name="pattern">[ \t\x0b\f]*\r?\n</str>,
> be better?
>
> Regards,
> Edwin
>
> On Wed, 13 Mar 2019 at 20:08, <paul.d...@ub.unibe.ch> wrote:
>
>> Hi Edwin,
>> With \W you will also replace non-word characters such as punctuation.
>> If that's OK, fine. Otherwise you need to identify the white space
>> characters that are causing the problem.
>> ________________________________
>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> Sent: Wednesday, 13 March 2019 03:25:39
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi,
>>
>> We have managed to resolve the issue by changing the \s to \W. The
>> reason could be that some of the characters are other kinds of white
>> space rather than plain spaces. It appears that \s removes only the
>> plain spaces, whereas \W removes the other white space as well.
>>
>> We have used this config, and it works.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(\n\W*){2,}</str>
>>   <str name="replacement"><br><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(\n\W*){1,}</str>
>>   <str name="replacement"><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > Has anyone else faced the same issue before?
>> > So far, none of the regex patterns that we have tried in this thread
>> > has been able to resolve the issue.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> Sorry, I realized there is an extra ']' in the pattern provided, which
>> >> is why there are so many <br> in the output.
>> >>
>> >> The output is exactly the same as before (the previous index result) if
>> >> we remove the extra ']', as shown in the configuration below.
>> >>
>> >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>   <str name="fieldName">content</str>
>> >>   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>   <str name="replacement"><br></str>
>> >>   <bool name="literalReplacement">true</bool>
>> >> </processor>
>> >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>   <str name="fieldName">content</str>
>> >>   <str name="pattern">(<br>[ \t\x0b\f]*){3,}</str>
>> >>   <str name="replacement"><br><br></str>
>> >>   <bool name="literalReplacement">true</bool>
>> >> </processor>
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >>
>> >>> Hi Paul,
>> >>>
>> >>> Thanks for the reply.
>> >>> >> >>> For the 2nd pattern, if we put this pattern <str >> >>> name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>, which is like the >> >>> configurations below: >> >>> >> >>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>> <str name="fieldName">content</str> >> >>> <str name="pattern">[ \t\x0b\f]*\r?\n</str> >> >>> <str name="replacement"><br></str> >> >>> <bool name="literalReplacement">true</bool> >> >>> </processor> >> >>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>> <str name="fieldName">content</str> >> >>> <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str> >> >>> <str name="replacement"><br><br></str> >> >>> <bool name="literalReplacement">true</bool> >> >>> </processor> >> >>> >> >>> It will not be able to change all those more than 3 <br> to 2 <br>. >> >>> >> >>> We will end up with many <br> in the output, like the example below: >> >>> >> >>> http://www.concorded.com/<br><br> >> <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> >> On Tue, Dec 18, 2018 >> >>> >> >>> >> >>> Regards, >> >>> Edwin >> >>> >> >>> >> >>> >> >>> >> >>> On Thu, 7 Mar 2019 at 20:44, <paul.d...@ub.unibe.ch> wrote: >> >>> >> >>>> Hi Edwin >> >>>> >> >>>> >> >>>> >> >>>> I can’t understand why the pattern is not working and where the >> spaces >> >>>> between the <br> are coming from. It should be possible to allow for >> spaces >> >>>> between the <br> in the second match pattern however i.e. 2nd pattern >> >>>> >> >>>> >> >>>> >> >>>> <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str> >> >>>> >> >>>> >> >>>> >> >>>> /Paul >> >>>> >> >>>> >> >>>> >> >>>> Gesendet von Mail<https://go.microsoft.com/fwlink/?LinkId=550986> >> für >> >>>> Windows 10 >> >>>> >> >>>> >> >>>> >> >>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >> >>>> Gesendet: Mittwoch, 6. 
März 2019 16:28 >> >>>> An: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org> >> >>>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect multiple >> \n >> >>>> >> >>>> >> >>>> >> >>>> Hi Paul, >> >>>> >> >>>> I have tried with the first match pattern to be <str name="pattern">[ >> >>>> \t\x0b\f]*\r?\n</str>, like the configuration below: >> >>>> >> >>>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> <str name="fieldName">content</str> >> >>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str> >> >>>> <str name="replacement"><br></str> >> >>>> <bool name="literalReplacement">true</bool> >> >>>> </processor> >> >>>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> <str name="fieldName">content</str> >> >>>> <str name="pattern">(<br>){3,}</str> >> >>>> <str name="replacement"><br><br></str> >> >>>> <bool name="literalReplacement">true</bool> >> >>>> </processor> >> >>>> >> >>>> However, the result is still the same as before (previous index >> >>>> results), >> >>>> with the 4 <br>. >> >>>> >> >>>> Regards, >> >>>> Edwin >> >>>> >> >>>> >> >>>> On Wed, 6 Mar 2019 at 18:23, <paul.d...@ub.unibe.ch> wrote: >> >>>> >> >>>> > Hi Edwin >> >>>> > >> >>>> > >> >>>> > >> >>>> > You are correct re the 2nd pattern – my bad. Looking at the 4 >> <br>, >> >>>> it’s >> >>>> > actually the sequence «<br><br> <br><br>»? So perhaps the first >> match >> >>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str> >> >>>> > >> >>>> > >> >>>> > >> >>>> > i.e. [space tab vertical-tab formfeed] >> >>>> > >> >>>> > >> >>>> > >> >>>> > Regards, >> >>>> > >> >>>> > Paul >> >>>> > >> >>>> > >> >>>> > >> >>>> > Gesendet von Mail<https://go.microsoft.com/fwlink/?LinkId=550986> >> für >> >>>> > Windows 10 >> >>>> > >> >>>> > >> >>>> > >> >>>> > Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >> >>>> > Gesendet: Mittwoch, 6. 
März 2019 07:44 >> >>>> > An: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org >> > >> >>>> > Betreff: Re: RegexReplaceProcessorFactory pattern to detect >> multiple >> >>>> \n >> >>>> > >> >>>> > >> >>>> > >> >>>> > Hi Paul, >> >>>> > >> >>>> > I have modified the second pattern to be (<br>){3,}, instead >> of >> >>>> > (<br><br>){3,}. This pattern of >> >>>> (<br><br>){3,} >> >>>> > will actually look for 6 or more <br> instead of 3 <br>, as we >> have >> >>>> put >> >>>> > the <br> two times in the pattern, which is the reason that there >> are >> >>>> more >> >>>> > <br> in the result, as cases where there are less than 6 <br> are >> not >> >>>> being >> >>>> > replaced, so we ended up having up to 5 <br> in the index. >> >>>> > >> >>>> > Modified configuration: >> >>>> > <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > <str name="fieldName">content</str> >> >>>> > <str name="pattern">(<br>){3,}</str> >> >>>> > <str name="replacement"><br><br></str> >> >>>> > <bool name="literalReplacement">true</bool> >> >>>> > </processor> >> >>>> > >> >>>> > This will bring us back to the result of the previous index >> content, >> >>>> > meaning the issue of having the 4 <br> is still there. >> >>>> > >> >>>> > Regards, >> >>>> > Edwin >> >>>> > >> >>>> > >> >>>> > >> >>>> > Regards, >> >>>> > Edwin >> >>>> > >> >>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo < >> >>>> edwinye...@gmail.com> >> >>>> > wrote: >> >>>> > >> >>>> > > Hi Paul, >> >>>> > > >> >>>> > > Further to my previous email, which there was an extra "}" in the >> >>>> > > configuration, I have changed to use the below configuration >> based >> >>>> on >> >>>> > your >> >>>> > > suggestion. 
>> >>>> > > >> >>>> > > <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > > <str name="fieldName">content</str> >> >>>> > > <str name="pattern">[ \t]*\r?\n</str> >> >>>> > > <str name="replacement"><br></str> >> >>>> > > <bool name="literalReplacement">true</bool> >> >>>> > > </processor> >> >>>> > > <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > > <str name="fieldName">content</str> >> >>>> > > <str name="pattern">(<br><br>){3,}</str> >> >>>> > > <str name="replacement"><br><br></str> >> >>>> > > <bool name="literalReplacement">true</bool> >> >>>> > > </processor> >> >>>> > > >> >>>> > > However, the result that I get still has more than 2 <br>. In >> fact, >> >>>> the >> >>>> > > result become worse, as you can see from the comparison below. >> >>>> > > >> >>>> > > Example 1: The sentence that the regex pattern used to work >> >>>> correctly. >> >>>> > But >> >>>> > > with the latest pattern, it has now changed from 2 <br> to >> become 5 >> >>>> <br>, >> >>>> > > which is wrong. 
>> >>>> > > *Original content in EML file:* >> >>>> > > Dear Sir, >> >>>> > > >> >>>> > > >> >>>> > > I am terminating >> >>>> > > *Original content:* Dear Sir, \n\n \n \n\n I am terminating >> >>>> > > *Previous Index content: * Dear Sir, <br><br>I am terminating >> >>>> > > *Current Index content*: Dear Sir, <br><br><br><br><br> I am >> >>>> > terminating >> >>>> > > >> >>>> > > Example 2: The sentence that the above regex pattern is partially >> >>>> working >> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>) >> >>>> > > *Original content in EML file:* >> >>>> > > >> >>>> > > *exalted* >> >>>> > > >> >>>> > > *Psalm 89:17* >> >>>> > > >> >>>> > > >> >>>> > > 3 Choa Chu Kang Avenue 4 >> >>>> > > *Original content:* exalted \n \n\n Psalm 89:17 \n\n >> \n\n 3 >> >>>> Choa >> >>>> > > Chu Kang Avenue 4, Singapore >> >>>> > > *Previous Index content: *exalted <br><br>Psalm 89:17 <br><br> >> >>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore >> >>>> > > *Current Index content*: <br><br><br> Psalm 89:17<br><br> >> >>>> <br><br> 3 >> >>>> > > Choa Chu Kang Avenue 3, Singapor4 >> >>>> > > >> >>>> > > Example 3: The sentence that the above regex pattern is partially >> >>>> working >> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). 
For the >> >>>> latest >> >>>> > code, >> >>>> > > there are now 5 <br> >> >>>> > > *Original content in EML file:* >> >>>> > > >> >>>> > > http://www.concorded.com/ >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM >> >>>> > > *Original content:* http://www.concorded.com/ \n\n \n\n \n >> >>>> \n\n \n\n >> >>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, >> >>>> 2018 at >> >>>> > > 10:07 AM >> >>>> > > *Previous Index content: *http://www.concorded.com/ <br><br> >> >>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM >> >>>> > > *Current Index content:* http://www.concorded.com/<br><br> >> >>>> <br><br><br> >> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM >> >>>> > > >> >>>> > > >> >>>> > > Regards, >> >>>> > > Edwin >> >>>> > > >> >>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo < >> >>>> edwinye...@gmail.com> >> >>>> > > wrote: >> >>>> > > >> >>>> > >> Hi Paul, >> >>>> > >> >> >>>> > >> Thank you for the reply. >> >>>> > >> >> >>>> > >> I have tried to add the following configuration according to >> your >> >>>> > >> suggestion: >> >>>> > >> >> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > >> <str name="fieldName">content</str> >> >>>> > >> <str name="pattern">[ \t]*\r?\n}</str> >> >>>> > >> <str name="replacement"><br></str> >> >>>> > >> <bool name="literalReplacement">true</bool> >> >>>> > >> </processor> >> >>>> > >> >> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > >> <str name="fieldName">content</str> >> >>>> > >> <str name="pattern">(<br><br>){3,}</str> >> >>>> > >> <str name="replacement"><br><br></str> >> >>>> > >> <bool name="literalReplacement">true</bool> >> >>>> > >> </processor> >> >>>> > >> >> >>>> > >> However, none of the \n is being removed this time round. >> >>>> > >> Is the order and/or the pattern correct? 
>> >>>> > >> >> >>>> > >> Regards, >> >>>> > >> Edwin >> >>>> > >> >> >>>> > >> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote: >> >>>> > >> >> >>>> > >>> Hi Edwin >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Try for the first pattern/replacement >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> <str name="pattern">[ \t]*\r?\n</str> >> >>>> > >>> >> >>>> > >>> <str name="replacement"><br></str> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Now all line endings and preceding whitespace characters >> should be >> >>>> > >>> changed to ‘<br>’. >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’ >> >>>> > sequences >> >>>> > >>> to 2 ‘<br>’ sequences: >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> <str name="pattern">(<br><br>){3,}</str> >> >>>> > >>> >> >>>> > >>> <str name="replacement"><br><br></str> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Hope this approach works. Sorry for not replying earlier and >> best >> >>>> > >>> regards, >> >>>> > >>> >> >>>> > >>> Paul >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Gesendet von Mail< >> https://go.microsoft.com/fwlink/?LinkId=550986> >> >>>> für >> >>>> > >>> Windows 10 >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >> >>>> > >>> Gesendet: Dienstag, 5. März 2019 03:35 >> >>>> > >>> An: solr-user@lucene.apache.org<mailto: >> >>>> solr-user@lucene.apache.org> >> >>>> > >>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect >> >>>> multiple \n >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > >>> Hi, >> >>>> > >>> >> >>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as >> >>>> well. 
>> >>>> > >>> >> >>>> > >>> Regards, >> >>>> > >>> Edwin >> >>>> > >>> >> >>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo < >> >>>> > edwinye...@gmail.com> >> >>>> > >>> wrote: >> >>>> > >>> >> >>>> > >>> > Hi, >> >>>> > >>> > >> >>>> > >>> > Anyone else has other suggestions or have faced the same >> >>>> problem? >> >>>> > >>> > >> >>>> > >>> > Regards, >> >>>> > >>> > Edwin >> >>>> > >>> > >> >>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo < >> >>>> > >>> edwinye...@gmail.com> >> >>>> > >>> > wrote: >> >>>> > >>> > >> >>>> > >>> >> Hi Paul, >> >>>> > >>> >> >> >>>> > >>> >> If I tried to execute the second step first, then I will >> only >> >>>> get a >> >>>> > >>> >> single <br> for those with 2 <br>. >> >>>> > >>> >> For those that we originally get 4 <br>, there will be 2 >> <br> >> >>>> with a >> >>>> > >>> >> space in between. >> >>>> > >>> >> >> >>>> > >>> >> This is just changing the 2 <br> to be a single <br>, since >> the >> >>>> > second >> >>>> > >>> >> step is to replace with a single <br>. >> >>>> > >>> >> But it has not solved the underlying problem yet. >> >>>> > >>> >> >> >>>> > >>> >> Regards, >> >>>> > >>> >> Edwin >> >>>> > >>> >> >> >>>> > >>> >> >> >>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> >> wrote: >> >>>> > >>> >> >> >>>> > >>> >>> If the second step is executed first, then you will get the >> >>>> > unwanted >> >>>> > >>> 4 >> >>>> > >>> >>> <br> >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> Gesendet von Mail< >> >>>> https://go.microsoft.com/fwlink/?LinkId=550986> >> >>>> > >>> für >> >>>> > >>> >>> Windows 10 >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >> >>>> > >>> >>> Gesendet: Mittwoch, 20. 
Februar 2019 09:29 >> >>>> > >>> >>> An: solr-user@lucene.apache.org<mailto: >> >>>> solr-user@lucene.apache.org >> >>>> > > >> >>>> > >>> >>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect >> >>>> > multiple >> >>>> > >>> \n >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> >> >>>> > >>> >>> Hi Jörn , >> >>>> > >>> >>> >> >>>> > >>> >>> Do you mean the regex is not correct? >> >>>> > >>> >>> >> >>>> > >>> >>> We are already using two RegexReplaceProcessorFactory >> steps, >> >>>> like >> >>>> > >>> the one >> >>>> > >>> >>> shown below. The output that we get is still the same. >> >>>> > >>> >>> >> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > >>> >>> <str name="fieldName">content</str> >> >>>> > >>> >>> <str name="pattern">([ \t]*\r?\n){2,}</str> >> >>>> > >>> >>> <str name="replacement"><br><br></str> >> >>>> > >>> >>> <bool name="literalReplacement">true</bool> >> >>>> > >>> >>> <processor> >> >>>> > >>> >>> >> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > >>> >>> <str name="fieldName">content</str> >> >>>> > >>> >>> <str name="pattern">([ \t]*\r?\n){1,}</str> >> >>>> > >>> >>> <str name="replacement"><br></str> >> >>>> > >>> >>> <bool name="literalReplacement">true</bool> >> >>>> > >>> >>> <processor> >> >>>> > >>> >>> >> >>>> > >>> >>> Regards, >> >>>> > >>> >>> Edwin >> >>>> > >>> >>> >> >>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke < >> >>>> jornfra...@gmail.com> >> >>>> > >>> wrote: >> >>>> > >>> >>> >> >>>> > >>> >>> > Then you need two regexprocessfactory steps >> >>>> > >>> >>> > >> >>>> > >>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo < >> >>>> > >>> >>> edwinye...@gmail.com >> >>>> > >>> >>> > >: >> >>>> > >>> >>> > > >> >>>> > >>> >>> > > Hi, >> >>>> > >>> >>> > > >> >>>> > >>> >>> > > Thanks for the reply. 
>> >>>> > >>> >>> > > >> >>>> > >>> >>> > > Do you know of any regex online tool that works >> correctly >> >>>> for >> >>>> > >>> Java >> >>>> > >>> >>> regex? >> >>>> > >>> >>> > > I tried to find some, but they are not working >> properly. >> >>>> > >>> >>> > > >> >>>> > >>> >>> > > Yes, our plan is to replace more than one \n with >> >>>> <br><br>, and >> >>>> > >>> >>> single \n >> >>>> > >>> >>> > > with single <br>. >> >>>> > >>> >>> > > >> >>>> > >>> >>> > > Regards, >> >>>> > >>> >>> > > Edwin >> >>>> > >>> >>> > > >> >>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke < >> >>>> > jornfra...@gmail.com >> >>>> > >>> > >> >>>> > >>> >>> wrote: >> >>>> > >>> >>> > >> >> >>>> > >>> >>> > >> Solr uses Java regex matching, so i doubt there is a >> bug >> >>>> - it >> >>>> > >>> would >> >>>> > >>> >>> then >> >>>> > >>> >>> > >> be in the JDK. Try out in a regex online Tool that >> >>>> supports >> >>>> > Java >> >>>> > >>> >>> regex >> >>>> > >>> >>> > for >> >>>> > >>> >>> > >> your solution. 
>> >>>> > >>> >>> > >> >> >>>> > >>> >>> > >> I believe you want to have 2 regex process factories: >> >>>> > >>> >>> > >> One that deals with single \n and one that deals with >> >>>> more >> >>>> > than >> >>>> > >>> one >> >>>> > >>> >>> \n >> >>>> > >>> >>> > >> >> >>>> > >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo < >> >>>> > >>> >>> > edwinye...@gmail.com >> >>>> > >>> >>> > >>> : >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> Hi, >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> We have tried with the following pattern ([ >> >>>> \t]*\r?\n){2,} >> >>>> > and >> >>>> > >>> >>> > >>> configuration: >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory"> >> >>>> > >>> >>> > >>> <str name="fieldName">content</str> >> >>>> > >>> >>> > >>> <str name="pattern">([ \t]*\r?\n){2,}</str> >> >>>> > >>> >>> > >>> <str name="replacement"><br><br></str> >> >>>> > >>> >>> > >>> <bool name="literalReplacement">true</bool> >> >>>> > >>> >>> > >>> </processor> >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> However, the issue is still occurring. >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> Anyone else is able to help? >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> Regards, >> >>>> > >>> >>> > >>> Edwin >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo < >> >>>> > >>> >>> > edwinye...@gmail.com> >> >>>> > >>> >>> > >>> wrote: >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>>> Hi, >> >>>> > >>> >>> > >>>> >> >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr >> 7.7.0 as >> >>>> > well. 
>> >>>> > >>> >>> > >>>> >> >>>> > >>> >>> > >>>> Regards, >> >>>> > >>> >>> > >>>> Edwin >> >>>> > >>> >>> > >>>> >> >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo < >> >>>> > >>> >>> > edwinye...@gmail.com >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>>> wrote: >> >>>> > >>> >>> > >>>> >> >>>> > >>> >>> > >>>>> Hi, >> >>>> > >>> >>> > >>>>> >> >>>> > >>> >>> > >>>>> Should we report this as a bug in Solr? >> >>>> > >>> >>> > >>>>> >> >>>> > >>> >>> > >>>>> Regards, >> >>>> > >>> >>> > >>>>> Edwin >> >>>> > >>> >>> > >>>>> >> >>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo < >> >>>> > >>> >>> > edwinye...@gmail.com >> >>>> > >>> >>> > >>> >> >>>> > >>> >>> > >>>>> wrote: >> >>>> > >>> >>> > >>>>> >> >>>> > >>> >>> > >>>>>> Hi Paul, >> >>>> > >>> >>> > >>>>>> >> >>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, >> >>>> when we >> >>>> > >>> try >> >>>> > >>> >>> in on >> >>>> > >>> >>> > >>>>>> https://regex101.com/, it is able to give us the >> >>>> correct >> >>>> > >>> >>> result for >> >>>> > >>> >>> > >> all >> >>>> > >>> >>> > >>>>>> the examples (ie: All of them will only have >> >>>> <br><br>, and >> >>>> > >>> not >> >>>> > >>> >>> more >> >>>> > >>> >>> > >> than >> >>>> > >>> >>> > >>>>>> that like what we are getting in Solr in our >> earlier >> >>>> > >>> examples). >> >>>> > >>> >>> > >>>>>> >> >>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr? >> >>>> > >>> >>> > >>>>>> >> >>>> > >>> >>> > >>>>>> Regards, >> >>>> > >>> >>> > >>>>>> Edwin >> >>>> > >>> >>> > >>>>>> >> >>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo < >> >>>> > >>> >>> > >> edwinye...@gmail.com> >> >>>> > >>> >>> > >>>>>> wrote: >> >>>> > >>> >>> > >>>>>> >> >>>> > >>> >>> > >>>>>>> Hi Paul, >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> We have tried it with the space preceeding the \n >> >>>> i.e. 
>> >>>> > <str >> >>>> > >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the >> following >> >>>> > regex >> >>>> > >>> >>> pattern: >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> <processor >> >>>> class="solr.RegexReplaceProcessorFactory"> >> >>>> > >>> >>> > >>>>>>> <str name="fieldName">content</str> >> >>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str> >> >>>> > >>> >>> > >>>>>>> <str >> name="replacement"><br><br></str> >> >>>> > >>> >>> > >>>>>>> </processor> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> However, we are also getting the exact same >> results >> >>>> as >> >>>> > the >> >>>> > >>> >>> earlier >> >>>> > >>> >>> > >>>>>>> Example 1, 2 and 3. >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you >> have >> >>>> other >> >>>> > >>> (non >> >>>> > >>> >>> > >>>>>>> printing) characters than \n, we have find that >> >>>> there are >> >>>> > >>> no >> >>>> > >>> >>> non >> >>>> > >>> >>> > >> printing >> >>>> > >>> >>> > >>>>>>> characters. It is just next line with a space. >> You >> >>>> can >> >>>> > >>> refer >> >>>> > >>> >>> to the >> >>>> > >>> >>> > >>>>>>> original content in the same examples below. 
>> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> working >> >>>> > >>> >>> > >>>>>>> correctly >> >>>> > >>> >>> > >>>>>>> *Original content in EML file:* >> >>>> > >>> >>> > >>>>>>> Dear Sir, >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> I am terminating >> >>>> > >>> >>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I >> am >> >>>> > >>> terminating >> >>>> > >>> >>> > >>>>>>> *Index content: * Dear Sir, <br><br>I am >> >>>> terminating >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> >>> partially >> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there >> >>>> are 4 >> >>>> > >>> <br>) >> >>>> > >>> >>> > >>>>>>> *Original content in EML file:* >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> *exalted* >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> *Psalm 89:17* >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4 >> >>>> > >>> >>> > >>>>>>> *Original content:* exalted \n \n\n Psalm >> 89:17 >> >>>> \n\n >> >>>> > >>> >>> \n\n 3 >> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore >> >>>> > >>> >>> > >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 >> >>>> <br><br> >> >>>> > >>> >>> <br><br>3 >> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> >>> partially >> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there >> >>>> are 4 >> >>>> > >>> <br>) >> >>>> > >>> >>> > >>>>>>> *Original content in EML file:* >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/ >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> 
> >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM >> >>>> > >>> >>> > >>>>>>> *Original content:* >> >>>> http://www.concordpri.moe.edu.sg/ >> >>>> > >>> \n\n >> >>>> > >>> >>> > \n\n >> >>>> > >>> >>> > >> \n >> >>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n >> \n\n\n >> >>>> > >>> \n\n\n On >> >>>> > >>> >>> Tue, >> >>>> > >>> >>> > >> Dec 18, >> >>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM >> >>>> > >>> >>> > >>>>>>> *Index content: * >> http://www.concordpri.moe.edu.sg/ >> >>>> > >>> <br><br> >> >>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that >> you >> >>>> may >> >>>> > >>> have. >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Thank you. >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>> Regards, >> >>>> > >>> >>> > >>>>>>> Edwin >> >>>> > >>> >>> > >>>>>>> >> >>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, < >> >>>> paul.d...@ub.unibe.ch> >> >>>> > >>> wrote: >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Hi Edwin >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> 1. Sorry, the pattern was wrong, the space >> should >> >>>> > preceed >> >>>> > >>> >>> the \n >> >>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str> >> >>>> > >>> >>> > >>>>>>>> 2. Perhaps in the data you have other (non >> >>>> printing) >> >>>> > >>> >>> characters >> >>>> > >>> >>> > >>>>>>>> than \n? 
>> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Gesendet von Mail< >> >>>> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986> >> >>>> > >>> >>> > >> für >> >>>> > >>> >>> > >>>>>>>> Windows 10 >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<mailto: >> >>>> edwinye...@gmail.com> >> >>>> > >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23 >> >>>> > >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto: >> >>>> > >>> >>> > solr-user@lucene.apache.org> >> >>>> > >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory >> pattern >> >>>> to >> >>>> > >>> detect >> >>>> > >>> >>> > >> multiple \n >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Hi Paul, >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as >> >>>> follow: >> >>>> > >>> >>> > >>>>>>>> <processor >> >>>> class="solr.RegexReplaceProcessorFactory"> >> >>>> > >>> >>> > >>>>>>>> <str name="fieldName">content</str> >> >>>> > >>> >>> > >>>>>>>> <str name="pattern">(\n\s*){2,}</str> >> >>>> > >>> >>> > >>>>>>>> <str >> name="replacement"><br><br></str> >> >>>> > >>> >>> > >>>>>>>> </processor> >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of >> >>>> Example >> >>>> > 1,2 >> >>>> > >>> and >> >>>> > >>> >>> 3 >> >>>> > >>> >>> > >> below. 
>> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> >>> working >> >>>> > >>> >>> > >>>>>>>> correctly >> >>>> > >>> >>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n >> I am >> >>>> > >>> >>> terminating >> >>>> > >>> >>> > >>>>>>>> *Index content: * Dear Sir, <br><br>I am >> >>>> terminating >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> >>> partially >> >>>> > >>> >>> > >>>>>>>> working >> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 >> >>>> <br>) >> >>>> > >>> >>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm >> 89:17 >> >>>> > \n\n >> >>>> > >>> >>> \n\n >> >>>> > >>> >>> > 3 >> >>>> > >>> >>> > >>>>>>>> Choa >> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore >> >>>> > >>> >>> > >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 >> >>>> <br><br> >> >>>> > >>> >>> > <br><br>3 >> >>>> > >>> >>> > >>>>>>>> Choa >> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore >> >>>> > >>> >>> > >>>>>>>> >> >>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex >> >>>> pattern is >> >>>> > >>> >>> partially >> >>>> > >>> >>> > >>>>>>>> working >> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 >> >>>> <br>) >> >>>> > >>> >>> > >>>>>>>> *Original content:* >> >>>> http://www.concordpri.moe.edu.sg/ >> >>>> > >>> \n\n >> >>>> > >>> >>> > \n\n >> >>>> > >>> >>> > >>>>>>>> \n \n\n >> >>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n >> >>>> \n\n\n >> >>>> > On >> >>>> > >>> >>> Tue, Dec >> >>>> > >>> >>> > >> 18, >> >>>> > >>> >>> > >>>>>>>> 2018 >> >>>> > >>> >>> > >>>>>>>> at 10:07 AM >> >>>> > >>> >>> > >>>>>>>> *Index content: * >> http://www.concordpri.moe.edu.sg/ >> >>>> > >>> <br><br> >> >>>> > >>> >>> > >>>>>>>> <br><br>On >> >>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM 
>
> Any further suggestions?
>
> Thank you.
>
> Regards,
> Edwin
>
>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>
>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>> part, you could try
>>
>> <str name="pattern">(\n\s*){2,}</str>
>>
>> If you also want to match CRLF, then
>>
>> <str name="pattern">(\r?\n\s*){2,}</str>
>>
>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>
>> From: Zheng Lin Edwin Yeo <mailto:edwinye...@gmail.com>
>> Sent: Thursday, 7 February 2019 15:10
>> To: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi Paul,
>>
>> Thanks for your reply.
>>
>> When I use this pattern:
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(\n+\s*){2,}</str>
>>   <str name="replacement"><br><br></str>
>> </processor>
>>
>> It works for some sentences within the same content but not for others.
>> Please see below for one that works and two that only partially work:
>>
>> Example 1: A sentence where the above regex pattern works correctly
>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
>> *Index content:* Dear Sir, <br><br>I am terminating
>>
>> Example 2: A sentence where the above regex pattern only partially works
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>
>> Example 3: A sentence where the above regex pattern only partially works
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>
>> We would appreciate your help in finding out what is wrong.
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>>
>>> You don't say what happens, just that it is not working. I assume
>>> nothing is replaced? Perhaps the pattern should be
>>>
>>> <str name="pattern">"(\n\s*){2,}"</str>
>>>
>>> ??
>>>
>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>>
>>> From: Zheng Lin Edwin Yeo <mailto:edwinye...@gmail.com>
>>> Sent: Thursday, 7 February 2019 14:08
>>> To: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>
>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>> Hi,
>>>
>>> I am trying to use the RegexReplaceProcessorFactory to find two or more
>>> \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n)
>>> and replace them with two <br>.
>>>
>>> The following regex pattern works when I test it on regex101.com, but it
>>> does not work when I put it inside the RegexReplaceProcessorFactory as
>>> below:
>>>
>>> <updateRequestProcessorChain name="removeCode">
>>>   <processor class="solr.RegexReplaceProcessorFactory">
>>>     <str name="fieldName">content</str>
>>>     <str name="pattern">"(\\n\s*){2,}"</str>
>>>     <str name="replacement"><br><br></str>
>>>   </processor>
>>> </updateRequestProcessorChain>
>>>
>>> To explain my regex pattern further: \n\s* matches a \n followed by any
>>> amount of whitespace, and {2,} matches 2 or more occurrences of that
>>> group.
>>>
>>> Please let me know what is wrong and how I should do it.
>>>
>>> I am using Solr 7.6.0.
>>>
>>> Regards,
>>> Edwin
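For the original question at the bottom of this thread, here is a minimal sketch of why the pattern as first configured would match nothing. It assumes Solr hands the text of the <str name="pattern"> element to the regex engine unchanged, so the surrounding double quotes become literal characters to match and \\n means a literal backslash followed by the letter n, not a newline. The sketch uses Python's re module for illustration (Solr itself uses Java regex), and the names quoted and plain are illustrative only.

```python
import re

# Sample taken from Example 1 in the thread.
text = "Dear Sir, \n\n \n \n\n I am terminating"

# The pattern exactly as written inside the <str> element: the quotes are
# literal characters and \\n is a backslash followed by the letter "n".
quoted = r'"(\\n\s*){2,}"'

# The same pattern without the string-literal decoration.
plain = r"(\n\s*){2,}"

unchanged = re.sub(quoted, "<br><br>", text)
replaced = re.sub(plain, "<br><br>", text)

print(repr(unchanged))
print(repr(replaced))
```

The quoted form leaves the text untouched because the input contains neither a double quote nor a literal backslash, while the plain form collapses the whole newline run into <br><br>.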