Hi Paul,

I would like to check whether there is any difference in performance between
the following two patterns:

<str name="pattern">(\n\W*){2,}</str>

<str name="pattern">[ \t\x0b\f]*\r?\n</str>

Regards,
Edwin
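
For reference, the behavioural difference between the two patterns can be sketched in plain Java, since Solr uses Java regex matching. This is only an illustration, not a benchmark: the class name NewlinePatternDemo and the sample strings are made up for this note, and Matcher.quoteReplacement is assumed to approximate what literalReplacement=true does. The second pattern is paired here with the (<br>[ \t\x0b\f]*){3,} collapse step quoted further down the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewlinePatternDemo {
    // First pattern: two or more newlines, each followed by any run of non-word characters.
    static final Pattern NON_WORD_RUN = Pattern.compile("(\\n\\W*){2,}");
    // Second pattern: optional space/tab/vertical-tab/form-feed, then one line ending.
    static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");
    // Follow-up step from the thread: collapse 3 or more <br> into 2.
    static final Pattern BR_RUN = Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");

    // Replace every match with the replacement taken literally (assumed
    // equivalent of literalReplacement=true).
    static String replace(Pattern p, String input, String replacement) {
        return p.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // One processor using (\n\W*){2,}.
    public static String singlePass(String s) {
        return replace(NON_WORD_RUN, s, "<br><br>");
    }

    // Two processors: every line ending becomes <br>, then long <br> runs are collapsed.
    public static String twoPass(String s) {
        return replace(BR_RUN, replace(LINE_END, s, "<br>"), "<br><br>");
    }

    public static void main(String[] args) {
        String blankLines = "Dear Sir,  \n\n \n \n\n I am terminating";
        String withPunct = "Regards,\n\n-- Edwin";

        System.out.println(singlePass(blankLines)); // Dear Sir,  <br><br>I am terminating
        System.out.println(twoPass(blankLines));    // Dear Sir,<br><br>I am terminating
        System.out.println(singlePass(withPunct));  // Regards,<br><br>Edwin
        System.out.println(twoPass(withPunct));     // Regards,<br><br>-- Edwin
    }
}
```

With only whitespace between the newlines, both routes end at two <br>. They differ once punctuation follows a blank line: \W consumes the "-- ", while the second pattern leaves it alone. Any speed difference would have to be measured on real documents.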

On Thu, 14 Mar 2019 at 09:36, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Paul,
>
> Thanks for your reply.
>
> So far we have not found any cases of punctuation being removed.
>
> Our aim is to collapse each run of newlines (\n) into 2 <br>, and the runs
> are not likely to have any punctuation in between.
>
> Do you know if the pattern <str name="pattern">(\n\W*){2,}</str> that we are
> using is OK?
> Or would the other pattern, <str name="pattern">[ \t\x0b\f]*\r?\n</str>,
> be better?
>
> Regards,
> Edwin
>
> On Wed, 13 Mar 2019 at 20:08, <paul.d...@ub.unibe.ch> wrote:
>
>> Hi Edwin,
>> With \W you will also replace non-word characters such as punctuation. If
>> that's OK, fine. Otherwise you need to identify the whitespace characters
>> that are causing the problem.
>> ________________________________
>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> Sent: Wednesday, 13 March 2019 03:25:39
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi,
>>
>> We have managed to resolve the issue by changing the \s to \W. The likely
>> reason is that some of the characters between the newlines are not plain
>> whitespace, so \s does not match them, whereas \W matches any non-word
>> character, whitespace included.
>>
>> We have used this config, and it works.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){2,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){1,}</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Has anyone else faced the same issue before?
>> > So far none of the regex patterns that we have tried in this thread has
>> > resolved the issue.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> Sorry, I realized there is an extra ']' in the pattern provided, which
>> >> is why there are so many <br> in the output.
>> >>
>> >> The output is exactly the same as previously (previous index result) if
>> >> we remove the extra ']', as shown in the configuration below.
>> >>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>    <str name="replacement">&lt;br&gt;</str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
>> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >>
>> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >> wrote:
>> >>
>> >>> Hi Paul,
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> For the 2nd pattern, if we put this pattern <str
>> >>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
>> >>> configurations below:
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>    <str name="replacement">&lt;br&gt;</str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>> >>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> It does not reduce the runs of 3 or more <br> to 2 <br>.
>> >>>
>> >>> We will end up with many <br> in the output, like the example below:
>> >>>
>> >>>  http://www.concorded.com/<br><br>
>> <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
>> On Tue, Dec 18, 2018
>> >>>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, 7 Mar 2019 at 20:44, <paul.d...@ub.unibe.ch> wrote:
>> >>>
>> >>>> Hi Edwin
>> >>>>
>> >>>>
>> >>>>
>> >>>> I can’t understand why the pattern is not working and where the spaces
>> >>>> between the <br> are coming from. It should, however, be possible to
>> >>>> allow for spaces between the <br> in the second match pattern, i.e.:
>> >>>>
>> >>>>
>> >>>>
>> >>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>> >>>>
>> >>>>
>> >>>>
>> >>>> /Paul
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> Sent: Wednesday, 6 March 2019 16:28
>> >>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>>
>> >>>>
>> >>>>
>> >>>> Hi Paul,
>> >>>>
>> >>>> I have tried with the first match pattern set to
>> >>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>> >>>>
>> >>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>>    <str name="fieldName">content</str>
>> >>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>>    <str name="replacement">&lt;br&gt;</str>
>> >>>>    <bool name="literalReplacement">true</bool>
>> >>>> </processor>
>> >>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>>    <str name="fieldName">content</str>
>> >>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>> >>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>>    <bool name="literalReplacement">true</bool>
>> >>>> </processor>
>> >>>>
>> >>>> However, the result is still the same as before (previous index
>> >>>> results), with the 4 <br>.
>> >>>>
>> >>>> Regards,
>> >>>> Edwin
>> >>>>
>> >>>>
>> >>>> On Wed, 6 Mar 2019 at 18:23, <paul.d...@ub.unibe.ch> wrote:
>> >>>>
>> >>>> > Hi Edwin
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > You are correct re the 2nd pattern, my bad. Looking at the 4 <br>,
>> >>>> > it’s actually the sequence «<br><br>  <br><br>»? So perhaps the first
>> >>>> > match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > i.e. [space tab vertical-tab formfeed]
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > Regards,
>> >>>> >
>> >>>> > Paul
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > Sent: Wednesday, 6 March 2019 07:44
>> >>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > Hi Paul,
>> >>>> >
>> >>>> > I have modified the second pattern to be (&lt;br&gt;){3,} instead of
>> >>>> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
>> >>>> > actually looks for 6 or more <br> instead of 3, because the <br>
>> >>>> > appears twice inside the group. That is why there were more <br> in
>> >>>> > the result: runs of fewer than 6 <br> were not replaced, so we ended
>> >>>> > up with up to 5 <br> in the index.
>> >>>> >
>> >>>> > Modified configuration:
>> >>>> >  <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> >    <str name="fieldName">content</str>
>> >>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>> >>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> >    <bool name="literalReplacement">true</bool>
>> >>>> >  </processor>
>> >>>> >
>> >>>> > This brings us back to the result of the previous index content,
>> >>>> > meaning the issue of having 4 <br> is still there.
>> >>>> >
>> >>>> > Regards,
>> >>>> > Edwin
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > wrote:
>> >>>> >
>> >>>> > > Hi Paul,
>> >>>> > >
>> >>>> > > Further to my previous email, in which there was an extra "}" in the
>> >>>> > > configuration, I have changed to the configuration below, based on
>> >>>> > > your suggestion.
>> >>>> > >
>> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >    <str name="fieldName">content</str>
>> >>>> > >    <str name="pattern">[ \t]*\r?\n</str>
>> >>>> > >    <str name="replacement">&lt;br&gt;</str>
>> >>>> > >    <bool name="literalReplacement">true</bool>
>> >>>> > > </processor>
>> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >    <str name="fieldName">content</str>
>> >>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >    <bool name="literalReplacement">true</bool>
>> >>>> > > </processor>
>> >>>> > >
>> >>>> > > However, the result that I get still has more than 2 <br>. In fact,
>> >>>> > > the result has become worse, as you can see from the comparison below.
>> >>>> > >
>> >>>> > > Example 1: The sentence for which the regex pattern used to work
>> >>>> > > correctly. With the latest pattern, it has changed from 2 <br> to
>> >>>> > > 5 <br>, which is wrong.
>> >>>> > > *Original content in EML file:*
>> >>>> > > Dear Sir,
>> >>>> > >
>> >>>> > >
>> >>>> > > I am terminating
>> >>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>> >>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>> >>>> > >
>> >>>> > > Example 2: The sentence that the above regex pattern is partially
>> >>>> working
>> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>>> > > *Original content in EML file:*
>> >>>> > >
>> >>>> > > *exalted*
>> >>>> > >
>> >>>> > > *Psalm 89:17*
>> >>>> > >
>> >>>> > >
>> >>>> > > 3 Choa Chu Kang Avenue 4
>> >>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>  \n\n  3
>> >>>> Choa
>> >>>> > > Chu Kang Avenue 4, Singapore
>> >>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>
>> >>>> <br><br>  3
>> >>>> > > Choa Chu Kang Avenue 3, Singapor4
>> >>>> > >
>> >>>> > > Example 3: The sentence that the above regex pattern is partially
>> >>>> working
>> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). For the
>> >>>> latest
>> >>>> > code,
>> >>>> > > there are now 5 <br>
>> >>>> > > *Original content in EML file:*
>> >>>> > >
>> >>>> > > http://www.concorded.com/
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n
>> >>>> \n\n \n\n
>> >>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> >>>> 2018 at
>> >>>> > > 10:07 AM
>> >>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>> >>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > > *Current Index content:* http://www.concorded.com/<br><br>
>> >>>> <br><br><br>
>> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >
>> >>>> > >
>> >>>> > > Regards,
>> >>>> > > Edwin
>> >>>> > >
>> >>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > > wrote:
>> >>>> > >
>> >>>> > >> Hi Paul,
>> >>>> > >>
>> >>>> > >> Thank you for the reply.
>> >>>> > >>
>> >>>> > >> I have tried adding the following configuration according to your
>> >>>> > >> suggestion:
>> >>>> > >>
>> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>    <str name="fieldName">content</str>
>> >>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>> >>>> > >>    <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>    <bool name="literalReplacement">true</bool>
>> >>>> > >> </processor>
>> >>>> > >>
>> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>    <str name="fieldName">content</str>
>> >>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>    <bool name="literalReplacement">true</bool>
>> >>>> > >> </processor>
>> >>>> > >>
>> >>>> > >> However, none of the \n is being removed this time round.
>> >>>> > >> Is the order and/or the pattern correct?
>> >>>> > >>
>> >>>> > >> Regards,
>> >>>> > >> Edwin
>> >>>> > >>
>> >>>> > >> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote:
>> >>>> > >>
>> >>>> > >>> Hi Edwin
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Try for the first pattern/replacement
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>> >>>> > >>>
>> >>>> > >>> <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Now all line endings and preceding whitespace characters should
>> >>>> > >>> be changed to ‘<br>’.
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’
>> >>>> > >>> sequences with 2 ‘<br>’ sequences:
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >>>
>> >>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Hope this approach works. Sorry for not replying earlier, and best
>> >>>> > >>> regards,
>> >>>> > >>>
>> >>>> > >>> Paul
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > >>> Sent: Tuesday, 5 March 2019 03:35
>> >>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Hi,
>> >>>> > >>>
>> >>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as
>> >>>> well.
>> >>>> > >>>
>> >>>> > >>> Regards,
>> >>>> > >>> Edwin
>> >>>> > >>>
>> >>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> wrote:
>> >>>> > >>>
>> >>>> > >>> > Hi,
>> >>>> > >>> >
>> >>>> > >>> > Does anyone else have other suggestions, or has anyone faced the
>> >>>> > >>> > same problem?
>> >>>> > >>> >
>> >>>> > >>> > Regards,
>> >>>> > >>> > Edwin
>> >>>> > >>> >
>> >>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> > wrote:
>> >>>> > >>> >
>> >>>> > >>> >> Hi Paul,
>> >>>> > >>> >>
>> >>>> > >>> >> If I execute the second step first, then I only get a single
>> >>>> > >>> >> <br> where there were 2 <br>.
>> >>>> > >>> >> Where we originally got 4 <br>, there are now 2 <br> with a
>> >>>> > >>> >> space in between.
>> >>>> > >>> >>
>> >>>> > >>> >> This just changes the 2 <br> into a single <br>, since the
>> >>>> > >>> >> second step replaces with a single <br>.
>> >>>> > >>> >> But it has not solved the underlying problem yet.
>> >>>> > >>> >>
>> >>>> > >>> >> Regards,
>> >>>> > >>> >> Edwin
>> >>>> > >>> >>
>> >>>> > >>> >>
>> >>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:
>> >>>> > >>> >>
>> >>>> > >>> >>> If the second step is executed first, then you will get the
>> >>>> > >>> >>> unwanted 4 <br>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>> >>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>> Hi Jörn ,
>> >>>> > >>> >>>
>> >>>> > >>> >>> Do you mean the regex is not correct?
>> >>>> > >>> >>>
>> >>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps,
>> >>>> > >>> >>> like the ones shown below. The output that we get is still the same.
>> >>>> > >>> >>>
>> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>>      <str name="fieldName">content</str>
>> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> </processor>
>> >>>> > >>> >>>
>> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>>      <str name="fieldName">content</str>
>> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> </processor>
>> >>>> > >>> >>>
>> >>>> > >>> >>> Regards,
>> >>>> > >>> >>> Edwin
>> >>>> > >>> >>>
>> >>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
>> >>>> > >>> >>>
>> >>>> > >>> >>> > Then you need two RegexReplaceProcessorFactory steps
>> >>>> > >>> >>> >
>> >>>> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo
>> >>>> > >>> >>> > > <edwinye...@gmail.com> wrote:
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Hi,
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Thanks for the reply.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Do you know of any online regex tool that works correctly for
>> >>>> > >>> >>> > > Java regex? I tried to find some, but they are not working properly.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>> >>>> > >>> >>> > > single \n with a single <br>.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Regards,
>> >>>> > >>> >>> > > Edwin
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com>
>> >>>> > >>> >>> > >> wrote:
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
>> >>>> > >>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
>> >>>> > >>> >>> > >> that supports Java regex.
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >> I believe you want to have 2 regex processor factories: one that
>> >>>> > >>> >>> > >> deals with a single \n and one that deals with more than one \n.
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo
>> >>>> > >>> >>> > >>> <edwinye...@gmail.com> wrote:
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Hi,
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,},
>> >>>> > >>> >>> > >>> and configuration:
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> > >>> </processor>
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> However, the issue is still occurring.
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Is anyone else able to help?
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Regards,
>> >>>> > >>> >>> > >>> Edwin
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> >>> > >>> wrote:
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>>> Hi,
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> Regards,
>> >>>> > >>> >>> > >>>> Edwin
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>> wrote:
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>>> Hi,
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> Regards,
>> >>>> > >>> >>> > >>>>> Edwin
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>>> wrote:
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when we try
>> >>>> > >>> >>> > >>>>>> it on https://regex101.com/, it gives us the correct result for
>> >>>> > >>> >>> > >>>>>> all the examples (i.e. all of them have only <br><br>, and not
>> >>>> > >>> >>> > >>>>>> more than that, unlike what we are getting in Solr in our
>> >>>> > >>> >>> > >>>>>> earlier examples).
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Regards,
>> >>>> > >>> >>> > >>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>>>> wrote:
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>> >>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, using the following
>> >>>> > >>> >>> > >>>>>>> configuration:
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> >>>> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> However, we are getting exactly the same results as in the
>> >>>> > >>> >>> > >>>>>>> earlier Examples 1, 2 and 3.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> As for your point 2, that the data may contain other
>> >>>> > >>> >>> > >>>>>>> (non-printing) characters than \n, we have found that there are
>> >>>> > >>> >>> > >>>>>>> no non-printing characters. It is just a newline with a space.
>> >>>> > >>> >>> > >>>>>>> You can refer to the original content in the same examples below.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> working
>> >>>> > >>> >>> > >>>>>>> correctly
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>> Dear Sir,
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> I am terminating
>> >>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
>> am
>> >>>> > >>> terminating
>> >>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> >>>> terminating
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>> >>>> are 4
>> >>>> > >>> <br>)
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> *exalted*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> *Psalm 89:17*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> >>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm
>> 89:17
>> >>>>  \n\n
>> >>>> > >>> >>>  \n\n  3
>> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >>>>  <br><br>
>> >>>> > >>> >>> <br><br>3
>> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>> >>>> are 4
>> >>>> > >>> <br>)
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>> *Original content:*
>> >>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  \n\n
>> >>>> > >>> >>> >  \n\n
>> >>>> > >>> >>> > >> \n
>> >>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n
>> \n\n\n
>> >>>> > >>> \n\n\n  On
>> >>>> > >>> >>> Tue,
>> >>>> > >>> >>> > >> Dec 18,
>> >>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>> *Index content: *
>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  <br><br>
>> >>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> I would appreciate any other ideas or suggestions that you may have.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Hi Edwin
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
>> >>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> >>>> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
>> >>>> > >>> >>> > >>>>>>>> than \n?
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> >>>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> We have tried the suggested regex pattern as follows:
>> >>>> > >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2
>> >>>> > >>> >>> > >>>>>>>> and 3 below.
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> working
>> >>>> > >>> >>> > >>>>>>>> correctly
>> >>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n
>> I am
>> >>>> > >>> >>> terminating
>> >>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> >>>> terminating
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>>> working
>> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> >>>> <br>)
>> >>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm
>> 89:17
>> >>>> >  \n\n
>> >>>> > >>> >>>  \n\n
>> >>>> > >>> >>> > 3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >>>>  <br><br>
>> >>>> > >>> >>> > <br><br>3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 3: A sentence where the above regex pattern only partially works
>> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>>> > >>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/  \n\n  \n\n \n \n\n
>> >>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>> *Index content: * http://www.concordpri.moe.edu.sg/  <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Any further suggestion?
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>> >>>> > >>> >>> > >>>>>>>>> {2,} part you could try
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> >>>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> When I use this pattern:
>> >>>> > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> It works for some sentences within the same content and not for
>> >>>> > >>> >>> > >>>>>>>>> others. Please see below for one sentence where it works and two
>> >>>> > >>> >>> > >>>>>>>>> where it only partially works:
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 1: A sentence where the above regex pattern works correctly
>> >>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 2: A sentence where the above regex pattern only partially works
>> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17  \n\n  \n\n  3 Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17  <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 3: A sentence where the above regex pattern only partially works
>> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>>> > >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/  \n\n  \n\n \n \n\n
>> >>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>> *Index content: * http://www.concordpri.moe.edu.sg/  <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> We would appreciate your help in working out what is wrong.
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>> >>>> > >>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> ??
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> >>>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> >>>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Hi,
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to match two or
>> >>>> > >>> >>> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>> >>>> > >>> >>> > >>>>>>>>>> \n \n \n \n) and replace them with two <br>.
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
>> >>>> > >>> >>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
>> >>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> >>>> > >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>>> </updateRequestProcessorChain>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match
>> >>>> > >>> >>> > >>>>>>>>>> any whitespace that follows a \n, and {2,} tells it to match 2 or
>> >>>> > >>> >>> > >>>>>>>>>> more occurrences of that group (\n\s*).
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Please let me know what is wrong and how I should do it.
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> >
>> >>>> > >>> >>>
>> >>>> > >>> >>
>> >>>> > >>>
>> >>>> > >>
>> >>>> >
>> >>>>
>> >>>
>>
>
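
For anyone comparing the patterns quoted above, here is a minimal Java sketch of one possible explanation for the extra <br> pairs. It assumes the text between the newlines contains a non-breaking space (U+00A0), which Java's default \s does not match; the thread does not confirm which character was actually present. The class name, the sample strings and the (?U) variant are illustrative only and do not come from the thread. The third pattern mirrors the very first config, where the surrounding quotes and the doubled backslash are taken literally.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not part of Solr.
final class NewlineCollapseSketch {
    // Pattern suggested in the thread. By default, Java's \s covers only
    // ASCII whitespace: space, \t, \n, \x0B, \f and \r.
    static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");

    // Assumption: the same pattern with the embedded (?U) flag
    // (UNICODE_CHARACTER_CLASS), so that \s also matches U+00A0.
    static final Pattern UNICODE_WS = Pattern.compile("(?U)(\\n\\s*){2,}");

    // The first config's pattern, "(\\n\s*){2,}", read as a regex: literal
    // double quotes, then a literal backslash followed by the letter n.
    static final Pattern QUOTED = Pattern.compile("\"(\\\\n\\s*){2,}\"");

    // Replace every match of p in text with two <br> tags.
    static String collapse(Pattern p, String text) {
        return p.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Made-up sample with only ASCII whitespace between the newlines.
        String plain = "Psalm 89:17  \n\n  \n\n  3 Choa";
        // Made-up sample with a non-breaking space between the newline pairs.
        String nbsp = "Psalm 89:17  \n\n\u00A0\n\n  3 Choa";

        // One match, so one <br><br> pair.
        System.out.println(collapse(ASCII_WS, plain));
        // Two matches split by the U+00A0, so two <br><br> pairs.
        System.out.println(collapse(ASCII_WS, nbsp));
        // One match again, because (?U) lets \s consume the U+00A0.
        System.out.println(collapse(UNICODE_WS, nbsp));
        // No match: the text has no literal quotes or backslash-n sequences.
        System.out.println(collapse(QUOTED, plain));
    }
}
```

If the field content really does contain U+00A0 or similar characters, putting the (?U) prefix in the pattern string might be an alternative to \W that leaves punctuation alone. That is untested against the content discussed here and would need checking in Solr itself.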
