Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo Fri, 08 Feb 2019 06:19:28 -0800

Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on
https://regex101.com/, it gives the correct result for all the
examples (i.e. each of them has only <br><br>, and not more than that,
unlike what we are getting in Solr in our earlier examples).
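
For reference, the regex101 check can also be reproduced locally. The following is only a minimal sketch in Python, not Solr code: as far as I know Solr compiles the pattern with java.util.regex, so re.ASCII is used here to approximate Java's default \s, and the sample strings assume that every "\n" shown in Examples 1 and 2 is a real line feed and every gap is a plain space. The name collapse_newlines is made up for this illustration.

```python
import re

# Same pattern as in the Solr config. re.ASCII limits \s to [ \t\n\r\f\v],
# which approximates the default \s of java.util.regex (an assumption
# about how closely the two engines agree for this pattern).
PATTERN = re.compile(r"(\n\s*){2,}", re.ASCII)


def collapse_newlines(text: str) -> str:
    """Replace each run of 2+ newlines (with optional whitespace) by <br><br>."""
    return PATTERN.sub("<br><br>", text)


# Examples 1 and 2 from this thread, assuming plain spaces and line feeds.
examples = [
    "Dear Sir,  \n\n \n \n\n I am terminating",
    "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
    "3 Choa Chu Kang Avenue 4, Singapore",
]

for sample in examples:
    print(repr(collapse_newlines(sample)))
```

Under those assumptions each whitespace run collapses to a single <br><br>, which is what regex101 shows and differs from the Solr index content for Example 2.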


Could there be a possibility of a bug in Solr?
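
One way to double-check the non-printing character question before concluding that it is a Solr bug is to dump the code points of anything that looks blank but is not ASCII whitespace. This is a hedged Python sketch, not something taken from Solr; suspicious_chars and the sample strings are hypothetical, and the no-break space (U+00A0) is only an example of a character that would split a run, not a confirmed cause.

```python
import re
import unicodedata

ASCII_WS = set(" \t\n\r\f\v")
# re.ASCII approximates the default \s of java.util.regex (assumption).
PATTERN = re.compile(r"(\n\s*){2,}", re.ASCII)


def suspicious_chars(text):
    """List (index, code point, name) of blank-looking non-ASCII-whitespace chars."""
    found = []
    for i, ch in enumerate(text):
        if ch in ASCII_WS:
            continue
        category = unicodedata.category(ch)
        # Zs/Zl/Zp: Unicode separators, Cc/Cf: control and format characters.
        if ch.isspace() or category in ("Cc", "Cf", "Zs", "Zl", "Zp"):
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Hypothetical input: a no-break space sits between two newline pairs.
sample = "A\n\n\u00a0\n\nB"
print(suspicious_chars(sample))
print(repr(PATTERN.sub("<br><br>", sample)))
```

In this made-up sample the no-break space is not matched by the ASCII \s, so the run is split and two <br><br> come out with the stray character between them. If the function returns an empty list for the real field value, this explanation can be ruled out.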

Regards,
Edwin

On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Paul,
>
> We have tried it with the whitespace preceding the \n, i.e. <str
> name="pattern">(\s*\n){2,}</str>, using the following configuration:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n){2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> </processor>
>
> However, we are still getting exactly the same results as in the earlier
> Examples 1, 2 and 3.
>
> As for your point 2, that the data may contain other (non-printing)
> characters besides \n: we have found no non-printing characters.
> There are only newlines and spaces. You can refer to the original content
> in the same examples below.
>
>
> Example 1: A sentence where the above regex pattern works correctly
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content: *    Dear Sir,  <br><br>I am terminating
>
> Example 2: A sentence where the above regex pattern only partially works
> (as you can see, there are 4 <br> instead of 2)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
> Chu Kang Avenue 4, Singapore
>
> Example 3: A sentence where the above regex pattern only partially works
> (as you can see, there are 4 <br> instead of 2)
> *Original content in EML file:*
>
> http://www.concordpri.moe.edu.sg/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> 2018 at 10:07 AM
> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
> Tue, Dec 18, 2018 at 10:07 AM
>
>
> Appreciate any other ideas or suggestions that you may have.
>
> Thank you.
>
> Regards,
> Edwin
>
> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>>
>>
>>   1.  Sorry, the pattern was wrong; the space should precede the \n, i.e.
>> <str name="pattern">(\s*\n){2,}</str>
>>   2.  Perhaps in the data you have other (non printing) characters than
>> \n?
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> Sent: Thursday, 7 February 2019 15:23
>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi Paul,
>>
>> We have tried the suggested regex pattern as follows:
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\s*){2,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> </processor>
>>
>> But we still have exactly the same problem with Examples 1, 2 and 3 below.
>>
>> Example 1: A sentence where the above regex pattern works correctly
>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>
>> Example 2: A sentence where the above regex pattern only partially works
>> (as you can see, there are 4 <br> instead of 2)
>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> Chu Kang Avenue 4, Singapore
>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
>> Chu Kang Avenue 4, Singapore
>>
>> Example 3: A sentence where the above regex pattern only partially works
>> (as you can see, there are 4 <br> instead of 2)
>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>> \n\n
>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> 2018
>> at 10:07 AM
>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
>> Tue, Dec 18, 2018 at 10:07 AM
>>
>> Any further suggestions?
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>
>> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>> > part you could try
>> >
>> >
>> >
>> > <str name="pattern">(\n\s*){2,}</str>
>> >
>> >
>> >
>> > If you also want to match CRLF then
>> >
>> > <str name="pattern">(\r?\n\s*){2,}</str>
>> >
>> >
>> >
>> >
>> >
>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > Windows 10
>> >
>> >
>> >
>> > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> > Sent: Thursday, 7 February 2019 15:10
>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >
>> >
>> >
>> > Hi Paul,
>> >
>> > Thanks for your reply.
>> >
>> > When I use this pattern:
>> > <processor class="solr.RegexReplaceProcessorFactory">
>> >    <str name="fieldName">content</str>
>> >    <str name="pattern">(\n+\s*){2,}</str>
>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > </processor>
>> >
>> > It works for some sentences within the same content but not for
>> > others. Please see below for one that works and others that do not
>> > (they work only partially):
>> >
>> > Example 1: A sentence where the above regex pattern works correctly
>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>> >
>> > Example 2: A sentence where the above regex pattern only partially works
>> > (as you can see, there are 4 <br> instead of 2)
>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> > Chu Kang Avenue 4, Singapore
>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
>> > Chu Kang Avenue 4, Singapore
>> >
>> > Example 3: A sentence where the above regex pattern only partially works
>> > (as you can see, there are 4 <br> instead of 2)
>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>> > \n\n
>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> > 2018 at 10:07 AM
>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
>> > Tue, Dec 18, 2018 at 10:07 AM
>> >
>> > We would appreciate your help in seeing what is wrong.
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>> >
>> > > You don’t say what happens, just that it is not working. I assume
>> > > nothing is replaced? Perhaps the pattern should be
>> > >
>> > >
>> > >
>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>> > >
>> > >
>> > >
>> > > ??
>> > >
>> > >
>> > >
>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > > Windows 10
>> > >
>> > >
>> > >
>> > > From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>> > > Sent: Thursday, 7 February 2019 14:08
>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >
>> > >
>> > >
>> > > Hi,
>> > >
>> > > I am trying to use the RegexReplaceProcessorFactory to match two or
>> > > more \n with any number of spaces between them (e.g. \n\n, \n \n,
>> > > \n \n  \n \n), and replace them with two <br>.
>> > >
>> > > I use the following regex pattern, and it works when I test it on
>> > > regex101.com. But it does not work when I put it inside the
>> > > RegexReplaceProcessorFactory as below:
>> > >
>> > > <updateRequestProcessorChain name="removeCode">
>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>> > >     <str name="fieldName">content</str>
>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>> > >     <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >   </processor>
>> > > </updateRequestProcessorChain>
>> > >
>> > > To explain my regex pattern further: \s* tells the regex to also match
>> > > any whitespace that follows a \n, and {2,} tells it to match 2 or more
>> > > consecutive occurrences of that group (\n plus trailing whitespace).
>> > >
>> > > Please let me know what is wrong and how I should do it.
>> > >
>> > > I am using Solr 7.6.0.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> >
>>
>
