Re: pcre2: newline any => anycrlf

YASUOKA Masahiko Wed, 02 Nov 2022 08:51:42 -0700

I found a past thread.

https://marc.info/?t=165731897400003&r=1&w=2


On Fri, 08 Jul 2022 17:42:37 -0700
Nam Nguyen <n...@berkeley.edu> wrote:
> pcre2build(3): NEWLINE RECOGNITION
> This explains --enable-newline-is-any as a flexible combination of CR,
> LF, CRLF, plus Unicode newline sequences.

When non-ASCII Unicode characters are stored in a byte sequence, UTF-8
is normally used, so \u0085 (NEL) becomes \xc2 \x85.  A lone \x85 byte
might be part of NEL, but it might also be part of another character
(e.g. a kanji).
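
A small sketch of that point, in Python and for illustration only (the
helper names utf8_hex and contains_byte_85 are made up for this example;
nothing here comes from pcre2 itself).  It prints the UTF-8 bytes of NEL
and of the kanji U+8005 used in the test case quoted below:

```python
# Illustration only: inspect the UTF-8 bytes of the characters discussed.
NEL = "\u0085"    # Unicode NEXT LINE
KANJI = "\u8005"  # the kanji used in the quoted PHP test case


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def contains_byte_85(ch: str) -> bool:
    """True if the UTF-8 encoding of ch contains a 0x85 byte."""
    return 0x85 in ch.encode("utf-8")


if __name__ == "__main__":
    for ch in (NEL, KANJI):
        print(f"U+{ord(ch):04X}: {utf8_hex(ch)}  has 0x85: {contains_byte_85(ch)}")
```

Both characters contain a 0x85 byte, so a byte-oriented matcher cannot
tell from 0x85 alone whether it is looking at NEL or at the tail of
another character.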

> mediawiki might have to be adapted. --enable-newline-is-any seems
> reasonable as a default since it is flexible and can be overridden by
> consumers.

At least PHP would have to be adapted.

Since the default pcre is --enable-newline-is-nl, I suppose many
people don't expect the behavior of --enable-newline-is-any.
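
To make the difference concrete, here is a rough simulation in Python.
It is not PCRE2 code: first_line and the two byte sets are hypothetical
stand-ins, and I am assuming a byte-oriented (non-UTF) match where "any"
recognizes CR, LF, VT, FF and NEL as single bytes while "anycrlf"
recognizes only CR and LF:

```python
# Rough simulation, not PCRE2 code: where does the first "line" end when
# newline characters are recognized one byte at a time?
ANYCRLF = {b"\r", b"\n"}                     # CR and LF (CRLF starts with CR)
ANY = ANYCRLF | {b"\x0b", b"\x0c", b"\x85"}  # plus VT, FF, NEL as single bytes


def first_line(data: bytes, newlines: set) -> bytes:
    """Return the bytes before the first byte that is in `newlines`."""
    for i in range(len(data)):
        if data[i:i + 1] in newlines:
            return data[:i]
    return data


# Same subject string as in the quoted PHP test case.
subject = "\u8005 hogehoge".encode("utf-8")

if __name__ == "__main__":
    print("any:    ", first_line(subject, ANY))
    print("anycrlf:", first_line(subject, ANYCRLF))
```

With the "any" set the first line is cut after the first two bytes of
the kanji; with the "anycrlf" set the whole string is one line.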


On Wed, 02 Nov 2022 23:50:40 +0900 (JST)
YASUOKA Masahiko <yasu...@openbsd.org> wrote:
> Hello,
> 
> Currently pcre2 is configured with "--enable-newline-is-any".  With
> that option, the library treats 0x85 as a newline char.  But in UTF-8,
> 0x85 occurs as a byte in at least some common kanji characters.  So
> pcre2 cannot properly handle text that includes such characters.
> 
> Since --enable-newline-is-any conflicts with using UTF-8, I think we
> should change it to --enable-newline-is-anycrlf to avoid the
> conflict.
> 
> https://github.com/PCRE2Project/pcre2/blob/pcre2-10.37/src/pcre2_internal.h#L663
>     657 /* In ASCII/Unicode, linefeed is '\n' and we equate this to NL for
>     658 compatibility. NEL is the Unicode newline character; make sure it is
>     659 a positive value. */
>     660 
>     661 #define CHAR_LF                     '\n'
>     662 #define CHAR_NL                     CHAR_LF
> ->  663 #define CHAR_NEL                    ((unsigned char)'\x85')
>     664 #define CHAR_ESC                    '\033'
>     665 #define CHAR_DEL                    '\177'
>     666 #define CHAR_NBSP                   ((unsigned char)'\xa0')
> 
> \u8005 is "\xe8\x80\x85" in UTF-8, which includes "\x85".
> https://glyphwiki.org/wiki/u8005
> 
> test code in php:
> 
>   <?php
>     $test = "\u{8005} hogehoge";
>     if (preg_match("/^(.+)$/m", $test, $match)) {
>         print("result: " . str_ends_with($match[1], "hoge") .
>           " (should be 1)\n");
>     }
>    ?>
> 
> ok?
> 
> 
> Specify --enable-newline-is-anycrlf instead of --enable-newline-is-any,
> which doesn't work properly with UTF-8 text.  The latter option treats
> 0x85, which occurs in the UTF-8 encoding of some kanji, as a newline char.
> 
> Index: devel/pcre2/Makefile
> ===================================================================
> RCS file: /cvs/ports/devel/pcre2/Makefile,v
> retrieving revision 1.16
> diff -u -p -r1.16 Makefile
> --- devel/pcre2/Makefile      11 Mar 2022 18:52:29 -0000      1.16
> +++ devel/pcre2/Makefile      2 Nov 2022 14:02:31 -0000
> @@ -9,6 +9,8 @@ SHARED_LIBS +=  pcre2-posix             
>  
>  CATEGORIES = devel
>  
> +REVISION =   0
> +
>  MASTER_SITES =       https://ftp.pcre.org/pub/pcre/ \
>               ${MASTER_SITE_SOURCEFORGE:=pcre/} \
>               http://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ \
> @@ -27,7 +29,7 @@ LIB_DEPENDS =               archivers/bzip2
>  CONFIGURE_STYLE =    gnu
>  CONFIGURE_ARGS =     --enable-pcre2-16 \
>                       --enable-pcre2-32 \
> -                     --enable-newline-is-any \
> +                     --enable-newline-is-anycrlf \
>                       --enable-pcre2grep-libz \
>                       --enable-pcre2grep-libbz2 \
>                       --enable-pcre2test-libreadline
