I found a past thread. https://marc.info/?t=165731897400003&r=1&w=2
On Fri, 08 Jul 2022 17:42:37 -0700
Nam Nguyen <n...@berkeley.edu> wrote:
> pcre2build(3): NEWLINE RECOGNITION
> This explains --enable-newline-is-any as a flexible combination of CR,
> LF, CRLF, plus unicode newline sequences.

When non-ASCII Unicode characters are stored as a byte sequence, UTF-8
is used, so \u0085 becomes \xc2\x85.  A lone \x85 might be part of NEL,
but it might also be part of another character (e.g. kanji).

> mediawiki might have to be adapted. --enable-newline-is-any seems
> reasonable as a default since it is flexible and can be overridden by
> consumers.

At least php has to be adapted.  Since the upstream pcre default is
--enable-newline-is-lf, I suppose many people don't expect the behavior
of --enable-newline-is-any.

On Wed, 02 Nov 2022 23:50:40 +0900 (JST)
YASUOKA Masahiko <yasu...@openbsd.org> wrote:
> Hello,
>
> Currently pcre2 is configured with "--enable-newline-is-any".  With
> the option, the library treats 0x85 as a newline char.  But in UTF-8,
> the byte 0x85 also appears in at least some common kanji characters,
> so pcre2 cannot properly handle text that includes such characters.
>
> Since --enable-newline-is-any conflicts with using UTF-8, I think we
> should change it to --enable-newline-is-anycrlf to avoid the
> conflict.
>
> https://github.com/PCRE2Project/pcre2/blob/pcre2-10.37/src/pcre2_internal.h#L663
>    657 /* In ASCII/Unicode, linefeed is '\n' and we equate this to NL for
>    658 compatibility. NEL is the Unicode newline character; make sure it is
>    659 a positive value. */
>    660
>    661 #define CHAR_LF '\n'
>    662 #define CHAR_NL CHAR_LF
> -> 663 #define CHAR_NEL ((unsigned char)'\x85')
>    664 #define CHAR_ESC '\033'
>    665 #define CHAR_DEL '\177'
>    666 #define CHAR_NBSP ((unsigned char)'\xa0')
>
> \u8005 is "\xe8\x80\x85" in UTF-8, which includes "\x85".
> https://glyphwiki.org/wiki/u8005
>
> test code in php:
>
> <?php
> $test = "\u{8005} hogehoge";
> if (preg_match("/^(.+)$/m", $test, $match)) {
>     print("result: " . str_ends_with($match[1], "hoge") .
>         " (should be 1)\n");
> }
> ?>
>
> ok?
>
>
> Specify --enable-newline-is-anycrlf instead of --enable-newline-is-any
> which doesn't work properly with UTF-8 text.  The latter option treats
> 0x85, which is used for some kanji in UTF-8, as a newline char.
>
> Index: devel/pcre2/Makefile
> ===================================================================
> RCS file: /cvs/ports/devel/pcre2/Makefile,v
> retrieving revision 1.16
> diff -u -p -r1.16 Makefile
> --- devel/pcre2/Makefile	11 Mar 2022 18:52:29 -0000	1.16
> +++ devel/pcre2/Makefile	2 Nov 2022 14:02:31 -0000
> @@ -9,6 +9,8 @@ SHARED_LIBS += pcre2-posix
>
>  CATEGORIES = devel
>
> +REVISION = 0
> +
>  MASTER_SITES = https://ftp.pcre.org/pub/pcre/ \
>  		${MASTER_SITE_SOURCEFORGE:=pcre/} \
>  		http://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ \
> @@ -27,7 +29,7 @@ LIB_DEPENDS = archivers/bzip2
>  CONFIGURE_STYLE = gnu
>  CONFIGURE_ARGS = --enable-pcre2-16 \
>  		--enable-pcre2-32 \
> -		--enable-newline-is-any \
> +		--enable-newline-is-anycrlf \
>  		--enable-pcre2grep-libz \
>  		--enable-pcre2grep-libbz2 \
>  		--enable-pcre2test-libreadline
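
For anyone who wants to check the byte overlap without building pcre2,
here is a minimal Python sketch.  It does not call PCRE2 at all: the
helper names (utf8_hex, first_line_bytewise) are made up for
illustration, and splitting at 0x85 is only my rough emulation of how a
byte-oriented matcher would see the text when it treats every 0x85 byte
as a newline.

```python
# Sketch only: shows why a lone 0x85 byte is ambiguous in UTF-8 text.
# NEL_BYTE mirrors the value of CHAR_NEL in pcre2_internal.h; the helper
# functions are hypothetical and do not use PCRE2.

NEL_BYTE = 0x85


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def first_line_bytewise(text: str) -> bytes:
    """Emulate a byte-oriented matcher that ends a line at any 0x85 byte."""
    data = text.encode("utf-8")
    return data.split(bytes([NEL_BYTE]), 1)[0]


print(utf8_hex("\u0085"))                      # c2 85
print(utf8_hex("\u8005"))                      # e8 80 85
print(first_line_bytewise("\u8005 hogehoge"))  # b'\xe8\x80'
```

The last line matches what the php test shows: the "first line" stops
inside the kanji and never reaches "hoge".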