ID:               47526
 User updated by:  phpwnd at gmail dot com
 Reported By:      phpwnd at gmail dot com
 Status:           Bogus
 Bug Type:         PCRE related
 Operating System: *
 PHP Version:      5.3CVS-2009-02-28 (CVS)
 New Comment:

My point exactly. Why do we have an escape sequence for surrogates when
they are invalid and it doesn't work anyway? \p{Cs} appears in the
manual (http://docs.php.net/manual/en/regexp.reference.php) under
"Supported property codes".

Also, why do preg_match() and preg_replace() fail differently?
preg_match returns 0, which lets the user believe the input was valid
but didn't match, whereas preg_replace() returns NULL, which indicates
the input was invalid. I cannot verify what preg_last_error() says right
now as I'm having trouble with the latest CVS.
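
For whoever has a working build, a minimal sketch of how the two calls
could be compared. It only uses preg_match(), preg_replace() and
preg_last_error(); the variable names are illustrative, and the return
values are deliberately not asserted since they may differ between PHP
versions.

```php
<?php
// "." followed by the three bytes that would encode U+D800
$subject = ".\xED\xA0\x80";

// Record the error code right after each call, before the next
// PCRE call overwrites it.
$matchResult = preg_match('#.#u', $subject);
$matchError  = preg_last_error();

$replaceResult = preg_replace('#\p{Cs}#u', '', $subject);
$replaceError  = preg_last_error();

var_dump($matchResult, $matchError === PREG_BAD_UTF8_ERROR);
var_dump($replaceResult, $replaceError === PREG_BAD_UTF8_ERROR);
```

If both error checks print bool(true), the two functions agree that the
subject is bad UTF-8 and differ only in the value they return.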


Previous Comments:
------------------------------------------------------------------------

[2009-04-10 15:58:31] nlop...@php.net

As far as I understand that codepoint is invalid in UTF-8.
If you call preg_last_error() after preg_match() it will return
PREG_BAD_UTF8_ERROR, confirming my hypothesis.
So no bug here.

------------------------------------------------------------------------

[2009-02-28 08:51:18] phpwnd at gmail dot com

Description:
------------
According to http://docs.php.net/manual/en/regexp.reference.php PCRE
functions should be able to match surrogates in Unicode mode. However,
it is my understanding that surrogates are not allowed in UTF-8, which
is the encoding used by the Unicode mode. That would explain why
preg_match() and preg_replace() fail when operating on UTF-8-encoded
surrogates.

Note that the two functions fail in different ways: preg_match() returns
0 whereas preg_replace() returns NULL.

I'm not sure what the fix should be. Being able to match surrogates
would make my life easier, but if it's not valid UTF-8 then it might be
more consistent (albeit in a twisted way) to return NULL, as that's what
PCRE functions do on invalid UTF-8.
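
As a possible workaround, and only a sketch under the assumption that
the rest of the subject is otherwise valid UTF-8, encoded surrogates can
be matched as raw bytes without the u modifier. U+D800 through U+DFFF
would encode as \xED followed by a byte in \xA0-\xBF and a byte in
\x80-\xBF, and \xED never occurs as a continuation byte, so the pattern
should not hit the middle of a valid sequence. The variable names are
illustrative.

```php
<?php
// Byte-mode pattern (no u modifier): ED A0 80 .. ED BF BF covers the
// byte sequences that would encode U+D800..U+DFFF.
$surrogateBytes = '#\xED[\xA0-\xBF][\x80-\xBF]#';

$subject = ".\xED\xA0\x80";

// Strip the encoded surrogate; the remaining string can then be passed
// to PCRE functions in Unicode mode.
$clean = preg_replace($surrogateBytes, '', $subject);

var_dump($clean); // string(1) "."
```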

Reproduce code:
---------------
// \xED\xA0\x80 is how the surrogate U+D800 would be encoded in UTF-8
var_dump(preg_match('#.#u', ".\xED\xA0\x80"));
var_dump(preg_replace('#\p{Cs}#u', '', ".\xED\xA0\x80"));

Expected result:
----------------
int(1)
string(1) "."

Actual result:
--------------
int(0)
NULL


------------------------------------------------------------------------

