Bug #55465 [Com]: preg_match segmentation fault when subject too large
Edit report at https://bugs.php.net/bug.php?id=55465&edit=1

ID: 55465
Comment by: masakielastic at gmail dot com
Reported by: zedwoodnoreply at zedwood dot com
Summary: preg_match segmentation fault when subject too large
Status: Not a bug
Type: Bug
Package: PCRE related
Operating System: Ubuntu 10.04
PHP Version: 5.3.7
Block user comment: N
Private report: N

New Comment:

This report is a duplicate. See https://bugs.php.net/bug.php?id=36463

Previous Comments:

[2011-08-19 22:42:28] fel...@php.net

Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://www.php.net/manual/ and the instructions on how to report a bug at http://bugs.php.net/how-to-report.php

This is a known behavior from PCRE library, it's not a PHP bug.
http://docs.php.net/manual/en/pcre.configuration.php

[2011-08-19 21:10:27] zedwoodnoreply at zedwood dot com

Description:
When I change $n_times to 8, and run the command line script php -f myscript.php, I get "Segmentation Fault". The error also occurs when run via apache:

    [Fri Aug 19 15:05:14 2011] [notice] child pid 11995 exit signal Segmentation fault (11)

If you change $n_times to be sufficiently large, preg_match seems to consistently seg fault. If I change $n_times to something lower like 1000, there is no seg fault.

Test script:
---------------
    // http://w3.org/International/questions/qa-forms-utf-8.html
    echo preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string) ? 'y' : 'n';
    die("\n");

Expected result:
----------------
'y' or 'n'

Actual result:
--------------
command line:
    Segmentation Fault
via apache error.log:
    [Fri Aug 19 15:05:14 2011] [notice] child pid 11995 exit signal Segmentation fault (11)
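The crash above comes from matching a repeated alternation against a very large subject. As a minimal sketch (not part of the original report), the same well-formedness question can be asked without that pattern, assuming the mbstring extension is available; the helper name is_valid_utf8 is hypothetical. The fallback relies on the u modifier, under which preg_match validates the subject and returns false for invalid UTF-8. The manual page linked in the reply also documents pcre.recursion_limit for cases where the original pattern must be kept.

```php
<?php
// Hypothetical helper, not taken from the bug report.
function is_valid_utf8($string)
{
    if (function_exists('mb_check_encoding')) {
        // Assumes mbstring is loaded; no regular expression involved.
        return mb_check_encoding($string, 'UTF-8');
    }
    // Fallback: an empty pattern with the u modifier only triggers
    // PCRE's UTF-8 validity check of the subject.
    return preg_match('//u', $string) === 1;
}

// A large subject, comparable in spirit to raising $n_times in the report.
$big = str_repeat("a\xC3\xA9", 100000);

var_dump(is_valid_utf8($big));                  // well-formed input
var_dump(is_valid_utf8($big . "\xF0\xA4\xAD")); // truncated 4-byte sequence
```

The check only answers yes or no; it does not say where the ill-formed bytes are.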
[PHP-BUG] Bug #65045 [NEW]: mb_convert_encoding breaks well-formed character
From: masakielastic at gmail dot com
Operating system: Mac OSX
PHP version: 5.5.0RC3
Package: mbstring related
Bug Type: Bug
Bug description: mb_convert_encoding breaks well-formed character

Description:
When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in order to replace ill-formed byte sequences with the substitute character (U+FFFD), mb_convert_encoding also replaces the character following the ill-formed byte sequence with the substitute character. In addition, mb_convert_encoding deletes a trailing ill-formed byte sequence instead of replacing it with the substitute character.

The comprehensive test case for 2-4 byte characters is here: https://gist.github.com/masakielastic/5793665 .

Test script:
---------------
    // U+24B62: "\xF0\xA4\xAD\xA2"
    // ill-formed: "\xF0\xA4\xAD"
    // U+FFFD: "\xEF\xBF\xBD"

    $str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
    $expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

    $str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
    $expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

    mb_substitute_character(0xFFFD);

    var_dump(
        $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
        $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')),
        $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
        $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
    );

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)

--
Edit bug report at https://bugs.php.net/bug.php?id=65045&edit=1
[PHP-BUG] Req #65079 [NEW]: mb_ereg_replace's e modifier should be deprecated
From: masakielastic at gmail dot com
Operating system: Any
PHP version: 5.5.0
Package: mbstring related
Bug Type: Feature/Change Request
Bug description: mb_ereg_replace's e modifier should be deprecated

Description:
mb_ereg_replace's e modifier should be deprecated to prevent PHP code execution, and an explanation of how to use mb_ereg_replace_callback (available since PHP 5.4.1) should be added to the manual.

PHP: code execution via mb_ereg_replace
http://vigilance.fr/vulnerability/PHP-code-execution-via-mb-ereg-replace-8711

The reason why preg_replace's e modifier was deprecated in PHP 5.5 also applies to mb_ereg_replace's e modifier.

http://www.php.net/manual/en/function.preg-replace.php
https://wiki.php.net/rfc/remove_preg_replace_eval_modifier

There is an example implementation of mb_ereg_replace_callback as a user function.

http://d.hatena.ne.jp/hnw/20110206

--
Edit bug report at https://bugs.php.net/bug.php?id=65079&edit=1
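To illustrate the replacement the request recommends, here is a small sketch of mb_ereg_replace_callback doing what the e modifier would otherwise do by evaluating a code string. It assumes the mbstring extension was built with multibyte regex support; the pattern and input string are made up for the example.

```php
<?php
// Assumes mbstring with multibyte regex support is enabled.
mb_regex_encoding('UTF-8');

// The callback receives the array of matches; $m[0] is the whole match.
// No string is evaluated as PHP code, unlike with the e modifier.
$result = mb_ereg_replace_callback(
    '[a-z]+',
    function (array $m) {
        return strtoupper($m[0]);
    },
    'abc 123 def'
);

echo $result, "\n"; // prints "ABC 123 DEF"
```

Because the replacement logic lives in a closure, matched text is passed as data and never interpolated into source code.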
[PHP-BUG] Bug #65080 [NEW]: ctype_lower detects non-lower characters
From: masakielastic at gmail dot com
Operating system: Mac OSX
PHP version: 5.5.0
Package: Strings related
Bug Type: Bug
Bug description: ctype_lower detects non-lower characters

Description:
ctype_lower detects non-lower characters when the locale is set to 'en_US.UTF-8' on Mac OSX 10.8. This cannot be reproduced on Ubuntu Linux.

This means ctype_lower treats Chinese characters and Hangul (the Korean alphabet), which have no concept of lower and upper case, as lowercase. Test cases written in C, which show the misdetected characters, can be seen here:

https://gist.github.com/masakielastic/5828106

Judging from Xcode's manual, tests on BSD-compatible OSes are needed.

http://developer.apple.com/library/Mac/documentation/Darwin/Reference/ManPages/man3/islower.3.html

ctype_upper also detects non-upper characters.

Test script:
---------------
    $expected = [];
    $result = [];

    for ($i = 0; $i <= 0xFF; ++$i) {
        setlocale(LC_ALL, 'C');

        if (ctype_lower(chr($i))) {
            $expected[] = $i;
        }

        setlocale(LC_ALL, 'en_US.UTF-8');

        if (ctype_lower(chr($i))) {
            $result[] = $i;
        }
    }

    var_dump(
        [] === array_diff($result, $expected)
    );

Expected result:
----------------
bool(true)

Actual result:
--------------
bool(false)

--
Edit bug report at https://bugs.php.net/bug.php?id=65080&edit=1
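Where only ASCII lowercase letters are meant, a check that never consults the locale sidesteps the platform difference described above. The following is a sketch under that assumption; is_ascii_lower is a hypothetical helper name, and it deliberately returns false for the empty string.

```php
<?php
// Hypothetical helper: true only for non-empty strings made of ASCII a-z.
// strspn compares raw bytes, so setlocale() has no effect on the result.
function is_ascii_lower($text)
{
    if (!is_string($text) || $text === '') {
        return false;
    }
    return strspn($text, 'abcdefghijklmnopqrstuvwxyz') === strlen($text);
}

// Same loop shape as the test script in the report, over all byte values.
$lower = [];
for ($i = 0; $i <= 0xFF; ++$i) {
    if (is_ascii_lower(chr($i))) {
        $lower[] = $i;
    }
}

var_dump(count($lower)); // int(26)
```

This is narrower than ctype_lower by design: it does not accept locale-specific lowercase letters at all.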
[PHP-BUG] Req #65081 [NEW]: new function for replacing ill-formed byte sequences with substitute characters
From: masakielastic at gmail dot com
Operating system: All
PHP version: 5.5.0
Package: mbstring related
Bug Type: Feature/Change Request
Bug description: new function for replacing ill-formed byte sequences with substitute characters

Description:
A new function for replacing ill-formed byte sequences with substitute characters is needed. The problem with using mb_convert_encoding for that purpose is that the function name doesn't represent the intent. Specifying the same encoding twice is verbose and can look like a meaningless conversion to beginners.

    $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

A case study can be seen in Ruby. Ruby 2.1 introduces String#scrub.

http://bugs.ruby-lang.org/issues/6752
https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7770-L7783

Whether the substitute character should be configurable needs to be debated.

    function mb_scrub($str, $encoding = '', $substitute = '')
    {
        if ('' === $encoding) {
            $encoding = mb_internal_encoding();
        }

        if ('' === $substitute) {
            $ret = mb_convert_encoding($str, $encoding, $encoding);
        } else {
            // Temporarily switch the substitute character, then restore it.
            $before_substitute = mb_substitute_character();
            mb_substitute_character($substitute);
            $ret = mb_convert_encoding($str, $encoding, $encoding);
            mb_substitute_character($before_substitute);
        }

        return $ret;
    }

This discussion can be applied to UConverter.

    function uconverter_scrub($str, $encoding, $opts = null)
    {
        // Pass the options array only when the caller supplied one.
        if (null === $opts) {
            return UConverter::transcode($str, $encoding, $encoding);
        } else {
            return UConverter::transcode($str, $encoding, $encoding, $opts);
        }
    }

A discussion about the standard string functions and filter functions may also be needed, since htmlspecialchars can be used for the same purpose.

    function str_scrub($str, $encoding = 'UTF-8')
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
    }

--
Edit bug report at https://bugs.php.net/bug.php?id=65081&edit=1
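For readers on later PHP versions: a native function named mb_scrub was added to mbstring in PHP 7.2. The sketch below prefers it when present and otherwise falls back to the same-encoding conversion discussed in this request. The wrapper name scrub_string is hypothetical, and the example assumes the mbstring extension is loaded. It checks only that the result is well-formed, since the substitute character that appears depends on the mb_substitute_character setting.

```php
<?php
// Hypothetical wrapper, not an mbstring function.
function scrub_string($str, $encoding = 'UTF-8')
{
    if (function_exists('mb_scrub')) {
        // Native function, available from PHP 7.2.
        return mb_scrub($str, $encoding);
    }
    // Fallback described in the request: convert to the same encoding.
    return mb_convert_encoding($str, $encoding, $encoding);
}

$dirty = "abc\xF0\xA4\xAD"; // ends with a truncated 4-byte sequence
$clean = scrub_string($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8')); // ill-formed input
var_dump(mb_check_encoding($clean, 'UTF-8')); // scrubbed output
```

Well-formed strings pass through unchanged, so the wrapper is safe to apply unconditionally at an input boundary.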
[PHP-BUG] Req #65082 [NEW]: json_encode's option for replacing ill-formed byte sequences with substitute cha
From: masakielastic at gmail dot com
Operating system: All
PHP version: 5.5.0
Package: JSON related
Bug Type: Feature/Change Request
Bug description: json_encode's option for replacing ill-formed byte sequences with substitute cha

Description:
json_encode returns false if the string contains ill-formed byte sequences. The problem is hard to find because a lot of web applications don't expect ill-formed byte sequences to exist. One example is Symfony's JsonResponse class.

https://github.com/symfony/symfony/blob/master/src/Symfony/Component/HttpFoundation/JsonResponse.php#L83

Introducing a json_encode option for replacing ill-formed byte sequences with substitute characters (such as U+FFFD) would save writing logic like the following.

    function json_encode2($value, $options, $depth)
    {
        if (is_scalar($value)) {
            return json_encode($value, $options, $depth);
        }

        $value2 = [];

        foreach ($value as $key => $elm) {
            $value2[str_scrub($key)] = str_scrub($elm);
        }

        return json_encode($value2, $options, $depth);
    }

    // https://bugs.php.net/bug.php?id=65081
    function str_scrub($str, $encoding = 'UTF-8')
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
    }

The precedent is htmlspecialchars's ENT_SUBSTITUTE option, which was introduced in PHP 5.4. Since PHP 5.5, json_encode shares part of its logic, such as php_next_utf8_char, with htmlspecialchars.

https://github.com/php/php-src/blob/master/ext/json/json.c#L369

Another reason for introducing the option is the existence of the JsonSerializable interface. When the values returned by a jsonSerialize method come from private properties, accessing them in order to scrub them beforehand is hard or impossible.

One candidate name for the option is JSON_SUBSTITUTE, similar to htmlspecialchars's ENT_SUBSTITUTE option.

    json_encode($object, JSON_SUBSTITUTE);

--
Edit bug report at https://bugs.php.net/bug.php?id=65082&edit=1
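The failure mode described in this request can be observed through json_last_error, and the ENT_SUBSTITUTE round trip can be applied recursively so nested arrays are covered too. The following is a sketch only: scrub_value is a hypothetical helper, the sample data is invented, and objects are passed through untouched, which is exactly the JsonSerializable gap the request points out.

```php
<?php
// Hypothetical helper built on the ENT_SUBSTITUTE round trip from the report.
function scrub_value($value)
{
    if (is_string($value)) {
        return htmlspecialchars_decode(
            htmlspecialchars($value, ENT_SUBSTITUTE, 'UTF-8')
        );
    }
    if (is_array($value)) {
        $out = [];
        foreach ($value as $key => $elm) {
            // Integer keys are kept as they are; string keys are scrubbed.
            $out[is_string($key) ? scrub_value($key) : $key] = scrub_value($elm);
        }
        return $out;
    }
    return $value; // ints, floats, bools, null and objects pass through
}

$data = ['name' => "caf\xE9", 'tags' => ['ok', "bad\xFF"]];

$failed = json_encode($data);
var_dump($failed);                               // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

$encoded = json_encode(scrub_value($data));
var_dump(is_string($encoded));                   // bool(true)
```

Checking json_last_error right after a false return is what makes the otherwise silent failure visible.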
Req #65082 [Asn]: json_encode's option for replacing ill-formed byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

ID: 65082
User updated by: masakielastic at gmail dot com
Reported by: masakielastic at gmail dot com
Summary: json_encode's option for replacing ill-formed byte sequences with substitute cha
Status: Assigned
Type: Feature/Change Request
Package: JSON related
Operating System: All
PHP Version: 5.5.0
Assigned To: remi
Block user comment: N
Private report: N

New Comment:

Hi, thanks nikic and remi.

After some consideration, I have changed my mind. I think substituting U+FFFD for ill-formed sequences should be the default behavior. What do you think?

We might need to discuss consistency with the Escaper API. htmlspecialchars's ENT_SUBSTITUTE option has been adopted by Symfony and Zend Framework.

https://wiki.php.net/rfc/escaper

Although the change breaks 2 test suites, it doesn't break users' codebases. Judging from a search on GitHub, a lot of people don't use any options.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen with htmlspecialchars.

https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation when they are combined with the JSON_UNESCAPED_UNICODE option and with json_decode.

[two options]

json_encode
    JSON_NOTUTF8_SUBSTITUTE
    JSON_NOTUTF8_IGNORE
    JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
    JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
    JSON_NOTUTF8_SUBSTITUTE
    JSON_NOTUTF8_IGNORE

If JSON_NOTUTF8_SUBSTITUTE is the default behavior, the only option we need to consider is JSON_NOTUTF8_IGNORE.

[one option]

json_encode
    JSON_NOTUTF8_IGNORE
    JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
    JSON_NOTUTF8_IGNORE

Previous Comments:

[2013-07-10 13:48:35] r...@php.net

Here is a proposal for this issue:
https://github.com/remicollet/pecl-json-c/commit/5a499a4550d1f29f1f8eeb1b4ca0b01a33c64779

This adds 2 new options to json_encode:
- JSON_NOTUTF8_SUBSTITUTE (name seems better, at least to me), to replace not-utf8 char with the replacement char.
- JSON_NOTUTF8_IGNORE to ignore not-utf8 char (remove in escaped mode, keep without any check in unescaped mode)

[2013-06-21 07:26:33] ni...@php.net

It's currently possible to get a partial output using JSON_PARTIAL_OUTPUT_ON_ERROR. This will replace invalid UTF8 strings with NULL though. It probably would make sense to have an alternative option that inserts the substitution character.
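The behavior nikic describes can be seen with a two-element array: with JSON_PARTIAL_OUTPUT_ON_ERROR the string that cannot be encoded is emitted as null while the rest survives. The sketch also probes for JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE, the names under which options of this kind were eventually added in PHP 7.2; they are not the JSON_NOTUTF8_* names proposed in this thread, hence the defined() guard. The sample data is invented.

```php
<?php
$data = ['ok' => 'abc', 'bad' => "\xB1\x31"]; // "\xB1" is a stray continuation byte

// Partial output: the unencodable string becomes null, the rest is kept.
$partial = json_encode($data, JSON_PARTIAL_OUTPUT_ON_ERROR);
echo $partial, "\n";

// PHP 7.2+ only; these constants did not exist when this thread was written.
$substituted = null;
$ignored = null;
if (defined('JSON_INVALID_UTF8_SUBSTITUTE')) {
    $substituted = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE);
    $ignored     = json_encode($data, JSON_INVALID_UTF8_IGNORE);
    echo $substituted, "\n", $ignored, "\n";
}
```

The difference matters to consumers: null changes the type of the field, whereas substitution keeps it a string.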
Req #65082 [Asn]: json_encode's option for replacing ill-formed byte sequences with substitute cha
User updated by: masakielastic at gmail dot com

New Comment:

Hi remi, could you test my patch for the PHP_JSON_UNESCAPED_UNICODE option? The patch adopts the JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095

Previous Comments:

[2013-07-11 04:59:02] r...@php.net

I don't think changing the current behavior is a good idea, which is why I really prefer adding some new options.
Req #65082 [Asn]: json_encode's option for replacing ill-formed byte sequences with substitute cha
User updated by: masakielastic at gmail dot com

New Comment:

Hi, I fixed my patch and added a test case for json_decode.
Bug #62010 [Com]: json_decode produces invalid byte-sequences
Edit report at https://bugs.php.net/bug.php?id=62010&edit=1

ID: 62010
Comment by: masakielastic at gmail dot com
Reported by: tklingenberg at lastflood dot net
Summary: json_decode produces invalid byte-sequences
Status: Open
Type: Bug
Package: JSON related
Operating System: Windows
PHP Version: 5.3.13
Block user comment: N
Private report: N

New Comment:

Here is RFC 3629's description of the UTF-8 definition.

    The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

http://tools.ietf.org/html/rfc3629

The following patch solves part of the problem: isolated low surrogates (U+DC00 - U+DFFF) are replaced with U+FFFD. Handling of high surrogates (U+D800 - U+DBFF) still needs improvement.

https://gist.github.com/masakielastic/5985383

    var_dump(
        "\xef\xbf\xbd" === json_decode('"\udc00"'),
        "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
        "\xed\xa0\x80" === json_decode('"\ud800"')
    );

Consistency with the following options (under discussion) is needed too.

json_encode's option for replacing ill-formed byte sequences with substitute characters
https://bugs.php.net/bug.php?id=65082

Previous Comments:

[2013-01-11 09:44:55] votefordevnull at gmail dot com

Successfully reproduced on Linux

[2012-05-11 22:46:34] tklingenberg at lastflood dot net

Looks like that #41067 https://bugs.php.net/bug.php?id=41067 was not fully fixed.

[2012-05-11 22:12:42] tklingenberg at lastflood dot net

Description:
It's a typical case the JSON *and* UTF-16 specifications warn about: decoding of non-existing UTF-16 code-points:

    json_decode('"\ud834"')

should give NULL because \ud834 is *invalid*. But instead it starts some party, gets boozed and offers this as UTF-8 byte-sequence:

    1110 1101   1010 0000   1011 0100
    1110 xxxx   10xx xxxx   10xx xxxx
         1101     10 0000     11 0100
    = 1101 1000 0011 0100 = D8 34

U+D834 is not a valid unicode character.

Test script:
---------------
    if (NULL !== json_decode('"\ud834"')) {
        echo "json_decode is still broken.";
    }

Expected result:
----------------
NULL because the json is invalid.

Actual result:
--------------
PHP tries to create UTF-8 out of it and fails by creating invalid UTF-8 unicode byte-sequences.
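The reason a lone \ud834 cannot stand for a character follows from the UTF-16 arithmetic: a code point above U+FFFF exists only when a high surrogate is immediately followed by a low surrogate. The sketch below shows that arithmetic; the three function names are hypothetical helpers, not part of the JSON extension.

```php
<?php
// Hypothetical helpers illustrating UTF-16 surrogate arithmetic.
function is_high_surrogate($unit)
{
    return $unit >= 0xD800 && $unit <= 0xDBFF;
}

function is_low_surrogate($unit)
{
    return $unit >= 0xDC00 && $unit <= 0xDFFF;
}

// Combine a valid high/low pair into one code point, or return null
// when the two units do not form a pair.
function combine_surrogates($high, $low)
{
    if (!is_high_surrogate($high) || !is_low_surrogate($low)) {
        return null;
    }
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", combine_surrogates(0xD834, 0xDD1E)); // U+1D11E
var_dump(combine_surrogates(0xD834, 0x0041));         // NULL: no low surrogate follows
```

A decoder that finds null here has the choices debated in these reports: reject the input or emit U+FFFD.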
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

ID: 65082
User updated by: masakielastic at gmail dot com
Reported by: masakielastic at gmail dot com
Summary: json_encode's option for replacing ill-formd byte sequences with substitute cha
Status: Assigned
Type: Feature/Change Request
Package: JSON related
Operating System: All
PHP Version: 5.5.0
Assigned To: remi
Block user comment: N
Private report: N

New Comment:

I posted a patch for handling surrogates, since the range U+D800 - U+DFFF is not allowed in UTF-8 (RFC 3629). I need someone's help with handling high surrogates and with the options.
https://gist.github.com/masakielastic/5985383

json_decode produces invalid byte-sequences
https://bugs.php.net/bug.php?id=62010

Previous Comments:
----
[2013-07-11 09:48:54] masakielastic at gmail dot com

Hi, I fixed my patch and added a test case for json_decode.
----
[2013-07-11 08:37:51] masakielastic at gmail dot com

Hi remi, could you test my patch for the PHP_JSON_UNESCAPED_UNICODE option? The patch adopts the JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.
https://gist.github.com/masakielastic/5973095
----
[2013-07-11 04:59:02] r...@php.net

I don't think changing the current behavior is a good idea, which is why I really prefer some new options.
----
[2013-07-11 04:27:19] masakielastic at gmail dot com

Hi, thanks nikic and remi. After some consideration, I changed my mind. I think substituting U+FFFD for ill-formed sequences should be the default behavior. What do you think? We might need a discussion about consistency with the Escaper API. htmlspecialchars's ENT_SUBSTITUTE option is adopted by Symfony and Zend Framework.
https://wiki.php.net/rfc/escaper

Although the behavior breaks 2 test suites, it doesn't break users' codebases. Judging from a search on GitHub, a lot of people don't use any option.
https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen in htmlspecialchars.
https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation when they are combined with the JSON_UNESCAPED_UNICODE option and with json_decode.

[two options]
json_encode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE
json_decode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE

If JSON_NOTUTF8_SUBSTITUTE is the default behavior, the only option we need to consider is JSON_NOTUTF8_IGNORE.

[one option]
json_encode
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE
json_decode
  JSON_NOTUTF8_IGNORE
----
[2013-07-10 13:48:35] r...@php.net

Here is a proposal for this issue
https://github.com/remicollet/pecl-json-c/commit/5a499a4550d1f29f1f8eeb1b4ca0b01a33c64779

This adds 2 new options to json_encode:
- JSON_NOTUTF8_SUBSTITUTE (the name seems better, at least to me), to replace non-UTF-8 characters with the replacement character.
- JSON_NOTUTF8_IGNORE, to ignore non-UTF-8 characters (removed in escaped mode, kept without any check in unescaped mode)

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082

--
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1
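The substitute and ignore behaviours discussed in this thread can be tried on a current runtime. The JSON_NOTUTF8_* names are proposals from this report and are not assumed to exist; the sketch below instead assumes PHP 7.2 or newer, where json_encode accepts the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE constants.

```php
<?php
// Assumption: PHP >= 7.2, which provides JSON_INVALID_UTF8_SUBSTITUTE and
// JSON_INVALID_UTF8_IGNORE. The JSON_NOTUTF8_* names from the thread are not used.
$bad = "a\x80"; // 0x80 is a stray continuation byte, so this is not valid UTF-8

// Without any option the encoder refuses the input.
$strict = json_encode($bad);

// Replace the ill-formed byte with U+FFFD, escaped as \ufffd.
$substitute = json_encode($bad, JSON_INVALID_UTF8_SUBSTITUTE);

// Same replacement, but written as raw UTF-8 bytes instead of an escape.
$unescaped = json_encode($bad, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE);

// Drop the ill-formed byte entirely.
$ignore = json_encode($bad, JSON_INVALID_UTF8_IGNORE);

var_dump($strict, $substitute, $unescaped, $ignore);
```

The four calls correspond to the rows of the [two options] list, which is why combining the flags with JSON_UNESCAPED_UNICODE needs separate consideration.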
[PHP-BUG] Req #65257 [NEW]: new function for preventing XSS attack
From: masakielastic at gmail dot com
Operating system:
PHP version: 5.5.0
Package: JSON related
Bug Type: Feature/Change Request
Bug description: new function for preventing XSS attack

Description:

Although the JSON_HEX_TAG, JSON_HEX_APOS, JSON_HEX_QUOT and JSON_HEX_AMP options were added in PHP 5.3 to help prevent XSS attacks, a lot of people don't specify them.
https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code

One of PHP's goals is to provide a secure way to create web applications without CMSes and frameworks. One measure against this problem is to provide a new function that makes these options the default. Adding other recommended options as defaults also makes sense.

function json_secure_encode($value, $options = 0, $depth = 512)
{
    // JSON_NOTUTF8_SUBSTITUTE
    // a proposed option that replaces ill-formed byte sequences with substitute characters
    // https://bugs.php.net/bug.php?id=65082
    $options |= JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP | JSON_NOTUTF8_SUBSTITUTE;

    // Delegate to the built-in encoder with the hardened option set.
    return json_encode($value, $options, $depth);
}

A shortcut for these options may also be a little helpful.

if (!defined('JSON_QUOTES')) {
    define('JSON_QUOTES', JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP | JSON_HEX_QUOT);
}

The following RFC proposes various functions that need fewer options.

Escaping RFC for PHP Core
https://wiki.php.net/rfc/escaper

Ruby on Rails provides json_escape via ERB::Util.
http://api.rubyonrails.org/classes/ERB/Util.html

OWASP provides guidelines for preventing XSS attacks.

RULE #3.1 - HTML escape JSON values in an HTML context and read the data with JSON.parse
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.233.1_-_HTML_escape_JSON_values_in_an_HTML_context_and_read_the_data_with_JSON.parse

As a side note, the default HTTP headers of Rails include "X-Content-Type-Options: nosniff" for IE.
http://edgeguides.rubyonrails.org/security.html#default-headers
https://github.com/rails/docrails/blob/master/actionpack/lib/action_dispatch/railtie.rb#L20=L24

The following articles describe JSON-based XSS exploitation.
http://blog.watchfire.com/wfblog/2011/10/json-based-xss-exploitation.html
https://superevr.com/blog/2012/exploiting-xss-in-ajax-web-applications

--
Edit bug report at https://bugs.php.net/bug.php?id=65257&edit=1
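To illustrate what the four JSON_HEX_* options in this request buy, here is a minimal sketch that uses only json_encode and its existing constants. The wrapper name `json_encode_for_html` is hypothetical, and the proposed JSON_NOTUTF8_SUBSTITUTE flag is left out because it does not exist in released PHP.

```php
<?php
// Hypothetical wrapper name; json_encode() and the JSON_HEX_* constants are
// PHP built-ins (available since PHP 5.3).
function json_encode_for_html($value)
{
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP
    );
}

$payload = array(
    'name' => '</script><script>alert("x")</script>',
    'note' => "Tom & 'Jerry'",
);

// Inside string values, <, >, &, ' and " are emitted as \uXXXX escapes, so the
// output cannot close a surrounding <script> element or an HTML attribute.
$json = json_encode_for_html($payload);
echo $json, "\n";
```

The escapes are ordinary JSON, so json_decode() or JSON.parse recovers the original values unchanged.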
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: I created new feature request for preveting XSS attack and I withdraw my option about the change of default behavior. new function for preventing XSS attack https://bugs.php.net/bug.php?id=65257 Previous Comments: [2013-07-12 18:19:09] masakielastic at gmail dot com I posted a patch for handling surrogate pairs since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629). Someone's help is needed for handling high surrogate pairs and the options. https://gist.github.com/masakielastic/5985383 json_decode produces invalid byte-sequences https://bugs.php.net/bug.php?id=62010 [2013-07-11 09:48:54] masakielastic at gmail dot com Hi, I fixed my patch and added test case for json_decode. [2013-07-11 08:37:51] masakielastic at gmail dot com Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. https://gist.github.com/masakielastic/5973095 [2013-07-11 04:59:02] r...@php.net I don't think changing the current behavior is a good idea, the reason why I really prefer some new options. [2013-07-11 04:27:19] masakielastic at gmail dot com Hi, thanks nikic and remi. After several considering, I changed my mind. I think the behavior of substituting U+FFFD for ill-formed sequences should be default. How do you think? We might need the discussion about the consitency for Escaper API. htmlspecialchars's ENT_SUBSTITUTE option is adopted by Symfony and Zend Framework. 
https://wiki.php.net/rfc/escaper Although the behavior breaks 2 test suites, it don't break user's codebases. A lot of people don't use any option looking in github. https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code The same problem can be seen in htmlspecialchars. https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code New options complicate the situation when using JSON_UNESCAPED_UNICODE option and json_decode. [two option] json_encode JSON_NOTUTF8_SUBSTITUTE JSON_NOTUTF8_IGNORE JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE json_decode JSON_NOTUTF8_SUBSTITUTE JSON_NOTUTF8_IGNORE If JSON_NOTUTF8_SUBSTITUTE is default behavior, the problem we need to consider is only JSON_NOTUTF8_IGNORE option. [one option] json_encode JSON_NOTUTF8_IGNORE JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE json_decode JSON_NOTUTF8_IGNORE The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082 -- Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: Hi, nikic, I posted a document request for the mission option and error codes. https://bugs.php.net/bug.php?id=65259 Your opinion about the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE is needed. Previous Comments: [2013-07-14 08:28:53] masakielastic at gmail dot com I created new feature request for preveting XSS attack and I withdraw my option about the change of default behavior. new function for preventing XSS attack https://bugs.php.net/bug.php?id=65257 [2013-07-12 18:19:09] masakielastic at gmail dot com I posted a patch for handling surrogate pairs since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629). Someone's help is needed for handling high surrogate pairs and the options. https://gist.github.com/masakielastic/5985383 json_decode produces invalid byte-sequences https://bugs.php.net/bug.php?id=62010 [2013-07-11 09:48:54] masakielastic at gmail dot com Hi, I fixed my patch and added test case for json_decode. [2013-07-11 08:37:51] masakielastic at gmail dot com Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. https://gist.github.com/masakielastic/5973095 [2013-07-11 04:59:02] r...@php.net I don't think changing the current behavior is a good idea, the reason why I really prefer some new options. The remainder of the comments for this report are too long. 
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: I nominate other names from the view of consistency with JSON_ERROR_UTF8. JSON_UTF8_SUBSTITUTE JSON_UTF8_IGNORE Previous Comments: [2013-07-14 08:44:02] masakielastic at gmail dot com Hi, nikic, I posted a document request for the mission option and error codes. https://bugs.php.net/bug.php?id=65259 Your opinion about the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE is needed. [2013-07-14 08:28:53] masakielastic at gmail dot com I created new feature request for preveting XSS attack and I withdraw my option about the change of default behavior. new function for preventing XSS attack https://bugs.php.net/bug.php?id=65257 [2013-07-12 18:19:09] masakielastic at gmail dot com I posted a patch for handling surrogate pairs since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629). Someone's help is needed for handling high surrogate pairs and the options. https://gist.github.com/masakielastic/5985383 json_decode produces invalid byte-sequences https://bugs.php.net/bug.php?id=62010 [2013-07-11 09:48:54] masakielastic at gmail dot com Hi, I fixed my patch and added test case for json_decode. [2013-07-11 08:37:51] masakielastic at gmail dot com Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. https://gist.github.com/masakielastic/5973095 The remainder of the comments for this report are too long. 
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: Hi, nikic, sorry, ignore my last comment. I added small change in json.c https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch Previous Comments: [2013-07-14 08:48:01] masakielastic at gmail dot com I nominate other names from the view of consistency with JSON_ERROR_UTF8. JSON_UTF8_SUBSTITUTE JSON_UTF8_IGNORE [2013-07-14 08:44:02] masakielastic at gmail dot com Hi, nikic, I posted a document request for the mission option and error codes. https://bugs.php.net/bug.php?id=65259 Your opinion about the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE is needed. [2013-07-14 08:28:53] masakielastic at gmail dot com I created new feature request for preveting XSS attack and I withdraw my option about the change of default behavior. new function for preventing XSS attack https://bugs.php.net/bug.php?id=65257 [2013-07-12 18:19:09] masakielastic at gmail dot com I posted a patch for handling surrogate pairs since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629). Someone's help is needed for handling high surrogate pairs and the options. https://gist.github.com/masakielastic/5985383 json_decode produces invalid byte-sequences https://bugs.php.net/bug.php?id=62010 [2013-07-11 09:48:54] masakielastic at gmail dot com Hi, I fixed my patch and added test case for json_decode. The remainder of the comments for this report are too long. 
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual like htmlspecialchars's ENT_IGNORE http://www.php.net/manual/en/function.htmlspecialchars.php That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's link as resource. UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters IDS11-J. Eliminate noncharacter code points before validation https://www.securecoding.cert.org/confluence/display/java/IDS11- J.+Eliminate+noncharacter+code+points+before+validation Previous Comments: ---- [2013-07-14 12:31:29] masakielastic at gmail dot com Hi, nikic, sorry, ignore my last comment. I added small change in json.c https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch ---- [2013-07-14 08:48:01] masakielastic at gmail dot com I nominate other names from the view of consistency with JSON_ERROR_UTF8. JSON_UTF8_SUBSTITUTE JSON_UTF8_IGNORE ---- [2013-07-14 08:44:02] masakielastic at gmail dot com Hi, nikic, I posted a document request for the mission option and error codes. https://bugs.php.net/bug.php?id=65259 Your opinion about the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE is needed. ---- [2013-07-14 08:28:53] masakielastic at gmail dot com I created new feature request for preveting XSS attack and I withdraw my option about the change of default behavior. 
new function for preventing XSS attack https://bugs.php.net/bug.php?id=65257 ---- [2013-07-12 18:19:09] masakielastic at gmail dot com I posted a patch for handling surrogate pairs since the range (U+D800 - U+DFFF) is not allowed in UTF-8 (RFC 3629). Someone's help is needed for handling high surrogate pairs and the options. https://gist.github.com/masakielastic/5985383 json_decode produces invalid byte-sequences https://bugs.php.net/bug.php?id=62010 The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082 -- Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: I agree with you on isolated surrogate pairs. The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE must be contained since json_decode uses json_utf8_to_utf16. https://github.com/php/php-src/blob/master/ext/json/json.c#L673 I already posted the test cases. https://gist.github.com/masakielastic/5973095#file-04-test-php-L26 "a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE), "a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE) The one way of perfomance improvement is adding json_utf8_to_utf32. I posted another patch. https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode- patch I created unsigned int *utf32 data type for not changing unsigned short *utf16 data type. If you want to provide a common variable for json_utf8_to_utf16 and json_utf8_to_utf32, the modification for JSON_parser.c is also needed. The one of candidate for the name of variable is unsigned int *code_codes. http://www.unicode.org/glossary/#code_unit I also updated the previous patch. 
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode- patch if (options & PHP_JSON_UNESCAPED_UNICODE) { +if (us < 0x20) { +smart_str_appendl(buf, "\\u", 2); +smart_str_appendc(buf, digits[(us >> 12) & 0xf]); +smart_str_appendc(buf, digits[(us >> 8) & 0xf]); +smart_str_appendc(buf, digits[(us >> 4) & 0xf]); +smart_str_appendc(buf, digits[(us & 0xf)]); +} else if (us < 0x80) { Previous Comments: [2013-07-15 07:31:49] r...@php.net > Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? > The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch. Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler. Notice: this bug is only for json_encode. Other issue have their own bug for tracking (especially the json_decode one, as I dont plan to alter it) [2013-07-14 12:45:47] masakielastic at gmail dot com As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual like htmlspecialchars's ENT_IGNORE http://www.php.net/manual/en/function.htmlspecialchars.php That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's link as resource. UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters IDS11-J. Eliminate noncharacter code points before validation https://www.securecoding.cert.org/confluence/display/java/IDS11- J.+Eliminate+noncharacter+code+points+before+validation [2013-07-14 12:31:29] masakielastic at gmail dot com Hi, nikic, sorry, ignore my last comment. I added small change in json.c https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch [2013-07-14 08:48:01] masakielastic at gmail dot com I nominate other names from the view of consistency with JSON_ERROR_UTF8. 
JSON_UTF8_SUBSTITUTE JSON_UTF8_IGNORE [2013-07-14 08:44:02] masakielastic at gmail dot com Hi, nikic, I posted a document request for the mission option and error codes. https://bugs.php.net/bug.php?id=65259 Your opinion about the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE is needed. The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082 -- Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1
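The patch fragment quoted in this message adds a branch that writes code units below 0x20 as \uXXXX escapes even when PHP_JSON_UNESCAPED_UNICODE is set. A rough PHP-level sketch of that one branch follows; the function name `escape_control_units` is hypothetical, and the sketch does not reproduce the short escapes (such as \n or \t) that json_encode itself prefers for some control characters.

```php
<?php
// Hypothetical illustration of the "us < 0x20" branch from the quoted patch.
// It is not the C implementation; it only mirrors the nibble-by-nibble lookup.
function escape_control_units($str)
{
    $digits = '0123456789abcdef';
    $out = '';
    $len = strlen($str);

    for ($i = 0; $i < $len; $i++) {
        $us = ord($str[$i]);
        if ($us < 0x20) {
            // Emit \u followed by four hex digits, highest nibble first.
            $out .= '\\u'
                . $digits[($us >> 12) & 0xf]
                . $digits[($us >> 8) & 0xf]
                . $digits[($us >> 4) & 0xf]
                . $digits[$us & 0xf];
        } else {
            // Every other byte is passed through untouched in this sketch.
            $out .= $str[$i];
        }
    }

    return $out;
}

echo escape_control_units("a\x01b"), "\n";
```

Because the input values are below 0x20, the two highest nibbles are always zero; the shifts are kept only to match the shape of the patch.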
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: Another way of perfomance improvemnet is using php_next_utf8_char directly in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE and PHP_JSON_NOTUTF8_IGNORE. This way reduces one loop compared with using json_utf8_to_utf16. Previous Comments: [2013-07-19 16:33:24] masakielastic at gmail dot com I agree with you on isolated surrogate pairs. The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE must be contained since json_decode uses json_utf8_to_utf16. https://github.com/php/php-src/blob/master/ext/json/json.c#L673 I already posted the test cases. https://gist.github.com/masakielastic/5973095#file-04-test-php-L26 "a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE), "a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE) The one way of perfomance improvement is adding json_utf8_to_utf32. I posted another patch. https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode- patch I created unsigned int *utf32 data type for not changing unsigned short *utf16 data type. If you want to provide a common variable for json_utf8_to_utf16 and json_utf8_to_utf32, the modification for JSON_parser.c is also needed. The one of candidate for the name of variable is unsigned int *code_codes. http://www.unicode.org/glossary/#code_unit I also updated the previous patch. 
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode- patch if (options & PHP_JSON_UNESCAPED_UNICODE) { +if (us < 0x20) { +smart_str_appendl(buf, "\\u", 2); +smart_str_appendc(buf, digits[(us >> 12) & 0xf]); +smart_str_appendc(buf, digits[(us >> 8) & 0xf]); +smart_str_appendc(buf, digits[(us >> 4) & 0xf]); +smart_str_appendc(buf, digits[(us & 0xf)]); +} else if (us < 0x80) { [2013-07-15 07:31:49] r...@php.net > Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? > The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch. Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler. Notice: this bug is only for json_encode. Other issue have their own bug for tracking (especially the json_decode one, as I dont plan to alter it) [2013-07-14 12:45:47] masakielastic at gmail dot com As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual like htmlspecialchars's ENT_IGNORE http://www.php.net/manual/en/function.htmlspecialchars.php That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's link as resource. UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters IDS11-J. Eliminate noncharacter code points before validation https://www.securecoding.cert.org/confluence/display/java/IDS11- J.+Eliminate+noncharacter+code+points+before+validation [2013-07-14 12:31:29] masakielastic at gmail dot com Hi, nikic, sorry, ignore my last comment. I added small change in json.c https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch [2013-07-14 08:48:01] masakielastic at gmail dot com I nominate other names from the view of consistency with JSON_ERROR_UTF8. 
JSON_UTF8_SUBSTITUTE JSON_UTF8_IGNORE The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082 -- Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1
Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1 ID: 65082 User updated by:masakielastic at gmail dot com Reported by:masakielastic at gmail dot com Summary:json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package:JSON related Operating System: All PHP Version:5.5.0 Assigned To:remi Block user comment: N Private report: N New Comment: I created a repo for the patches and the report of benchmarks https://github.com/masakielastic/patches/tree/master/php_bugs_65082 The difference between json_utf8_to_utf16 and json_utf8_to_utf32 isn't seen. the use of json_utf8_to_utf32 or the direct use of php_next_utf8_char in json_escape_string is better choice for JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE. php_next_utf8_char in json_escape_string is a bit faster than json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE. https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_ utf8_char_in_json_escape_string.patch https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_ utf8_char_in_json_escape_string.c Previous Comments: [2013-07-19 16:46:49] masakielastic at gmail dot com Another way of perfomance improvemnet is using php_next_utf8_char directly in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE and PHP_JSON_NOTUTF8_IGNORE. This way reduces one loop compared with using json_utf8_to_utf16. [2013-07-19 16:33:24] masakielastic at gmail dot com I agree with you on isolated surrogate pairs. The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE must be contained since json_decode uses json_utf8_to_utf16. https://github.com/php/php-src/blob/master/ext/json/json.c#L673 I already posted the test cases. 
https://gist.github.com/masakielastic/5973095#file-04-test-php-L26 "a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE), "a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE) The one way of perfomance improvement is adding json_utf8_to_utf32. I posted another patch. https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode- patch I created unsigned int *utf32 data type for not changing unsigned short *utf16 data type. If you want to provide a common variable for json_utf8_to_utf16 and json_utf8_to_utf32, the modification for JSON_parser.c is also needed. The one of candidate for the name of variable is unsigned int *code_codes. http://www.unicode.org/glossary/#code_unit I also updated the previous patch. https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode- patch if (options & PHP_JSON_UNESCAPED_UNICODE) { +if (us < 0x20) { +smart_str_appendl(buf, "\\u", 2); +smart_str_appendc(buf, digits[(us >> 12) & 0xf]); +smart_str_appendc(buf, digits[(us >> 8) & 0xf]); +smart_str_appendc(buf, digits[(us >> 4) & 0xf]); +smart_str_appendc(buf, digits[(us & 0xf)]); +} else if (us < 0x80) { [2013-07-15 07:31:49] r...@php.net > Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? > The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch. Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler. Notice: this bug is only for json_encode. 
Other issue have their own bug for tracking (especially the json_decode one, as I dont plan to alter it) [2013-07-14 12:45:47] masakielastic at gmail dot com As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual like htmlspecialchars's ENT_IGNORE http://www.php.net/manual/en/function.htmlspecialchars.php That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's link as resource. UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters IDS11-J. Eliminate noncharacter code points before validation https://www.securecoding.cert.org/confluence/display/java/IDS11- J.+Eliminate+noncharacter+code+points+before+validation
[PHP-BUG] Req #65323 [NEW]: improvement for counting ill-formed byte sequences
From:             masakielastic at gmail dot com
Operating system:
PHP version:      5.5.1
Package:          Strings related
Bug Type:         Feature/Change Request
Bug description:  improvement for counting ill-formed byte sequences

Description:
------------
Consider the number of substitute characters (U+FFFD) emitted when the
valid range of the second byte of a UTF-8 sequence is narrower than
0x80 - 0xBF (such as 0xA0 - 0xBF after 0xE0).

// Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
// U+0800 - U+0FFF      E0          A0 - BF      80 - BF
// U+D000 - U+D7FF      ED          80 - 9F      80 - BF
// U+10000 - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF  F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended policy described in "Table 3-8. Use of
U+FFFD in UTF-8 Conversion" of The Unicode Standard, "\xE0\x80" should
be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD", because 0x80 is not a
valid second byte after 0xE0, so each of the two bytes is ill-formed on
its own. The actual result is "\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte against
the range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)
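For illustration only, here is a userland sketch of the substitution policy the report asks for. It is not the linked C patch and not how htmlspecialchars() is implemented; the function name scrub_utf8_sketch is hypothetical. It assumes the policy means "replace each maximal subpart of an ill-formed subsequence with one U+FFFD", and it picks the allowed second-byte range from the first byte, as in the table above.

```php
<?php
// Hypothetical helper, not part of PHP: replaces every maximal subpart of an
// ill-formed UTF-8 subsequence with U+FFFD ("\xEF\xBF\xBD").
function scrub_utf8_sketch($s)
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                 // ASCII passes through unchanged
            $out .= $s[$i++];
            continue;
        }
        // Number of trailing bytes and allowed range of the SECOND byte,
        // both determined by the first byte.
        if ($b >= 0xC2 && $b <= 0xDF)     { $need = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xE0)              { $need = 2; $lo = 0xA0; $hi = 0xBF; }
        elseif ($b === 0xED)              { $need = 2; $lo = 0x80; $hi = 0x9F; }
        elseif ($b >= 0xE1 && $b <= 0xEF) { $need = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF0)              { $need = 3; $lo = 0x90; $hi = 0xBF; }
        elseif ($b === 0xF4)              { $need = 3; $lo = 0x80; $hi = 0x8F; }
        elseif ($b >= 0xF1 && $b <= 0xF3) { $need = 3; $lo = 0x80; $hi = 0xBF; }
        else {                            // 80..C1 and F5..FF never start a sequence
            $out .= "\xEF\xBF\xBD";
            $i++;
            continue;
        }
        // Consume trailing bytes for as long as they stay in range.
        $j = $i + 1;
        $got = 0;
        while ($got < $need && $j < $len) {
            $c = ord($s[$j]);
            if ($c < $lo || $c > $hi) {
                break;
            }
            $j++;
            $got++;
            $lo = 0x80;                   // third and fourth bytes use the full range
            $hi = 0xBF;
        }
        // Complete sequence: copy it. Otherwise the consumed bytes form one
        // maximal subpart and become a single U+FFFD.
        $out .= ($got === $need) ? substr($s, $i, $j - $i) : "\xEF\xBF\xBD";
        $i = $j;
    }
    return $out;
}

var_dump(scrub_utf8_sketch("\xE0\x80") === "\xEF\xBF\xBD\xEF\xBF\xBD");                 // bool(true)
var_dump(scrub_utf8_sketch("\xE0\x80\x80") === "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD"); // bool(true)
```

With "\xE0\x80", the 0x80 fails the A0 - BF check, so 0xE0 alone is replaced and 0x80 is then treated as a stray byte, which gives the two U+FFFD the report expects.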
Req #65323 [Opn]: improvement for counting ill-formed byte sequences
 ID:                 65323
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            improvement for counting ill-formed byte sequences
 Status:             Open
 Type:               Feature/Change Request
 Package:            Strings related
 PHP Version:        5.5.1
 Block user comment: N
 Private report:     N

 New Comment:

"Table 3-8. Use of U+FFFD in UTF-8 Conversion" of The Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

Previous Comments:
------------------------------------------------------------------------
[2013-07-24 10:59:34] masakielastic at gmail dot com

Description:
------------
Consider the number of substitute characters (U+FFFD) emitted when the
valid range of the second byte of a UTF-8 sequence is narrower than
0x80 - 0xBF (such as 0xA0 - 0xBF after 0xE0).

// Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
// U+0800 - U+0FFF      E0          A0 - BF      80 - BF
// U+D000 - U+D7FF      ED          80 - 9F      80 - BF
// U+10000 - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF  F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended policy described in "Table 3-8. Use of
U+FFFD in UTF-8 Conversion" of The Unicode Standard, "\xE0\x80" should
be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD". The actual result is
"\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte against
the range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)
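To make the counting concrete, the sketch below counts how many U+FFFD characters the Table 3-8 policy would produce for a byte string. It is only an illustration: count_ill_formed_sketch is a hypothetical name, and a byte-mode PCRE pattern stands in for the macro proposed in the report. The first alternative matches well-formed sequences; the named group "bad" matches a truncated prefix of a multi-byte sequence or, failing that, a single stray byte.

```php
<?php
// Hypothetical helper, not part of PHP: counts the U+FFFD characters that
// maximal-subpart substitution would emit for $s.
function count_ill_formed_sketch($s)
{
    $pattern = '~
        (?: [\x00-\x7F]
          | [\xC2-\xDF][\x80-\xBF]
          | \xE0[\xA0-\xBF][\x80-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
          | \xED[\x80-\x9F][\x80-\xBF]
          | \xF0[\x90-\xBF][\x80-\xBF]{2}
          | [\xF1-\xF3][\x80-\xBF]{3}
          | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )
      | (?<bad>
            \xE0[\xA0-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]
          | \xED[\x80-\x9F]
          | \xF0[\x90-\xBF][\x80-\xBF]?
          | [\xF1-\xF3][\x80-\xBF]{1,2}
          | \xF4[\x80-\x8F][\x80-\xBF]?
          | .
        )
    ~xs';

    // No "u" modifier: the subject is matched byte by byte.
    preg_match_all($pattern, $s, $m);

    // Matches of the well-formed alternative leave "bad" as an empty string.
    $bad = array_filter($m['bad'], function ($part) {
        return $part !== '';
    });
    return count($bad);
}

var_dump(count_ill_formed_sketch("\xE0\x80"));     // int(2)
var_dump(count_ill_formed_sketch("\xE0\x80\x80")); // int(3)

// Byte sequence used as the example in Table 3-8 of The Unicode Standard:
// 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64
var_dump(count_ill_formed_sketch("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64")); // int(6)
```

The pattern has no outer repetition; preg_match_all() advances one sequence at a time, and the final "." alternative guarantees a match at every byte, so nothing is skipped.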