Bug #55465 [Com]: preg_match segmentation fault when subject too large

2013-06-01 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=55465&edit=1

 ID:                 55465
 Comment by:         masakielastic at gmail dot com
 Reported by:        zedwoodnoreply at zedwood dot com
 Summary:            preg_match segmentation fault when subject too large
 Status:             Not a bug
 Type:               Bug
 Package:            PCRE related
 Operating System:   Ubuntu 10.04
 PHP Version:        5.3.7
 Block user comment: N
 Private report:     N

 New Comment:

This report is a duplicate. See https://bugs.php.net/bug.php?id=36463


Previous Comments:

[2011-08-19 22:42:28] fel...@php.net

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is a known behavior from PCRE library, it's not a PHP bug.

http://docs.php.net/manual/en/pcre.configuration.php
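
For a plain yes/no check of UTF-8 well-formedness, a pattern without a repeated alternation group sidesteps the deep PCRE recursion altogether. The following is a minimal sketch, not the fix discussed in this report; the helper name is_well_formed_utf8 is hypothetical. Note that it only checks well-formedness: unlike the pattern in the test script, it does not reject ASCII control characters.

```php
<?php
// Hypothetical helper (not from the report): yes/no UTF-8 validity check.
function is_well_formed_utf8($subject)
{
    // With the /u modifier PCRE validates the whole subject as UTF-8 before
    // matching, so an empty pattern is enough. preg_match() returns false
    // and preg_last_error() reports PREG_BAD_UTF8_ERROR on ill-formed input.
    return 1 === preg_match('//u', $subject);
}

var_dump(is_well_formed_utf8(str_repeat('a', 1000000))); // large subject
var_dump(is_well_formed_utf8("\xF0\xA4\xAD"));           // truncated 4-byte sequence
```

mb_check_encoding($subject, 'UTF-8') from the mbstring extension is another option when that extension is available.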


[2011-08-19 21:10:27] zedwoodnoreply at zedwood dot com

Description:

When I change $n_times to 8, and run the command line script php -f 
myscript.php, I get "Segmentation Fault".  The error also occurs when run via 
apache: [Fri Aug 19 15:05:14 2011] [notice] child pid 11995 exit signal 
Segmentation fault (11)

If you change $n_times to be sufficiently large, preg_match seems to 
consistently seg fault.

If I change $n_times to  something lower like 1000, there is no seg fault.

Test script:
---
// http://w3.org/International/questions/qa-forms-utf-8.html
echo preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*$%xs', $string) ? 'y' : 'n';
die("\n");


Expected result:

'y' or 'n'

Actual result:
--
command line:
Segmentation Fault

via apache error.log
[Fri Aug 19 15:05:14 2011] [notice] child pid 11995 exit signal Segmentation 
fault (11)







-- 
Edit this bug report at https://bugs.php.net/bug.php?id=55465&edit=1


[PHP-BUG] Bug #65045 [NEW]: mb_convert_encoding breaks well-formed character

2013-06-16 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: Mac OSX
PHP version:  5.5.0RC3
Package:  mbstring related
Bug Type: Bug
Bug description:  mb_convert_encoding breaks well-formed character

Description:

When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in
order to replace ill-formed byte sequences with the substitute character
(U+FFFD), mb_convert_encoding also replaces the character following an
ill-formed byte sequence with the substitute character. In addition,
mb_convert_encoding deletes a trailing ill-formed byte sequence instead of
replacing it with the substitute character.

A comprehensive test case for 2-4 byte characters is here:
https://gist.github.com/masakielastic/5793665 .

Test script:
---
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);
var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')),
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:

bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--
bool(true)
bool(true)
bool(false)
bool(false)
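
To see which bytes differ rather than only a boolean, the outputs can be dumped as hex. This is a small diagnostic sketch; hex_bytes is a hypothetical helper, and the output of mb_convert_encoding is deliberately not annotated because it depends on the PHP version in use.

```php
<?php
// Hypothetical helper: render a byte string as space-separated hex pairs.
function hex_bytes($bytes)
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// Ill-formed prefix (truncated U+24B62) followed by a well-formed U+24B62.
$str = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2";

mb_substitute_character(0xFFFD);
echo 'input:               ', hex_bytes($str), "\n";
echo 'htmlspecialchars:    ',
    hex_bytes(htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'))), "\n";
echo 'mb_convert_encoding: ',
    hex_bytes(mb_convert_encoding($str, 'UTF-8', 'UTF-8')), "\n";
```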

-- 
Edit bug report at https://bugs.php.net/bug.php?id=65045&edit=1
-- 
Try a snapshot (PHP 5.4):   
https://bugs.php.net/fix.php?id=65045&r=trysnapshot54
Try a snapshot (PHP 5.3):   
https://bugs.php.net/fix.php?id=65045&r=trysnapshot53
Try a snapshot (trunk): 
https://bugs.php.net/fix.php?id=65045&r=trysnapshottrunk
Fixed in SVN:   https://bugs.php.net/fix.php?id=65045&r=fixed
Fixed in release:   https://bugs.php.net/fix.php?id=65045&r=alreadyfixed
Need backtrace: https://bugs.php.net/fix.php?id=65045&r=needtrace
Need Reproduce Script:  https://bugs.php.net/fix.php?id=65045&r=needscript
Try newer version:  https://bugs.php.net/fix.php?id=65045&r=oldversion
Not developer issue:https://bugs.php.net/fix.php?id=65045&r=support
Expected behavior:  https://bugs.php.net/fix.php?id=65045&r=notwrong
Not enough info:
https://bugs.php.net/fix.php?id=65045&r=notenoughinfo
Submitted twice:
https://bugs.php.net/fix.php?id=65045&r=submittedtwice
register_globals:   https://bugs.php.net/fix.php?id=65045&r=globals
PHP 4 support discontinued: https://bugs.php.net/fix.php?id=65045&r=php4
Daylight Savings:   https://bugs.php.net/fix.php?id=65045&r=dst
IIS Stability:  https://bugs.php.net/fix.php?id=65045&r=isapi
Install GNU Sed:https://bugs.php.net/fix.php?id=65045&r=gnused
Floating point limitations: https://bugs.php.net/fix.php?id=65045&r=float
No Zend Extensions: https://bugs.php.net/fix.php?id=65045&r=nozend
MySQL Configuration Error:  https://bugs.php.net/fix.php?id=65045&r=mysqlcfg



[PHP-BUG] Req #65079 [NEW]: mb_ereg_replace's e modifier should be deprecated

2013-06-20 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: Any
PHP version:  5.5.0
Package:  mbstring related
Bug Type: Feature/Change Request
Bug description:  mb_ereg_replace's e modifier should be deprecated

Description:

mb_ereg_replace's e modifier should be deprecated to prevent PHP code
execution, and an explanation of how to use mb_ereg_replace_callback
(available since PHP 5.4.1) should be added to the manual.

PHP: code execution via mb_ereg_replace
http://vigilance.fr/vulnerability/PHP-code-execution-via-mb-ereg-replace-8711

The reasons why preg_replace's e modifier was deprecated in PHP 5.5 apply
equally to mb_ereg_replace's e modifier.

http://www.php.net/manual/en/function.preg-replace.php
https://wiki.php.net/rfc/remove_preg_replace_eval_modifier

There is an example implementation of mb_ereg_replace_callback as a user
function:

http://d.hatena.ne.jp/hnw/20110206
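
For comparison, here is a minimal sketch of the callback form using the built-in mb_ereg_replace_callback. The pattern and input string are made up for illustration.

```php
<?php
mb_regex_encoding('UTF-8');

// The callback receives the match array, as with preg_replace_callback();
// no replacement string is ever evaluated as PHP code.
$result = mb_ereg_replace_callback(
    '[a-z]+',
    function ($matches) {
        return strtoupper($matches[0]);
    },
    'abc 123 def'
);

var_dump($result);
```

Because the transformation lives in a closure, input captured by the pattern cannot end up being executed, which is the risk the e modifier carries.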


-- 
Edit bug report at https://bugs.php.net/bug.php?id=65079&edit=1



[PHP-BUG] Bug #65080 [NEW]: ctype_lower detects non-lower characters

2013-06-20 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: Mac OSX
PHP version:  5.5.0
Package:  Strings related
Bug Type: Bug
Bug description:  ctype_lower detects non-lower characters

Description:

ctype_lower reports non-lowercase characters as lowercase when the locale
is set to 'en_US.UTF-8' on Mac OSX 10.8. This cannot be reproduced on
Ubuntu Linux.

In effect, ctype_lower flags Chinese characters and Hangul (the Korean
alphabet), which have no concept of lower and upper case.

Test cases in C, along with a list of the misdetected characters, can be
seen here:
https://gist.github.com/masakielastic/5828106

Judging from Xcode's manual, tests on BSD-compatible OSes are needed.

http://developer.apple.com/library/Mac/documentation/Darwin/Reference/ManPages/man3/islower.3.html

ctype_upper likewise detects non-uppercase characters.

Test script:
---
$expected = [];
$result = [];

for ($i = 0; $i <= 0xFF; ++$i) {

    setlocale(LC_ALL, 'C');
    if (ctype_lower(chr($i))) {
        $expected[] = $i;
    }

    setlocale(LC_ALL, 'en_US.UTF-8');
    if (ctype_lower(chr($i))) {
        $result[] = $i;
    }

}

var_dump(
    [] === array_diff($result, $expected)
);

Expected result:

bool(true)

Actual result:
--
bool(false)
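
Where only ASCII lowercase letters should count, a check that does not consult the locale avoids the platform difference. This is a workaround sketch rather than a fix for ctype_lower itself; the helper name is_ascii_lower is hypothetical.

```php
<?php
// Hypothetical helper: true only for non-empty strings made of ASCII a-z.
// The explicit byte range a-z does not depend on the LC_CTYPE locale.
function is_ascii_lower($text)
{
    return 1 === preg_match('/\A[a-z]+\z/', $text);
}

var_dump(is_ascii_lower('abc'));
var_dump(is_ascii_lower("\xE4"));
```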

-- 
Edit bug report at https://bugs.php.net/bug.php?id=65080&edit=1



[PHP-BUG] Req #65081 [NEW]: new function for replacing ill-formd byte sequences with substitute characters

2013-06-20 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: All
PHP version:  5.5.0
Package:  mbstring related
Bug Type: Feature/Change Request
Bug description:  new function for replacing ill-formd byte sequences with
substitute characters

Description:

A new function for replacing ill-formed byte sequences with substitute
characters is needed. The problem with using mb_convert_encoding for that
purpose is that the function name doesn't express the intent. Specifying
the same encoding twice is verbose and can look like a meaningless
conversion to beginners.

$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

Ruby offers a case study: Ruby 2.1 introduces String#scrub.

http://bugs.ruby-lang.org/issues/6752
https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7770-L7783

Whether the substitute character should be configurable needs to be
discussed.

function mb_scrub($str, $encoding = '', $substitute = '')
{
    if ('' === $encoding) {
        $encoding = mb_internal_encoding();
    }

    if ('' === $substitute) {
        $ret = mb_convert_encoding($str, $encoding, $encoding);
    } else {
        // Temporarily switch the substitute character, then restore it.
        $before_substitute = mb_substitute_character();
        mb_substitute_character($substitute);
        $ret = mb_convert_encoding($str, $encoding, $encoding);
        mb_substitute_character($before_substitute);
    }

    return $ret;
}

This discussion can be applied to UConverter.

function uconverter_scrub($str, $encoding, $opts = null)
{
    // $opts is the optional options array accepted by UConverter::transcode.
    if (null === $opts) {
        return UConverter::transcode($str, $encoding, $encoding);
    } else {
        return UConverter::transcode($str, $encoding, $encoding, $opts);
    }
}

Standard string functions and filter functions may need to be discussed
as well, since htmlspecialchars can be used for that purpose.

function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}
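
As a quick check of the htmlspecialchars-based approach, the sketch below repeats str_scrub and applies it to a truncated 4-byte sequence followed by ASCII text. The sample inputs are made up for illustration.

```php
<?php
function str_scrub($str, $encoding = 'UTF-8')
{
    // Encode with ENT_SUBSTITUTE (ill-formed bytes become U+FFFD),
    // then undo the HTML escaping again.
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}

// Truncated U+24B62 followed by plain ASCII.
$scrubbed = str_scrub("\xF0\xA4\xAD" . 'abc');
var_dump("\xEF\xBF\xBD" . 'abc' === $scrubbed);

// Well-formed input with HTML special characters survives the round trip.
var_dump('a < b & "c"' === str_scrub('a < b & "c"'));
```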


-- 
Edit bug report at https://bugs.php.net/bug.php?id=65081&edit=1



[PHP-BUG] Req #65082 [NEW]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-06-20 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: All
PHP version:  5.5.0
Package:  JSON related
Bug Type: Feature/Change Request
Bug description:  json_encode's option for replacing ill-formd byte sequences
with substitute cha

Description:

json_encode returns false if the string contains ill-formed byte
sequences. The problem is hard to track down because many web applications
don't expect ill-formed byte sequences to occur. One example is Symfony's
JsonResponse class.

https://github.com/symfony/symfony/blob/master/src/Symfony/Component/HttpFoundation/JsonResponse.php#L83

Introducing a json_encode option for replacing ill-formed byte sequences
with substitute characters (such as U+FFFD) would save writing logic like
the following.

function json_encode2($value, $options = 0, $depth = 512)
{
    if (is_scalar($value)) {
        return json_encode($value, $options, $depth);
    }

    // Handles a flat array of strings only; nested values are not scrubbed.
    $value2 = [];

    foreach ($value as $key => $elm) {
        $value2[str_scrub($key)] = str_scrub($elm);
    }

    return json_encode($value2, $options, $depth);
}


// https://bugs.php.net/bug.php?id=65081
function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}

The precedent is htmlspecialchars's ENT_SUBSTITUTE option, which was
introduced in PHP 5.4. Since PHP 5.5, json_encode shares part of its
logic, such as php_next_utf8_char, with htmlspecialchars.

https://github.com/php/php-src/blob/master/ext/json/json.c#L369

Another reason for introducing the option is the existence of the
JsonSerializable interface.

Values returned by a jsonSerialize method that come from private
properties are hard or impossible to access from outside the object.

One candidate name for the option is JSON_SUBSTITUTE, by analogy with
htmlspecialchars's ENT_SUBSTITUTE option.

json_encode($object, JSON_SUBSTITUTE);
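
PHP releases later than the ones discussed in this report (7.2 and newer, to my knowledge) provide flags with this purpose under different names than the one proposed here: JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE. A minimal sketch, assuming such a PHP version; the sample value is made up for illustration.

```php
<?php
// The byte 0xFF can never appear in well-formed UTF-8.
$value = array('name' => "a\xFFb");

// Default behavior: encoding fails and the error code is JSON_ERROR_UTF8.
var_dump(json_encode($value));
var_dump(json_last_error() === JSON_ERROR_UTF8);

// Replace the ill-formed byte with U+FFFD, or drop it entirely.
var_dump(json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE));
var_dump(json_encode($value, JSON_INVALID_UTF8_IGNORE));
```

Since the flags operate inside the encoder, they also cover strings returned from jsonSerialize methods, which a userland wrapper cannot reach.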


-- 
Edit bug report at https://bugs.php.net/bug.php?id=65082&edit=1



Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-10 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID:                 65082
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            json_encode's option for replacing ill-formd byte
                     sequences with substitute cha
 Status:             Assigned
 Type:               Feature/Change Request
 Package:            JSON related
 Operating System:   All
 PHP Version:        5.5.0
 Assigned To:        remi
 Block user comment: N
 Private report:     N

 New Comment:

Hi, thanks nikic and remi.

After some consideration, I changed my mind.
I think substituting U+FFFD for ill-formed sequences
should be the default behavior.

What do you think?

We might need to discuss consistency with the Escaper API.
htmlspecialchars's ENT_SUBSTITUTE option has been adopted
by Symfony and Zend Framework.

https://wiki.php.net/rfc/escaper

Although the change breaks 2 test suites, it doesn't break users' codebases.

Judging from a GitHub search, a lot of people don't use any options.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen in htmlspecialchars.

https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation
when the JSON_UNESCAPED_UNICODE option and json_decode are involved.

[two options]
json_encode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE


If JSON_NOTUTF8_SUBSTITUTE is the default behavior,
the only option we need to consider is JSON_NOTUTF8_IGNORE.

[one option]
json_encode
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_IGNORE


Previous Comments:

[2013-07-10 13:48:35] r...@php.net

Here is a proposal for this issue
https://github.com/remicollet/pecl-json-c/commit/5a499a4550d1f29f1f8eeb1b4ca0b01a33c64779

This adds 2 new options to json_encode

- JSON_NOTUTF8_SUBSTITUTE (name seems better, at least to me), to replace 
not-utf8 char with the replacement char.

- JSON_NOTUTF8_IGNORE to ignore not-utf8 char (remove in escaped mode, keep 
without any check in unescaped mode)


[2013-06-21 07:26:33] ni...@php.net

It's currently possible to get a partial output using 
JSON_PARTIAL_OUTPUT_ON_ERROR. This will replace invalid UTF8 strings with NULL 
though. It probably would make sense to have an alternative option that inserts 
the substitution character.

-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-11 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID:                 65082
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            json_encode's option for replacing ill-formd byte
                     sequences with substitute cha
 Status:             Assigned
 Type:               Feature/Change Request
 Package:            JSON related
 Operating System:   All
 PHP Version:        5.5.0
 Assigned To:        remi
 Block user comment: N
 Private report:     N

 New Comment:

Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095


Previous Comments:

[2013-07-11 04:59:02] r...@php.net

I don't think changing the current behavior is a good idea, which is why I
really prefer some new options.



Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-11 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID:                 65082
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            json_encode's option for replacing ill-formd byte
                     sequences with substitute cha
 Status:             Assigned
 Type:               Feature/Change Request
 Package:            JSON related
 Operating System:   All
 PHP Version:        5.5.0
 Assigned To:        remi
 Block user comment: N
 Private report:     N

 New Comment:

Hi, I fixed my patch and added test case for json_decode.


Previous Comments:

[2013-07-11 08:37:51] masakielastic at gmail dot com

Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Bug #62010 [Com]: json_decode produces invalid byte-sequences

2013-07-12 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=62010&edit=1

 ID:                 62010
 Comment by:         masakielastic at gmail dot com
 Reported by:        tklingenberg at lastflood dot net
 Summary:            json_decode produces invalid byte-sequences
 Status:             Open
 Type:               Bug
 Package:            JSON related
 Operating System:   Windows
 PHP Version:        5.3.13
 Block user comment: N
 Private report:     N

 New Comment:

Here is RFC 3629's description about UTF-8 definition.

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the 
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.

http://tools.ietf.org/html/rfc3629
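
To make the surrogate ranges concrete, here is a small sketch of the UTF-16 pairing arithmetic. The helper names are hypothetical and are not part of the patch discussed below; the pair D834/DD1E is just a sample.

```php
<?php
// Hypothetical helpers illustrating the UTF-16 surrogate ranges.
function is_high_surrogate($cp)
{
    return $cp >= 0xD800 && $cp <= 0xDBFF;
}

function is_low_surrogate($cp)
{
    return $cp >= 0xDC00 && $cp <= 0xDFFF;
}

// Combine a high/low surrogate pair into the code point it encodes.
function surrogate_pair_to_code_point($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", surrogate_pair_to_code_point(0xD834, 0xDD1E)); // prints U+1D11E
```

A lone \ud834 has no low surrogate to pair with, so no code point can be computed for it; only a complete pair maps to a character that UTF-8 may encode.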

The following patch solves part of the problem:
isolated low surrogates (U+DC00 - U+DFFF) are replaced with U+FFFD.
Handling of high surrogates (U+D800 - U+DBFF) still needs improvement.

https://gist.github.com/masakielastic/5985383

var_dump(
  "\xef\xbf\xbd" === json_decode('"\udc00"'),
  "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
  "\xed\xa0\x80" === json_decode('"\ud800"')
);

The consistency for the following options
(under the discussion) is needed too.

json_encode's option for replacing ill-formd byte sequences 
with substitute characters
https://bugs.php.net/bug.php?id=65082


Previous Comments:

[2013-01-11 09:44:55] votefordevnull at gmail dot com

Successfully reproduced on Linux


[2012-05-11 22:46:34] tklingenberg at lastflood dot net

It looks like #41067 https://bugs.php.net/bug.php?id=41067 was not fully
fixed.


[2012-05-11 22:12:42] tklingenberg at lastflood dot net

Description:

This is a typical case that the JSON *and* UTF-16 specifications warn about:
decoding non-existent UTF-16 code points:

json_decode('"\ud834"')

should give NULL because \ud834 is *invalid*. Instead it starts some party,
gets boozed and offers this as a UTF-8 byte sequence:

1110 1101  1010 0000  1011 0100
1110 xxxx  10xx xxxx  10xx xxxx
     1101 1000  0011 0100
     D8         34

U+D834 is not a valid Unicode character.
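
The bit layout above can be reproduced with a few shifts and masks. This is a minimal sketch; naive_three_byte_utf8() is a hypothetical name, and it deliberately performs no validity check, which is exactly why it produces the ill-formed bytes ED A0 B4 for U+D834.

```php
<?php
// Illustrative helper (hypothetical name): applies the 3-byte UTF-8 template
// 1110xxxx 10xxxxxx 10xxxxxx to a 16-bit value without any validity check.
function naive_three_byte_utf8($cp)
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(naive_three_byte_utf8(0xD834)), "\n"; // eda0b4
```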



Test script:
---
if (NULL !== json_decode('"\ud834"')) {
echo "json_decode is still broken.";
}

Expected result:

NULL because the json is invalid.

Actual result:
--
PHP tries to create UTF-8 out of it and fails, producing an invalid UTF-8
byte sequence.
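
For contrast, a high surrogate followed by a low surrogate is well-formed and maps to a single code point above U+FFFF. The sketch below shows the arithmetic; combine_surrogates() is a hypothetical helper used only for illustration.

```php
<?php
// Hypothetical helper: combines a high and a low surrogate into one code point.
function combine_surrogates($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", combine_surrogates(0xD834, 0xDD1E)); // U+1D11E

// A complete pair decodes to a well-formed 4-byte UTF-8 sequence.
var_dump(json_decode('"\ud834\udd1e"') === "\xF0\x9D\x84\x9E");
```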






-- 
Edit this bug report at https://bugs.php.net/bug.php?id=62010&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-12 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

I posted a patch for handling surrogates,
since the range U+D800 - U+DFFF is not allowed in UTF-8 (RFC 3629).
I need someone's help with handling high surrogates and with the options.

https://gist.github.com/masakielastic/5985383

json_decode produces invalid byte-sequences
https://bugs.php.net/bug.php?id=62010


Previous Comments:

[2013-07-11 09:48:54] masakielastic at gmail dot com

Hi, I fixed my patch and added a test case for json_decode.


[2013-07-11 08:37:51] masakielastic at gmail dot com

Hi remi, could you test my patch for the PHP_JSON_UNESCAPED_UNICODE option?
The patch supports the JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095


[2013-07-11 04:59:02] r...@php.net

I don't think changing the current behavior is a good idea, which is why I
really prefer adding some new options.


[2013-07-11 04:27:19] masakielastic at gmail dot com

Hi, thanks nikic and remi.

After some consideration, I changed my mind.
I think substituting U+FFFD
for ill-formed sequences should be the default behavior.

What do you think?

We might need a discussion about consistency with the Escaper API.
htmlspecialchars's ENT_SUBSTITUTE option has been adopted
by Symfony and Zend Framework.

https://wiki.php.net/rfc/escaper
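
For reference, this is how htmlspecialchars treats an ill-formed byte with and without ENT_SUBSTITUTE and ENT_IGNORE. It is a small sketch; the input string is made up for illustration.

```php
<?php
$illFormed = "a\x80b"; // 0x80 is a lone continuation byte, not valid UTF-8

// Without either flag, ill-formed input yields an empty string.
var_dump(htmlspecialchars($illFormed, ENT_QUOTES, 'UTF-8'));
// ENT_SUBSTITUTE: the bad byte becomes U+FFFD (EF BF BD).
var_dump(bin2hex(htmlspecialchars($illFormed, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')));
// ENT_IGNORE: the bad byte is silently dropped.
var_dump(htmlspecialchars($illFormed, ENT_QUOTES | ENT_IGNORE, 'UTF-8'));
```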

Although this behavior breaks 2 test suites, it doesn't break users' codebases.

Judging from a search on GitHub, a lot of people don't use any options.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen in htmlspecialchars.

https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation
when the JSON_UNESCAPED_UNICODE option and json_decode are involved.

[two options]
json_encode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE


If JSON_NOTUTF8_SUBSTITUTE is the default behavior,
the only option we need to consider is JSON_NOTUTF8_IGNORE.

[one option]
json_encode
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_IGNORE


[2013-07-10 13:48:35] r...@php.net

Here is a proposal for this issue:
https://github.com/remicollet/pecl-json-c/commit/5a499a4550d1f29f1f8eeb1b4ca0b01a33c64779

This adds 2 new options to json_encode:

- JSON_NOTUTF8_SUBSTITUTE (the name seems better, at least to me), to replace
not-utf8 chars with the replacement char.

- JSON_NOTUTF8_IGNORE, to ignore not-utf8 chars (removed in escaped mode, kept
without any check in unescaped mode)
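
The JSON_NOTUTF8_* names above are proposals from this thread. As an assumption for illustration only, the sketch below uses JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE, the constants that PHP 7.2 and later provide for json_encode, which behave much like the escaped-mode behavior described here. It is not the patch under discussion.

```php
<?php
$illFormed = "a\x80b"; // contains one byte that is not valid UTF-8

var_dump(json_encode($illFormed));                               // false
var_dump(json_last_error() === JSON_ERROR_UTF8);                 // true
var_dump(json_encode($illFormed, JSON_INVALID_UTF8_SUBSTITUTE)); // "a\ufffdb"
var_dump(json_encode($illFormed, JSON_INVALID_UTF8_IGNORE));     // "ab"
```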




The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

https://bugs.php.net/bug.php?id=65082


-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


[PHP-BUG] Req #65257 [NEW]: new function for preventing XSS attack

2013-07-13 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: 
PHP version:  5.5.0
Package:  JSON related
Bug Type: Feature/Change Request
Bug description:new function for preventing XSS attack

Description:

Although the JSON_HEX_TAG, JSON_HEX_APOS, JSON_HEX_QUOT and JSON_HEX_AMP options
were added in PHP 5.3 to help prevent XSS attacks,
a lot of people don't specify these options.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code

One of PHP's goals is to provide a secure way to create
web applications without CMSes and frameworks.

One measure against this problem is to provide a new function
that makes these options the default.
Adding other recommended options as defaults also makes sense.

function json_secure_encode($value, $options = 0, $depth = 512)
{
    // JSON_NOTUTF8_SUBSTITUTE:
    // an option replacing ill-formed byte sequences with substitute characters
    // (proposed in https://bugs.php.net/bug.php?id=65082)

    $options |= JSON_HEX_TAG
        | JSON_HEX_APOS | JSON_HEX_QUOT
        | JSON_HEX_AMP | JSON_NOTUTF8_SUBSTITUTE;

    // Delegate to json_encode(); calling json_secure_encode() here would
    // recurse forever.
    return json_encode($value, $options, $depth);
}

A shortcut constant for these options may also help a little.

if (!defined('JSON_QUOTES')) {
    define('JSON_QUOTES', JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP | JSON_HEX_QUOT);
}
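
To show what the four JSON_HEX_* options change, here is a short sketch using only constants that already exist in PHP 5.3 and later. The sample value is made up for illustration.

```php
<?php
$flags = JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP;
$value = '</script><b>&\'"';

// Without the flags, < > & ' are emitted as-is inside the JSON string.
echo json_encode($value), "\n";
// With the flags, < > & ' " are all written as \u00XX escapes.
echo json_encode($value, $flags), "\n";
```

Both outputs decode back to the same value with json_decode; only the second avoids raw markup characters when the JSON is embedded in an HTML page.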

The following RFC proposes several functions that need fewer options.

Escaping RFC for PHP Core
https://wiki.php.net/rfc/escaper

Ruby on Rails provides json_escape via ERB::Util.

http://api.rubyonrails.org/classes/ERB/Util.html

OWASP provides guidelines for preventing XSS attacks.

RULE #3.1 - HTML escape JSON values in an HTML context and read the data with JSON.parse
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.233.1_-_HTML_escape_JSON_values_in_an_HTML_context_and_read_the_data_with_JSON.parse


As a side note, the default HTTP headers of Rails
include "X-Content-Type-Options: nosniff" for IE.

http://edgeguides.rubyonrails.org/security.html#default-headers
https://github.com/rails/docrails/blob/master/actionpack/lib/action_dispatch/railtie.rb#L20=L24

The following articles describe JSON-based XSS exploitation.

http://blog.watchfire.com/wfblog/2011/10/json-based-xss-exploitation.html
https://superevr.com/blog/2012/exploiting-xss-in-ajax-web-applications


-- 
Edit bug report at https://bugs.php.net/bug.php?id=65257&edit=1



Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-14 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

I created a new feature request for preventing XSS attacks, and I withdraw my
suggestion about changing the default behavior.

new function for preventing XSS attack
https://bugs.php.net/bug.php?id=65257



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-14 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

Hi nikic, I posted a documentation request for the missing options and error codes.

https://bugs.php.net/bug.php?id=65259

I would like your opinion on the consistency among
JSON_PARTIAL_OUTPUT_ON_ERROR, JSON_NOTUTF8_SUBSTITUTE
and JSON_NOTUTF8_IGNORE.



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-14 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

I propose alternative names for consistency with JSON_ERROR_UTF8.

JSON_UTF8_SUBSTITUTE
JSON_UTF8_IGNORE



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-14 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

Hi, nikic, sorry, ignore my last comment.

I made a small change to json.c:
https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-14 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

As for JSON_NOTUTF8_IGNORE, the manual needs a security note for it,
like the one for htmlspecialchars's ENT_IGNORE.

http://www.php.net/manual/en/function.htmlspecialchars.php

That's why I didn't suggest JSON_IGNORE in the draft and gave the Escaping RFC's
link as a resource.

UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-J.+Eliminate+noncharacter+code+points+before+validation



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-19 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

I agree with you on isolated surrogates.

Test cases for json_decode with JSON_NOTUTF8_SUBSTITUTE and
JSON_NOTUTF8_IGNORE must be included,
since json_decode uses json_utf8_to_utf16.

https://github.com/php/php-src/blob/master/ext/json/json.c#L673

I already posted the test cases.

https://gist.github.com/masakielastic/5973095#file-04-test-php-L26

"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)

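The two assertions above rely on the proposed JSON_NOTUTF8_* constants, which stock PHP does not define. As an assumption for illustration, the same checks can be sketched with JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE, which json_decode accepts in PHP 7.2 and later. This is not the test file from the gist.

```php
<?php
$json = '"' . "a\x80" . '"'; // a JSON string containing one ill-formed byte

// Without a flag, decoding fails and returns NULL.
var_dump(json_decode($json));
// Substitute: the bad byte is replaced with U+FFFD (EF BF BD).
var_dump(json_decode($json, false, 512, JSON_INVALID_UTF8_SUBSTITUTE) === "a\xEF\xBF\xBD");
// Ignore: the bad byte is dropped.
var_dump(json_decode($json, false, 512, JSON_INVALID_UTF8_IGNORE) === "a");
```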

One way to improve performance is to add json_utf8_to_utf32.
I posted another patch.

https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-patch

I added an unsigned int *utf32 variable
so that the unsigned short *utf16 data type does not need to change.

If you want to provide a common variable
for json_utf8_to_utf16 and json_utf8_to_utf32,
JSON_parser.c also needs to be modified.

One candidate name for that variable is
unsigned int *code_codes.

http://www.unicode.org/glossary/#code_unit


I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-patch

 if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {
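
The fragment above writes a control character as \u followed by four hex digits, one nibble at a time. A rough PHP rendering of the same idea follows; escape_code_unit() is a hypothetical name, and the lowercase digit table is an assumption chosen to match what json_encode prints.

```php
<?php
// Hypothetical PHP rendering of the C fragment: emit \u plus four hex digits
// for a code unit, taking one nibble at a time from a digit table.
function escape_code_unit($us)
{
    $digits = '0123456789abcdef';
    return '\u'
        . $digits[($us >> 12) & 0xf]
        . $digits[($us >> 8) & 0xf]
        . $digits[($us >> 4) & 0xf]
        . $digits[$us & 0xf];
}

echo escape_code_unit(0x1f), "\n"; // \u001f
```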


Previous Comments:

[2013-07-15 07:31:49] r...@php.net

> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE combination already works
with my patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now,
but converting to utf16, then back to utf8 seems really... messy. Need
something simpler.

Notice: this bug is only for json_encode. Other issues have their own bugs for
tracking (especially the json_decode one, as I don't plan to alter it)



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-19 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

Another way to improve performance is to use php_next_utf8_char directly
in json_escape_string when PHP_JSON_NOTUTF8_SUBSTITUTE
or PHP_JSON_NOTUTF8_IGNORE is set.
This saves one loop compared with using json_utf8_to_utf16.
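
The single-loop idea can be sketched in userland PHP. This is only an illustration under stated assumptions: substitute_ill_formed() is a hypothetical function, it is neither the patch nor php_next_utf8_char, and its policy of writing one U+FFFD per rejected byte is a choice made here for simplicity.

```php
<?php
// Hypothetical single-pass sketch. Walks the bytes once, copies well-formed
// sequences and writes U+FFFD for every byte that cannot start or complete one.
function substitute_ill_formed($s)
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                       // ASCII
            $need = 0; $cp = $b; $min = 0;
        } elseif ($b >= 0xC2 && $b <= 0xDF) {  // 2-byte lead
            $need = 1; $cp = $b & 0x1F; $min = 0x80;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {  // 3-byte lead
            $need = 2; $cp = $b & 0x0F; $min = 0x800;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {  // 4-byte lead
            $need = 3; $cp = $b & 0x07; $min = 0x10000;
        } else {
            $need = -1;                        // stray continuation or invalid lead
        }
        // All continuation bytes must be present and of the form 10xxxxxx.
        $ok = $need >= 0 && $i + $need < $len;
        for ($k = 1; $ok && $k <= $need; $k++) {
            $c = ord($s[$i + $k]);
            $ok = ($c & 0xC0) === 0x80;
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        $ok = $ok && $cp >= $min && $cp <= 0x10FFFF
            && !($cp >= 0xD800 && $cp <= 0xDFFF);
        if ($ok) {
            $out .= substr($s, $i, $need + 1);
            $i += $need + 1;
        } else {
            $out .= "\xEF\xBF\xBD";
            $i++;                              // resynchronise on the next byte
        }
    }
    return $out;
}

echo bin2hex(substitute_ill_formed("a\x80b")), "\n"; // 61efbfbd62
```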



-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1


Req #65082 [Asn]: json_encode's option for replacing ill-formd byte sequences with substitute cha

2013-07-21 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID: 65082
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:json_encode's option for replacing ill-formd byte
 sequences with substitute cha
 Status: Assigned
 Type:   Feature/Change Request
 Package:JSON related
 Operating System:   All
 PHP Version:5.5.0
 Assigned To:remi
 Block user comment: N
 Private report: N

 New Comment:

I created a repo for the patches and the benchmark report.

https://github.com/masakielastic/patches/tree/master/php_bugs_65082

The benchmarks show no difference between json_utf8_to_utf16 and json_utf8_to_utf32.

Using json_utf8_to_utf32, or using php_next_utf8_char directly
in json_escape_string, is the better choice for
JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE.

php_next_utf8_char in json_escape_string is a bit faster than
json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE.

https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_utf8_char_in_json_escape_string.patch
https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_utf8_char_in_json_escape_string.c


Previous Comments:

[2013-07-19 16:46:49] masakielastic at gmail dot com

Another way of perfomance improvemnet is using php_next_utf8_char directly 
in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE 
and PHP_JSON_NOTUTF8_IGNORE. 
This way reduces one loop compared with using json_utf8_to_utf16.


[2013-07-19 16:33:24] masakielastic at gmail dot com

I agree with you on isolated surrogate pairs.

The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and 
JSON_NOTUTF8_IGNORE must be contained 
since json_decode uses json_utf8_to_utf16. 

https://github.com/php/php-src/blob/master/ext/json/json.c#L673

I already posted the test cases.

https://gist.github.com/masakielastic/5973095#file-04-test-php-L26

"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, 
JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)


One way to improve performance is to add json_utf8_to_utf32. 
I posted another patch.

https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-patch

I introduced an unsigned int *utf32 variable 
so that the existing unsigned short *utf16 type does not have to change.

If you want a common variable 
for json_utf8_to_utf16 and json_utf8_to_utf32, 
JSON_parser.c also needs to be modified.

One candidate for the variable name is 
unsigned int *code_codes.

http://www.unicode.org/glossary/#code_unit


I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-patch

 if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {

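The four-digit escape built by the added lines above can be sketched in isolation as follows. This is a minimal illustration, not the patch itself: json_u_escape is a hypothetical name, a plain char buffer stands in for smart_str, and the lowercase hex table is an assumption about the contents of digits.

```c
#include <stddef.h>

/* Write the JSON escape \uXXXX for a 16-bit code unit into out.
   out must have room for 7 bytes (6 characters plus the terminating NUL). */
void json_u_escape(unsigned short us, char out[7])
{
    /* Assumed to match the digits table used by the patch. */
    static const char digits[] = "0123456789abcdef";

    out[0] = '\\';
    out[1] = 'u';
    out[2] = digits[(us >> 12) & 0xf];  /* highest nibble first */
    out[3] = digits[(us >> 8) & 0xf];
    out[4] = digits[(us >> 4) & 0xf];
    out[5] = digits[us & 0xf];
    out[6] = '\0';
}
```

For a control character such as 0x1f the buffer then holds the six characters \u001f, so characters below 0x20 stay escaped even when PHP_JSON_UNESCAPED_UNICODE is set.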

[2013-07-15 07:31:49] r...@php.net

> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my 
patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, 
but converting to utf16, then back to utf8 seems really... messy. Need 
something simpler.

Note: this bug is only for json_encode. Other issues have their own bugs for 
tracking (especially the json_decode one, as I don't plan to alter it).


[2013-07-14 12:45:47] masakielastic at gmail dot com

As for JSON_NOTUTF8_IGNORE, the manual needs a security note 
like the one for htmlspecialchars's ENT_IGNORE.

http://www.php.net/manual/en/function.htmlspecialchars.php

That's why I didn't suggest JSON_IGNORE in the draft and linked to the 
Escaping RFC as a resource.

UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-J.+Eliminate+noncharacter+code+points+before+validation



[PHP-BUG] Req #65323 [NEW]: improvement for counting ill-formed byte sequences

2013-07-24 Thread masakielastic at gmail dot com
From: masakielastic at gmail dot com
Operating system: 
PHP version:  5.5.1
Package:  Strings related
Bug Type: Feature/Change Request
Bug description:improvement for counting ill-formed byte sequences  

Description:

Consider how many substitute characters (U+FFFD) are emitted
when the valid range of the second byte of a UTF-8 sequence is narrower
than 0x80 - 0xBF (for example 0xA0 - 0xBF).

//     Code Points        First Byte  Second Byte  Third Byte  Fourth Byte
//   U+0800 -   U+0FFF    E0          A0 - BF      80 - BF
//   U+D000 -   U+D7FF    ED          80 - 9F      80 - BF
//  U+10000 -  U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF    F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended practice described in "Table 3-8. Use of
U+FFFD in UTF-8 Conversion" of The Unicode Standard,
"\xE0\x80" should be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD".
The actual result is "\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte
against the range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php
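
The check of the second byte against the first byte, and the resulting count of U+FFFD, can be sketched in standalone C as follows. This is an illustration under assumptions, not the linked patch: utf8_second_byte_ok, utf8_seq_len and count_substitutions are hypothetical names, and the sketch counts one U+FFFD per maximal subpart of an ill-formed sequence, which is how the Table 3-8 example is read here.

```c
#include <stddef.h>

/* Is b2 a valid second byte after lead byte b1?
   The narrow ranges are the four rows of the table above. */
static int utf8_second_byte_ok(unsigned char b1, unsigned char b2)
{
    switch (b1) {
    case 0xE0: return b2 >= 0xA0 && b2 <= 0xBF;  /* U+0800 - U+0FFF */
    case 0xED: return b2 >= 0x80 && b2 <= 0x9F;  /* U+D000 - U+D7FF */
    case 0xF0: return b2 >= 0x90 && b2 <= 0xBF;  /* U+10000 - U+3FFFF */
    case 0xF4: return b2 >= 0x80 && b2 <= 0x8F;  /* U+100000 - U+10FFFF */
    default:   return b2 >= 0x80 && b2 <= 0xBF;
    }
}

/* Sequence length announced by a lead byte, or 0 if b1 cannot start one. */
static size_t utf8_seq_len(unsigned char b1)
{
    if (b1 < 0x80) return 1;
    if (b1 >= 0xC2 && b1 <= 0xDF) return 2;
    if (b1 >= 0xE0 && b1 <= 0xEF) return 3;
    if (b1 >= 0xF0 && b1 <= 0xF4) return 4;
    return 0;
}

/* Number of U+FFFD characters a converter would emit for s,
   counting one per maximal subpart of an ill-formed sequence. */
size_t count_substitutions(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t need = utf8_seq_len(s[i]);
        size_t got = 1;

        if (need == 0) {          /* stray continuation byte or invalid lead */
            count++;
            i++;
            continue;
        }
        if (need > 1 && i + 1 < len && utf8_second_byte_ok(s[i], s[i + 1])) {
            got = 2;
            while (got < need && i + got < len &&
                   s[i + got] >= 0x80 && s[i + got] <= 0xBF)
                got++;
        }
        if (got < need)           /* truncated or broken sequence */
            count++;
        i += got;                 /* skip the bytes consumed so far */
    }
    return count;
}
```

For "\xE0\x80" the lead byte E0 is rejected on its own because 80 lies outside A0 - BF, and the 80 that follows is counted separately, giving two substitutions.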

Test script:
---
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:

bool(true)
bool(true)

Actual result:
--
bool(false)
bool(false)

-- 
Edit bug report at https://bugs.php.net/bug.php?id=65323&edit=1
-- 



Req #65323 [Opn]: improvement for counting ill-formed byte sequences

2013-07-24 Thread masakielastic at gmail dot com
Edit report at https://bugs.php.net/bug.php?id=65323&edit=1

 ID: 65323
 User updated by:masakielastic at gmail dot com
 Reported by:masakielastic at gmail dot com
 Summary:improvement for counting ill-formed byte sequences
 Status: Open
 Type:   Feature/Change Request
 Package:Strings related
 PHP Version:5.5.1
 Block user comment: N
 Private report: N

 New Comment:

"Table 3-8. Use of U+FFFD in UTF-8 Conversion" of The Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf


Previous Comments:

[2013-07-24 10:59:34] masakielastic at gmail dot com

Description:

Consider how many substitute characters (U+FFFD) are emitted
when the valid range of the second byte of a UTF-8 sequence is narrower
than 0x80 - 0xBF (for example 0xA0 - 0xBF).

//     Code Points        First Byte  Second Byte  Third Byte  Fourth Byte
//   U+0800 -   U+0FFF    E0          A0 - BF      80 - BF
//   U+D000 -   U+D7FF    ED          80 - 9F      80 - BF
//  U+10000 -  U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF    F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended practice described in "Table 3-8. Use of U+FFFD in 
UTF-8 Conversion" of The Unicode Standard,
"\xE0\x80" should be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD".
The actual result is "\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte 
against the range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php

Test script:
---
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:

bool(true)
bool(true)

Actual result:
--
bool(false)
bool(false)






-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65323&edit=1