From:             jmichae3 at yahoo dot com
Operating system: linux
PHP version:      5.3.19
Package:          Scripting Engine problem
Bug Type:         Bug
Bug description:  Unicode strings not handled correctly

Description:
------------
I am getting Russian characters in my email forms. I want to compare the
characters to see if they are > '~', which is the last visible character in
the ASCII character set.
This comparison does not work. In Unicode, these characters have code points
around 1024, and ~ is 126 according to ord().

ord() thinks EVERY character is ASCII. This is far from true: there are
multibyte characters in UTF-8.
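
To make the byte-versus-character distinction concrete, here is a minimal
sketch. The helper name utf8_bytes() is made up for illustration, and
mb_strlen() assumes the mbstring extension is loaded. The character is written
as explicit bytes so the result does not depend on how the file is saved.

```php
<?php
// Hypothetical helper: return the value of every byte in a string.
function utf8_bytes($s) {
    // str_split() cuts the string into single bytes; ord() maps each to 0-255.
    return array_map('ord', str_split($s));
}

// "п" (U+043F) as explicit UTF-8 bytes.
$s = "\xD0\xBF";

echo strlen($s), "\n";                    // counts bytes
echo mb_strlen($s, 'UTF-8'), "\n";        // counts characters
echo implode(' ', utf8_bytes($s)), "\n";  // 208 191
```

The two byte values are the same 208 and 191 that the test script below
prints.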

These are random Russian characters from charmap:
ЋϊЁγϋГИБЫЫЏАДрмдп

In fact, I don't have any working way to detect whether a character is
KOI8-R, ASCII, or Cyrillic, or whether the character's ordinal number is
actually beyond 127, because according to ord(), everything is within
0-255.
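
One possible way to run such a check is with preg_match(). This is only a
sketch: the function names has_non_ascii() and has_cyrillic_utf8() are
invented here, the second one assumes the input is valid UTF-8 and that PCRE
was built with Unicode property support, and neither can tell KOI8-R apart
from other single-byte encodings.

```php
<?php
// Hypothetical helper: true if the string contains any byte above 127.
function has_non_ascii($s) {
    // Without the /u modifier the pattern is matched byte by byte.
    return preg_match('/[^\x00-\x7F]/', $s) === 1;
}

// Hypothetical helper: true if a UTF-8 string contains a Cyrillic letter.
function has_cyrillic_utf8($s) {
    // /u makes PCRE read the subject as UTF-8; preg_match() returns false
    // on invalid UTF-8, which this comparison also treats as "no match".
    return preg_match('/\p{Cyrillic}/u', $s) === 1;
}

$s = "abc \xD0\xBF"; // "abc п" with the Cyrillic letter as explicit bytes

var_dump(has_non_ascii($s));
var_dump(has_cyrillic_utf8($s));
var_dump(has_non_ascii("plain ~ text"));
```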





Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
        if (substr_compare($s[$i],"~",0,1) > 0) {
                echo "OK";
        } else {
                echo "fail";
        }
        if (ord($s[$i]) > 126) {
                echo "OK";
        } else {
                echo "fail";
        }
        if ($s[$i] > '~') {
                echo "OK";
        } else {
                echo "fail";
        }
        echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.
Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>


Expected result:
----------------
Whole characters handled as a single unit: 1 result per character.

Actual result:
--------------
Got 2 results from 1 Unicode Russian character in a string; should only get
1.
This file was encoded as UTF-8 without a BOM.
PHP splits the UTF-8 characters into a byte stream when it gets to
strlen(), or it just treats Unicode and UTF-8 characters like ASCII.
This does not work well when trying to use mb_detect_encoding(): splitting
characters like that breaks the ability to detect encodings, and it affects
nearly everything else involving strings.
It also breaks the ability to detect foreign spam.
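
For comparison, here is a rough sketch of walking a string one character at
a time rather than one byte at a time. The name utf8_code_points() is
hypothetical, the input is assumed to be UTF-8, and the mbstring extension is
assumed to be available; it avoids mb_ord() so that it can run on older PHP
versions.

```php
<?php
// Hypothetical helper: return the Unicode code point of each character
// in a UTF-8 string, or an empty array if the input is not valid UTF-8.
function utf8_code_points($s) {
    // With /u, an empty pattern splits between characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array();
    }
    $points = array();
    foreach ($chars as $ch) {
        // UCS-4BE is a fixed 4-byte big-endian form; 'N' unpacks it
        // as an unsigned 32-bit integer stored under key 1.
        $u = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', 'UTF-8'));
        $points[] = $u[1];
    }
    return $points;
}

// "п~" with the Cyrillic letter as explicit bytes.
foreach (utf8_code_points("\xD0\xBF~") as $cp) {
    echo $cp, ($cp > 126 ? " beyond '~'" : " within ASCII"), "\n";
}
```

Here the Cyrillic letter is reported once, as code point 1087, instead of as
two separate bytes.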

-- 
Edit bug report at https://bugs.php.net/bug.php?id=63732&edit=1