Bug #63732 [Nab]: unicode strings not handled correctly

rasmus Tue, 11 Dec 2012 14:23:05 -0800

Edit report at https://bugs.php.net/bug.php?id=63732&edit=1


 ID:                 63732
 Updated by:         ras...@php.net
 Reported by:        jmichae3 at yahoo dot com
 Summary:            unicode strings not handled correctly
 Status:             Not a bug
 Type:               Bug
 Package:            Scripting Engine problem
 Operating System:   linux
 PHP Version:        5.3.19
 Block user comment: N
 Private report:     N

 New Comment:

This is a bug reporting system. You reported a bug on a function that is
behaving as intended and as documented. This is not a support forum. There are
plenty of ways to do what you need. Start by reading about the mbstring
functions.
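
As a rough illustration of the mbstring route suggested above, the following sketch counts and walks characters instead of bytes. It assumes the mbstring extension is loaded and that the input is UTF-8; the variable names are made up for the example.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded and the input is UTF-8.
$s = "\xD0\xBF"; // UTF-8 bytes of the Cyrillic letter U+043F

$byteLength = strlen($s);             // counts bytes
$charLength = mb_strlen($s, 'UTF-8'); // counts characters

// Walk the string one character at a time instead of one byte at a time.
$chars = array();
for ($i = 0; $i < $charLength; $i++) {
    $chars[] = mb_substr($s, $i, 1, 'UTF-8');
}

echo $byteLength, " byte(s), ", $charLength, " character(s)\n";
```

Passing the encoding explicitly avoids depending on the mbstring internal-encoding setting, which may differ between installations.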


Previous Comments:
------------------------------------------------------------------------
[2012-12-11 17:22:40] jmichae3 at yahoo dot com

it may be documented behavior, but it still doesn't provide a solution to the 
problem.

------------------------------------------------------------------------
[2012-12-10 02:24:33] ahar...@php.net

PHP strings are effectively byte arrays, and ord() only looks at the first 
byte. This is documented behaviour.
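
A small sketch of what "byte array" means here, using the same character as the test script further down the thread. The bytes are written as escapes so the example does not depend on how the file is saved.

```php
<?php
// The two UTF-8 bytes that encode the Cyrillic letter U+043F.
$s = "\xD0\xBF";

$length     = strlen($s);  // number of bytes, not characters
$firstByte  = ord($s);     // ord() only inspects the first byte
$secondByte = ord($s[1]);  // indexing a string yields a single byte
$hex        = bin2hex($s); // hexadecimal dump of the raw bytes

printf("%d bytes: %d %d (hex %s)\n", $length, $firstByte, $secondByte, $hex);
```

The values 208 and 191 are the same two numbers that appear in the output quoted in the original report.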

------------------------------------------------------------------------
[2012-12-09 07:38:05] jmichae3 at yahoo dot com

Description:
------------
I am getting Russian characters in my email forms. I want to compare the
characters to see if they are > '~', which is the last visible character in the
ASCII character set.
This comparison does not work. In Unicode, these characters are about 1024,
and ~ is 126 according to ord().

ord() thinks EVERY character is ASCII. This is far from true. There are
multibyte characters in UTF-8.

These are random Russian characters from charmap:
ÐÏÐÎ³ÏÐÐÐÐ«Ð«ÐÐÐÑÐ¼Ð´Ð¿

In fact, I don't have any working way to detect whether a character is KOI8-R
or ASCII, or Cyrillic, or whether the character ordinal number is actually
beyond 127 or not, because according to ord(), it's all within 0-255.
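
One possible way to answer the two questions above (is anything outside ASCII, and is anything Cyrillic) is with PCRE, which does not need mbstring. This is only a sketch: the helper names has_non_ascii() and has_cyrillic() are hypothetical, and the Cyrillic check assumes the input is UTF-8. A KOI8-R string would pass the first check but not the second.

```php
<?php
// Hypothetical helpers, named for this sketch only.

// True if any byte lies outside 7-bit ASCII. Works on raw bytes,
// so it does not care which encoding the string uses.
function has_non_ascii($s)
{
    return preg_match('/[^\x00-\x7F]/', $s) === 1;
}

// True if the string is valid UTF-8 and contains a Cyrillic-script character.
// With the /u modifier PCRE treats the subject as UTF-8; on invalid UTF-8
// preg_match() returns false, so this function returns false as well.
function has_cyrillic($s)
{
    return preg_match('/\p{Cyrillic}/u', $s) === 1;
}

var_dump(has_non_ascii("hello~"));   // plain ASCII input
var_dump(has_cyrillic("\xD0\xBF"));  // UTF-8 bytes of U+043F
```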





Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
    echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
        if (substr_compare($s[$i],"~",0,1) > 0) {
                echo "OK";
        } else {
                echo "fail";
        }
        if (ord($s[$i]) > 126) {
                echo "OK";
        } else {
                echo "fail";
        }
        if ($s[$i] > '~') {
                echo "OK";
        } else {
                echo "fail";
        }
        echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.
Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>


Expected result:
----------------
whole characters as a single unit. 1 result.

Actual result:
--------------
Got 2 results from 1 Unicode Russian character in a string; should only get 1.
This file was encoded as UTF-8 without a BOM.
PHP is splitting the UTF-8 characters into a byte stream when it gets to
strlen(), or it just treats Unicode and UTF-8 characters like ASCII.
This does not work well when trying to use mb_detect_encoding(): breaking up
characters like that breaks the ability to detect encodings, and nearly
everything else with strings.
This also breaks the ability to detect foreign spam.
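
For comparison, a character-wise variant of the loop in the test script might look like the sketch below. It is not taken from the report: utf8_code_point() is a hypothetical helper, the input is assumed to be valid UTF-8, and the mbstring extension is assumed to be loaded. On PHP 7.2 and later, mb_ord() can replace the helper.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 character.
// Converts to big-endian UCS-4 and reads it as one 32-bit integer.
function utf8_code_point($char)
{
    $ucs4  = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    $parts = unpack('N', $ucs4);
    return $parts[1];
}

$s = "a\xD0\xBF"; // "a" followed by the UTF-8 bytes of U+043F

// Split on character boundaries; /u makes PCRE treat $s as UTF-8.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

$results = array();
foreach ($chars as $ch) {
    $cp = utf8_code_point($ch);
    $results[] = $cp;
    echo $cp, ($cp > 126 ? " is beyond '~'" : " is not beyond '~'"), "\n";
}
```

Each character now produces exactly one result, and the comparison against 126 is made on the code point rather than on an individual byte.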


------------------------------------------------------------------------



