Bug #63732 [Nab]: unicode strings not handled correctly

rasmus Tue, 11 Dec 2012 18:39:32 -0800

Edit report at https://bugs.php.net/bug.php?id=63732&edit=1


 ID:                 63732
 Updated by:         ras...@php.net
 Reported by:        jmichae3 at yahoo dot com
 Summary:            unicode strings not handled correctly
 Status:             Not a bug
 Type:               Bug
 Package:            Scripting Engine problem
 Operating System:   linux
 PHP Version:        5.3.19
 Block user comment: N
 Private report:     N

 New Comment:

Personally I'd just convert from UTF-8 to ISO-8859-1 or whichever encoding you 
are looking for here instead of checking each character. But if you really do 
want to do it, it isn't very hard. You just need to understand what UTF-8 looks 
like and it becomes a simple 5-line function in userspace:

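function utf8_ord($c) {
    // Handles 1- to 3-byte sequences only (up to U+FFFF); $c must be UTF-8.
    $b0 = ord($c[0]);
    if ($b0 < 0x80) return $b0;  // single byte: plain ASCII
    $b1 = ord($c[1]);
    if ($b0 < 0xE0) return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);  // two bytes
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}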
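
As a rough sketch of how a function like this might be applied to a whole
string: the block below repeats utf8_ord() so that it runs on its own (with the
single-byte boundary written as 0x80) and adds utf8_codepoints(), a
hypothetical helper name that is not part of PHP. It assumes the input is valid
UTF-8 with no characters above U+FFFF.

```php
<?php
// Decodes the first character of $c; 1- to 3-byte UTF-8 sequences only.
function utf8_ord($c) {
    $b0 = ord($c[0]);
    if ($b0 < 0x80) return $b0;
    $b1 = ord($c[1]);
    if ($b0 < 0xE0) return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}

// Hypothetical helper: code points of a string that is known to be UTF-8.
function utf8_codepoints($s) {
    // The /u modifier makes PCRE split on whole UTF-8 characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // PCRE rejected the input as invalid UTF-8
    }
    return array_map('utf8_ord', $chars);
}

// U+043F (the character from the test script) followed by "~".
foreach (utf8_codepoints("\xD0\xBF~") as $cp) {
    echo $cp, $cp > 126 ? " beyond ASCII\n" : " ASCII\n";
}
// prints "1087 beyond ASCII" and then "126 ASCII"
```

On PHP 7.2 and later, mb_ord() from the mbstring extension returns the code
point directly, so the hand-written decoder is only needed on older versions.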

But you have to understand that there is absolutely no way to accurately detect 
the encoding of a short sequence of bytes. The above will work if you know the 
input is UTF-8. There is no way to write a magic function which will tell you 
the encoding from a couple of bytes of data, which is what you seem to imply we 
should provide.
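
To illustrate why a couple of bytes cannot identify their own encoding, here is
a small sketch using the two bytes from the test script. It assumes the
mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
$bytes = "\xD0\xBF"; // two bytes of unknown origin

// They are valid UTF-8 and decode to the single character U+043F.
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(true)

// They can equally be read as a single-byte encoding, where each byte is a
// character of its own. Nothing in the bytes says which reading was intended.
$as_latin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$as_koi8r  = mb_convert_encoding($bytes, 'UTF-8', 'KOI8-R');

echo mb_strlen($bytes, 'UTF-8'), "\n";     // 1
echo mb_strlen($as_latin1, 'UTF-8'), "\n"; // 2
echo bin2hex($as_koi8r), "\n";             // a different result again
```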


Previous Comments:
------------------------------------------------------------------------
[2012-12-12 00:34:00] jmichae3 at yahoo dot com

if you were to take the time to do the research, there is no function in PHP 
except ord() for converting a character [from a string] to a number. maybe 
strings need to be handled differently internally in php to handle UNICODE. or 
maybe ord simply needs to be rewritten so it works no matter what character 
encoding is thrown at it. it would be difficult, but extremely useful, since it 
is the only function. I took the time to look through the mb functions. there 
was nothing to help me. 

I tried looking through the mb functions, there wasn't a compare. there wasn't 
a way to compare. I consider a function like that to be crucial if relops are 
not safe or capable of doing it. if that is the case, please make one, and an 
mb function for returning the ordinal value of an mb char. the functionality is 
just not there. thanks. much appreciated.

unicode/mb-related bug database stuff:
https://bugs.php.net/bug.php?id=49439
https://bugs.php.net/bug.php?id=63732
just search the database for anything with mb_encode or unicode. there are a 
number of bugs related to this problem.

------------------------------------------------------------------------
[2012-12-11 22:22:24] ras...@php.net

This is a bug reporting system. You reported a bug on a function that is 
behaving as intended and as documented. This is not a support forum. There are 
plenty of ways to do what you need. Start by reading about the mbstring 
functions.
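
One possible starting point with mbstring, sketched under the assumption that
the input is meant to be UTF-8: count_non_ascii() is a hypothetical helper
name, not a PHP function, and it relies on the fact that UTF-8 encodes only
ASCII characters as a single byte.

```php
<?php
// Hypothetical helper: number of characters outside ASCII in a UTF-8 string,
// or false if the string is not valid UTF-8.
function count_non_ascii($s) {
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false;
    }
    $count = 0;
    $n = mb_strlen($s, 'UTF-8'); // length in characters, not bytes
    for ($i = 0; $i < $n; $i++) {
        $ch = mb_substr($s, $i, 1, 'UTF-8'); // one whole character
        // A character stored in more than one byte cannot be ASCII.
        if (strlen($ch) > 1) {
            $count++;
        }
    }
    return $count;
}

echo count_non_ascii("ab\xD0\xBF~"), "\n"; // 1
```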

------------------------------------------------------------------------
[2012-12-11 17:22:40] jmichae3 at yahoo dot com

it may be documented behavior, but it still doesn't provide a solution to the 
problem.

------------------------------------------------------------------------
[2012-12-10 02:24:33] ahar...@php.net

PHP strings are effectively byte arrays, and ord() only looks at the first 
byte. This is documented behaviour.
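
A minimal sketch of what that means for the character in the test script,
assuming the mbstring extension is available for the mb_strlen() call:

```php
<?php
$s = "\xD0\xBF"; // UTF-8 bytes of U+043F, one Cyrillic character

echo strlen($s), "\n";             // 2: strlen() counts bytes
echo mb_strlen($s, 'UTF-8'), "\n"; // 1: mb_strlen() counts characters
echo ord($s), "\n";                // 208: ord() sees only the first byte, 0xD0
echo ord($s[1]), "\n";             // 191: $s[1] is the second byte, 0xBF
```

The values 208 and 191 are the two numbers that appear in the output quoted in
the test script.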

------------------------------------------------------------------------
[2012-12-09 07:38:05] jmichae3 at yahoo dot com

Description:
------------
I am getting russian characters in my email forms. I want to compare the 
characters to see if they are > '~' which is the last visible character in the 
ascii character set.
this comparison does not work. in UNICODE, these characters are about 1024,  
and ~ is 126 according to ord().

ord() thinks EVERY character is ascii. this is far from true.  there are mb 
characters from utf8.

this is russian random characters from charmap: 
ÐÏÐÎ³ÏÐÐÐÐ«Ð«ÐÐÐÑÐ¼Ð´Ð¿

in fact, I don't have any working way to detect whether a character is KOI8-R 
or ASCII, or cyrillic, or whether the character ordinal number is actually 
beyond 127 or not. because according to ord(), it's all within 0-255.





Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
    echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
        if (substr_compare($s[$i],"~",0,1) > 0) {
                echo "OK";
        } else {
                echo "fail";
        }
        if (ord($s[$i]) > 126) {
                echo "OK";
        } else {
                echo "fail";
        }
        if ($s[$i] > '~') {
                echo "OK";
        } else {
                echo "fail";
        }
        echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.
Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>


Expected result:
----------------
whole characters as a single unit. 1 result.

Actual result:
--------------
got 2 results from 1 UNICODE russian character in a string. should only get 1. 
this file was encoded with utf8 without bom.
php is splitting the utf8 characters into a byte stream when it gets to 
strlen(). or it just treats unicode and utf8 characters like ascii.
this does not work well when trying to use mb_detect_encoding() - that breaks 
ability to detect encodings when it breaks up characters like that. nearly 
everything with strings actually.
this also breaks ability to detect foreign spam.


------------------------------------------------------------------------


