Edit report at https://bugs.php.net/bug.php?id=63732&edit=1

 ID:                 63732
 Updated by:         ahar...@php.net
 Reported by:        jmichae3 at yahoo dot com
 Summary:            unicode strings not handled correctly
-Status:             Open
+Status:             Not a bug
 Type:               Bug
 Package:            Scripting Engine problem
 Operating System:   linux
 PHP Version:        5.3.19
 Block user comment: N
 Private report:     N

 New Comment:

PHP strings are effectively byte arrays, and ord() only looks at the first
byte. This is documented behaviour.


Previous Comments:
------------------------------------------------------------------------
[2012-12-09 07:38:05] jmichae3 at yahoo dot com

Description:
------------
I am getting Russian characters in my email forms. I want to compare the
characters to see if they are > '~', which is the last visible character in
the ASCII character set. This comparison does not work. In Unicode, these
characters are around 1024, and ~ is 126 according to ord(). ord() thinks
EVERY character is ASCII. This is far from true: there are multibyte
characters in UTF-8.

These are random Russian characters from charmap:
ÐÏÐγÏÐÐÐЫЫÐÐÐÑмдп

In fact, I don't have any working way to detect whether a character is
KOI8-R, ASCII, or Cyrillic, or whether the character's ordinal number is
actually beyond 127 or not, because according to ord() it's all within 0-255.

Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
    if (substr_compare($s[$i],"~",0,1) > 0) {
        echo "OK";
    } else {
        echo "fail";
    }
    if (ord($s[$i]) > 126) {
        echo "OK";
    } else {
        echo "fail";
    }
    if ($s[$i] > '~') {
        echo "OK";
    } else {
        echo "fail";
    }
    echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.

Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>

Expected result:
----------------
Whole characters handled as a single unit. 1 result.

Actual result:
--------------
Got 2 results from 1 Unicode Russian character in a string. Should only get 1.
This file was encoded as UTF-8 without a BOM. PHP is splitting the UTF-8
characters into a byte stream when it gets to strlen(), or it just treats
Unicode and UTF-8 characters like ASCII. This does not work well when trying
to use mb_detect_encoding(): breaking up characters like that breaks the
ability to detect encodings, and nearly everything else with strings. It also
breaks the ability to detect foreign spam.


------------------------------------------------------------------------
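
For reference, a minimal sketch of the byte-versus-character distinction described in the maintainer's comment above, and of one possible way to do the checks the reporter asks for. It assumes the source file is saved as UTF-8 and that the mbstring and PCRE extensions are loaded. The helper names (utf8_chars, utf8_codepoint, has_non_ascii, has_cyrillic) are hypothetical and are not part of PHP or of this report.

```php
<?php
// Sketch only. Assumes UTF-8 source/input and the mbstring + PCRE extensions.
// All function names defined here are hypothetical helpers.

$s = "п"; // CYRILLIC SMALL LETTER PE, U+043F; UTF-8 bytes 0xD0 0xBF

// Byte-oriented functions see two bytes, which is why the test script
// in the report loops twice.
$byteCount  = strlen($s);   // number of bytes
$firstByte  = ord($s[0]);   // value of the first byte only
$secondByte = ord($s[1]);   // value of the second byte

// mbstring counts characters when told the encoding.
$charCount = mb_strlen($s, 'UTF-8');

// Split a UTF-8 string into whole characters.
// Note: preg_split() returns false if $str is not valid UTF-8.
function utf8_chars($str)
{
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

// Unicode code point of a single UTF-8 character, obtained by converting
// it to 4-byte big-endian UCS-4 and reading that as an unsigned integer.
function utf8_codepoint($char)
{
    $ucs4  = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    $parts = unpack('N', $ucs4);
    return $parts[1];
}

// True if any byte is outside the 7-bit ASCII range.
function has_non_ascii($str)
{
    return preg_match('/[^\x00-\x7F]/', $str) === 1;
}

// True if the UTF-8 string contains a character in the Cyrillic script.
function has_cyrillic($str)
{
    return preg_match('/\p{Cyrillic}/u', $str) === 1;
}

foreach (utf8_chars($s) as $char) {
    printf("%s => U+%04X\n", $char, utf8_codepoint($char));
}
```

With these helpers, a check such as `utf8_codepoint($char) > 126` compares the whole character rather than a single byte. The sketch does not cover input in other encodings such as KOI8-R, which would first need to be identified and converted to UTF-8.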