Edit report at https://bugs.php.net/bug.php?id=63732&edit=1

ID:                 63732
User updated by:    jmichae3 at yahoo dot com
Reported by:        jmichae3 at yahoo dot com
Summary:            unicode strings not handled correctly
Status:             Not a bug
Type:               Bug
Package:            Scripting Engine problem
Operating System:   linux
PHP Version:        5.3.19
Block user comment: N
Private report:     N

New Comment:

this code might be more useful, so I am going to give it to you. I know there are UTF-16 and UTF-32 and such. if strings can handle encodings like that, there really should be an internal function that handles this internally as well. although this is useful and I can use it, it is a workaround rather than a real and complete solution for multiple encodings such as you would find listed with mb_list_encodings().

//returns ordinal value of character in string $str at $index
//and increments $index past current utf-8 character.
//handles 1- to 3-byte sequences only and assumes valid UTF-8 input.
function utf8_ord_next_char($str, &$index) {
    $b0 = ord($str[$index + 0]);
    if ($b0 < 0x80) { //single-byte (ASCII) character
        $index++;
        return $b0;
    }
    $b1 = ord($str[$index + 1]);
    if ($b0 < 0xE0) { //2-byte sequence
        $index += 2;
        return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
    }
    //3-byte sequence: read the third byte before advancing $index
    $b2 = ord($str[$index + 2]);
    $index += 3;
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + ($b2 & 0x3F);
}

so for detecting non-ascii languages,

//detect foreign languages
for ($i = 0; $i < strlen($comment);) {
    if (utf8_ord_next_char($comment, $i) > 126) {
        echo "<div style='color:red;'>ERRORb</div>";
        return true; //error
    }
}

Previous Comments:
------------------------------------------------------------------------
[2012-12-12 02:38:50] ras...@php.net

Personally I'd just convert from utf8 to ISO-8859-1 or whichever encoding you are looking for here instead of checking each character. But if you really do want to do it, it isn't very hard.
You just need to understand what UTF-8 looks like and it becomes a simple 5-line function in userspace:

function utf8_ord($c) {
    $b0 = ord($c[0]);
    if ($b0 < 0x80) return $b0; // single-byte (ASCII) character
    $b1 = ord($c[1]);
    if ($b0 < 0xE0) return (($b0 & 0x1F) << 6) + ($b1 & 0x3F); // 2-byte sequence
    // 3-byte sequence (4-byte sequences are not handled)
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}

But you have to understand that there is absolutely no way to accurately detect the encoding of a short sequence of bytes. The above will work if you know the input is UTF-8. There is no way to write a magic function which will tell you the encoding from a couple of bytes of data, which you seem to imply we should provide you.

------------------------------------------------------------------------
[2012-12-12 00:34:00] jmichae3 at yahoo dot com

if you were to take the time to do the research, there is no function in PHP except ord() for converting a character [from a string] to a number. maybe strings need to be handled differently internally in php to handle UNICODE. or maybe ord simply needs to be rewritten so it works no matter what character encoding is thrown at it. it would be difficult, but extremely useful, since it is the only function.

I took the time to look through the mb functions. there was nothing to help me. there wasn't a compare function, or any other way to compare. I consider a function like that to be crucial if relops are not safe or capable of doing it. if that is the case, please make one, and an mb function for returning the ordinal value of an mb char. the functionality is just not there. thanks. much appreciated.

unicode/mb-related bug database stuff:
https://bugs.php.net/bug.php?id=49439
https://bugs.php.net/bug.php?id=63732
just search the database for anything with mb_encode or unicode. there are a number of bugs related to this problem.

------------------------------------------------------------------------
[2012-12-11 22:22:24] ras...@php.net

This is a bug reporting system.
You reported a bug on a function that is behaving as intended and as documented. This is not a support forum. There are plenty of ways to do what you need. Start by reading about the mbstring functions.

------------------------------------------------------------------------
[2012-12-11 17:22:40] jmichae3 at yahoo dot com

it may be documented behavior, but it still doesn't provide a solution to the problem.

------------------------------------------------------------------------
[2012-12-10 02:24:33] ahar...@php.net

PHP strings are effectively byte arrays, and ord() only looks at the first byte. This is documented behaviour.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=63732

--
Edit this bug report at https://bugs.php.net/bug.php?id=63732&edit=1
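
The userspace decoders quoted in the comments above handle sequences of up to three bytes and assume well-formed input. As a minimal sketch, assuming the input is meant to be UTF-8, a variant that also covers 4-byte sequences and rejects truncated or malformed bytes might look like the following. The names utf8_next_codepoint and contains_non_ascii are hypothetical helpers, not PHP built-ins, and the sketch does not reject overlong encodings or surrogate code points.

```php
<?php
// Hypothetical helper (not a PHP built-in, not from the original report).
// Decodes the UTF-8 character starting at byte offset $index and advances
// $index past it. Returns the code point as an int, or false when the
// bytes at $index are not a complete UTF-8 sequence; $index is left
// unchanged in that case, so callers must stop or skip ahead themselves.
function utf8_next_codepoint($str, &$index)
{
    $len = strlen($str);
    if ($index >= $len) {
        return false;
    }

    $b0 = ord($str[$index]);
    if ($b0 < 0x80) {                    // 0xxxxxxx: single-byte ASCII
        $need = 0;
        $cp = $b0;
    } elseif (($b0 & 0xE0) === 0xC0) {   // 110xxxxx: 2-byte sequence
        $need = 1;
        $cp = $b0 & 0x1F;
    } elseif (($b0 & 0xF0) === 0xE0) {   // 1110xxxx: 3-byte sequence
        $need = 2;
        $cp = $b0 & 0x0F;
    } elseif (($b0 & 0xF8) === 0xF0) {   // 11110xxx: 4-byte sequence
        $need = 3;
        $cp = $b0 & 0x07;
    } else {
        return false;                    // stray continuation or invalid lead byte
    }

    if ($index + $need >= $len) {
        return false;                    // sequence is truncated
    }

    for ($k = 1; $k <= $need; $k++) {
        $b = ord($str[$index + $k]);
        if (($b & 0xC0) !== 0x80) {
            return false;                // continuation bytes must be 10xxxxxx
        }
        $cp = ($cp << 6) | ($b & 0x3F);
    }

    $index += $need + 1;
    return $cp;
}

// Hypothetical helper: true if $str contains any byte outside 0x00-0x7F.
// In UTF-8 every non-ASCII character consists only of such bytes, so no
// decoding is needed just to detect non-ASCII input.
function contains_non_ascii($str)
{
    return preg_match('/[^\x00-\x7F]/', $str) === 1;
}
```

If the goal is only to flag comments that contain non-ASCII text, the byte-level check in contains_non_ascii is likely enough; the decoder is useful only when the actual code point values matter.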