Edit report at http://bugs.php.net/bug.php?id=34776&edit=1
 ID:                 34776
 Comment by:         me+phpbugs at ryanmccue dot info
 Reported by:        narzeczony at zabuchy dot net
 Summary:            mb_convert_encoding() - wrong convertion from UTF-16 (problem with BOM)
 Status:             No Feedback
 Type:               Bug
 Package:            mbstring related
 Operating System:   Linux, Windows
 PHP Version:        5.0.5
 Block user comment: N
 Private report:     N

 New Comment:

We're also able to reproduce this, with a much smaller test case:

Reproduce code:
---------------
mb_convert_encoding("\xfe\xff\x22\x1e", 'UTF-8', 'UTF-16');

Expected result:
----------------
\xe2\x88\x9e

Actual result:
--------------
\xef\xbb\xbf\xe2\x88\x9e


Previous Comments:
------------------------------------------------------------------------
[2008-02-18 17:20:00] jdephix at polenord dot com

I forgot to add that I did manage to deal with the UTF-16BE file by
reversing everything.

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', "UTF-16LE");

//some operations on $s

file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16LE', "UTF-8"));

I need to specify "UTF-16LE" in order to be sure I work with
"UTF-16BE".

------------------------------------------------------------------------
[2008-02-18 17:16:32] jdephix at polenord dot com

UTF-16LE and UTF-16BE seem mixed up when using mb_convert_encoding.

I want to read the content of a file in UTF-16BE (starts with \xFE\xFF)
and convert it into UTF-8:

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', "UTF-16BE");

//some operations on $s

file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16BE', "UTF-8"));

The second file is in Little Endian (starts with \xFF\xFE)!!!

I have to specify LE if I want BE.

file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16LE', "UTF-8"));

How come it's reversed?
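
One way to sidestep both symptoms reported here (the BOM leaking into the output, and the LE/BE confusion) might be to inspect the BOM by hand, strip it, and always pass an explicit byte order to mb_convert_encoding(). The following is only a sketch under some assumptions: utf16_to_utf8() and utf8_to_utf16be_with_bom() are hypothetical helper names, not mbstring functions; input without a BOM is assumed to be big endian; and the type declarations need PHP 7 or later, which is newer than the 5.0.5 this report was filed against.

```php
<?php
// Hypothetical helper (not part of mbstring): convert UTF-16 input to UTF-8.
// The leading BOM, if any, is used only to choose the byte order and is
// removed before conversion, so it cannot end up in the result.
function utf16_to_utf8(string $s): string
{
    if (strncmp($s, "\xFE\xFF", 2) === 0) {
        $encoding = 'UTF-16BE';
        $s = (string) substr($s, 2);
    } elseif (strncmp($s, "\xFF\xFE", 2) === 0) {
        $encoding = 'UTF-16LE';
        $s = (string) substr($s, 2);
    } else {
        // Assumption: with no BOM present, treat the data as big endian.
        $encoding = 'UTF-16BE';
    }

    return mb_convert_encoding($s, 'UTF-8', $encoding);
}

// Hypothetical helper: write UTF-8 back out as UTF-16BE, prepending the
// big-endian BOM explicitly instead of relying on the generic 'UTF-16' name.
function utf8_to_utf16be_with_bom(string $s): string
{
    return "\xFE\xFF" . mb_convert_encoding($s, 'UTF-16BE', 'UTF-8');
}
```

With the test case from the new comment, bin2hex(utf16_to_utf8("\xfe\xff\x22\x1e")) should come out as e2889e, with no efbbbf prefix. Naming the byte order explicitly also avoids depending on how the generic "UTF-16" encoding name treats the BOM.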

------------------------------------------------------------------------
[2006-06-23 16:11:32] markl at lindenlab dot com

There are two problems when mb_convert_encoding is converting from
UTF-16:

1) It is including the (transcoded) BOM in the result, rather than
   stripping it
2) If the source UTF-16 string was little endian, then the second
   character of the conversion will be wrong; it is converted as if
   the character code had 0xFF00 or'd into it.

Problem 1 occurs with any UTF-16 variant (though it is arguably correct
behavior for UTF-16LE and UTF-16BE). Problem 2 only occurs when
converting from UTF-16.

This PHP program demonstrates this all clearly:

function dump($s)
{
    for ($i = 0; $i < strlen($s); ++$i) {
        echo substr(dechex(256+ord(substr($s, $i, 1))), 1, 2), ' ';
    }
    var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";
    // these strings are both valid UTF-16, the BOM at the start indicates
    // the endianness. We don't expect the BOM to be included in a conversion

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", "utf-16le", "utf-16be");

foreach ($encodings as $enc) {
    echo "Converting to $enc:\n";
    dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
    dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
    echo "\n";
}

------------------------------------------------------------------------
[2005-10-15 01:00:03] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

------------------------------------------------------------------------
[2005-10-07 21:58:46] sni...@php.net

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz

For Windows:

  http://snaps.php.net/win32/php5-win32-latest.zip

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    http://bugs.php.net/bug.php?id=34776