Edit report at https://bugs.php.net/bug.php?id=45993&edit=1

 ID:                 45993
 Comment by:         Apollo880 at gmail dot com
 Reported by:        mtrojan at transline dot de
 Summary:            mb_detect_encoding and mb_check_encoding results are dissonant
 Status:             Open
 Type:               Bug
 Package:            mbstring related
 Operating System:   Windows XP
 PHP Version:        5.2.6
 Block user comment: N
 Private report:     N

 New Comment:

Bug with correct encoding detection.

function detect_enc($str) {
    $awe = mb_list_encodings();
    unset($awe[0], $awe[1], $awe[2]);
    foreach ($awe as $enctype) {
        if (mb_check_encoding($str, $enctype) === true)
            return $enctype;
    }
    return false;
}

echo detect_enc('String_encoded_to_Windows-1251');
// Return 'byte2be'. It's a fail.


Previous Comments:
------------------------------------------------------------------------
[2008-11-10 07:30:32] mtrojan at transline dot de

Of course, comparing the beginning of a file with the UTF-16 BOM can be
used to detect UTF-16 encoding. But what do you do with UTF-16 encoded
files where no BOM is set?

------------------------------------------------------------------------
[2008-11-08 02:20:46] hirok...@php.net

mb_detect_encoding does not support the UTF-16/UTF-16BE encoding
detection. Because UTF-16 isn't byte stream encoding like UTF-8, we
cannot detect the encoding as other byte stream encoding.

The file encoded in UTF-16 can be detected easily using BOM, it is
like,

if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
    echo 'UTF-16';
} else if ($content[0]==chr(0xfe) && $content[1]==chr(0xff)) {
    echo 'UTF-16BE';
}

------------------------------------------------------------------------
[2008-10-26 23:01:49] j...@php.net

Assigned to the mbstring maintainer.

------------------------------------------------------------------------
[2008-09-04 11:47:39] mtrojan at transline dot de

Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files
properly. Even if it is assured by using mb_check_encoding that a file
is truly UTF-16LE, mb_detect_encoding does not detect the same file as
UTF-16 and is returning ISO-8859-1 instead. Activating/deactivating
strict mode has no influence on the result.

Reproduce code:
---------------
$content = file_get_contents($src_path);
$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE', 'ISO-8859-1');
$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";
print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));

Expected result:
----------------
encoding: UTF-16LE
checked: 1

Actual result:
--------------
encoding: ISO-8859-1
checked: 1

------------------------------------------------------------------------
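
On the question raised in the 2008-11-10 comment (UTF-16 files without a BOM), one possible userland workaround is sketched below. It is only a heuristic, not something mbstring provides: the function name guess_utf16 and the "at least half of the code units" threshold are assumptions made for illustration. The idea is that in text made up mostly of ASCII or Latin-1 characters, UTF-16LE puts a NUL byte at most odd offsets and UTF-16BE at most even offsets. Text in other scripts (Cyrillic, CJK, ...) carries few NUL bytes, so the function returns false for it rather than guessing.

```php
<?php
// Hypothetical helper, not part of mbstring: guesses the UTF-16 byte order
// of a byte string, first from a BOM, then from where the NUL bytes fall.
function guess_utf16($bytes)
{
    // A BOM settles it: FF FE is little-endian, FE FF is big-endian.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }

    $len = strlen($bytes);
    // UTF-16 data always has an even, non-zero byte length.
    if ($len < 2 || $len % 2 !== 0) {
        return false;
    }

    // Count NUL bytes at even and at odd offsets.
    $evenNul = 0;
    $oddNul = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($bytes[$i] === "\x00") {
            if ($i % 2 === 0) {
                $evenNul++;
            } else {
                $oddNul++;
            }
        }
    }

    // Assumed threshold: one side holds a NUL in at least half of the
    // 16-bit code units while the other side stays below that.
    $units = $len / 2;
    if ($oddNul * 2 >= $units && $evenNul * 2 < $units) {
        return 'UTF-16LE';
    }
    if ($evenNul * 2 >= $units && $oddNul * 2 < $units) {
        return 'UTF-16BE';
    }

    // Not enough evidence either way.
    return false;
}
```

When the guess succeeds, the result could be passed on, for example as mb_convert_encoding($content, 'UTF-8', $enc). When it returns false, the caller still has to fall back to mb_detect_encoding or to an encoding known from elsewhere.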