Edit report at https://bugs.php.net/bug.php?id=45993&edit=1

 ID:                 45993
 Comment by:         Apollo880 at gmail dot com
 Reported by:        mtrojan at transline dot de
 Summary:            mb_detect_encoding and mb_check_encoding results are
                     dissonant
 Status:             Open
 Type:               Bug
 Package:            mbstring related
 Operating System:   Windows XP
 PHP Version:        5.2.6
 Block user comment: N
 Private report:     N

 New Comment:

Detecting the encoding with mb_check_encoding also gives a wrong result:

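function detect_enc($str)
{
        // Start from every encoding mbstring knows about.
        $awe = mb_list_encodings();
        // Drop the first three entries of that list.
        unset($awe[0], $awe[1], $awe[2]);
        foreach ($awe as $enctype)
        {
                // Return the first encoding for which $str is valid.
                if (mb_check_encoding($str, $enctype) === true) return $enctype;
        }
        return false;
}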

echo detect_enc('String_encoded_to_Windows-1251'); // Returns 'byte2be', which is wrong.
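
A possible workaround, as a minimal sketch only: check an explicit, ordered
list of candidates instead of everything mb_list_encodings() returns, so that
an encoding accepting arbitrary bytes cannot match first. The helper name
detect_enc_from and the candidate list are assumptions for illustration, not
part of mbstring. Strict encodings such as UTF-8 should come before
single-byte ones.

```php
<?php
// Hypothetical helper: return the first candidate for which $str is valid.
function detect_enc_from($str, array $candidates)
{
    foreach ($candidates as $enctype) {
        if (mb_check_encoding($str, $enctype)) {
            return $enctype;
        }
    }
    return false;
}

// Cyrillic sample text as Windows-1251 bytes, written out explicitly so the
// example does not depend on the encoding of this source file.
$str = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo detect_enc_from($str, array('UTF-8', 'Windows-1251')), "\n";
```

This only narrows the problem: a single-byte encoding accepts nearly every
byte sequence, so the order of the list decides the result.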


Previous Comments:
------------------------------------------------------------------------
[2008-11-10 07:30:32] mtrojan at transline dot de

Of course, comparing the beginning of a file with the UTF-16 BOM can be used to 
detect UTF-16 encoding. But what do you do with UTF-16 encoded files where no 
BOM is set?
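
One heuristic for BOM-less input, as a rough sketch: text that is mostly
ASCII or Latin has a NUL byte in every second position when stored as UTF-16,
at odd offsets for little endian and at even offsets for big endian. The
function name guess_utf16_byte_order and the thresholds are assumptions for
illustration. It is not part of mbstring and will not work for text that is
mostly non-Latin.

```php
<?php
// Hypothetical heuristic: guess the byte order of BOM-less UTF-16 by
// counting NUL bytes at even and odd offsets.
function guess_utf16_byte_order($content)
{
    $len = strlen($content);
    if ($len < 2 || $len % 2 !== 0) {
        return false; // UTF-16 data always has an even byte length
    }

    $nulEven = 0;
    $nulOdd = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($content[$i] === "\0") {
            if ($i % 2 === 0) {
                $nulEven++;
            } else {
                $nulOdd++;
            }
        }
    }

    $units = $len / 2; // number of 16-bit code units
    // Assumed thresholds: NUL in at least half of the units on one side,
    // and in fewer than a tenth of them on the other side.
    if ($nulOdd >= $units / 2 && $nulEven < $units / 10) {
        return 'UTF-16LE';
    }
    if ($nulEven >= $units / 2 && $nulOdd < $units / 10) {
        return 'UTF-16BE';
    }
    return false; // not enough evidence either way
}
```

A false return means the guess is inconclusive, not that the data is
definitely something other than UTF-16.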

------------------------------------------------------------------------
[2008-11-08 02:20:46] hirok...@php.net

mb_detect_encoding does not support detection of the UTF-16/UTF-16BE
encodings. Because UTF-16 is not a byte-stream encoding like UTF-8, it
cannot be detected the same way as the byte-stream encodings.

A file encoded in UTF-16 can be detected easily when it starts with a BOM,
for example:

// FF FE is the little-endian BOM, FE FF the big-endian BOM.
if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
  echo 'UTF-16LE';
} else if ($content[0]==chr(0xfe) && $content[1]==chr(0xff)) {
  echo 'UTF-16BE';
}

------------------------------------------------------------------------
[2008-10-26 23:01:49] j...@php.net

Assigned to the mbstring maintainer.

------------------------------------------------------------------------
[2008-09-04 11:47:39] mtrojan at transline dot de

Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files properly.
Even when mb_check_encoding confirms that a file is valid UTF-16LE,
mb_detect_encoding does not detect the same file as UTF-16 and returns
ISO-8859-1 instead. Enabling or disabling strict mode has no influence on the
result.

Reproduce code:
---------------
$content = file_get_contents($src_path);

$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE', 'ISO-8859-1');

$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";

print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));

Expected result:
----------------
encoding: UTF-16LE
checked: 1

Actual result:
--------------
encoding: ISO-8859-1
checked: 1


------------------------------------------------------------------------


