From:             lyngvi at gmail dot com
Operating system: linux
PHP version:      5.3.0
PHP Bug Type:     DOM XML related
Bug description:  DOMDocument::loadHTML should have a way to override charset

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to
DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to
use the same feature, though fixing the XML charset would be easier than
HTML's.
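
Until something like this exists, the effect can be approximated in userland. The following is only a minimal sketch, assuming the mbstring extension is available; loadHtmlWithCharset is a hypothetical helper name, not part of the DOM extension. It converts the input to UTF-8 using the charset the caller supplies, then rewrites every non-ASCII character as a numeric entity so that a conflicting (ASCII-compatible) meta charset no longer matters to the parser.

```php
<?php
// Hypothetical helper approximating the proposed $forceCharset argument.
function loadHtmlWithCharset($data, $forceCharset)
{
    // Normalise to UTF-8 using the charset the caller knows to be correct.
    $utf8 = mb_convert_encoding($data, 'UTF-8', $forceCharset);

    // Replace every non-ASCII character with a numeric entity, so the bytes
    // handed to the parser are pure ASCII and decode the same way under
    // iso-8859-1, utf-8, or any other ASCII-compatible charset.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    $ascii = mb_encode_numericentity($utf8, $map, 'UTF-8');

    $doc = new DOMDocument();
    // Suppress parser warnings about malformed markup.
    @$doc->loadHTML($ascii);
    return $doc;
}

// Sample input: UTF-8 data with a meta tag that wrongly claims iso-8859-1.
$html = '<html><head><meta http-equiv="content-type" '
      . 'content="text/html; charset=iso-8859-1"></head>'
      . "<body>apostrophe: \xE2\x80\x99</body></html>";

$doc = loadHtmlWithCharset($html, 'UTF-8');
echo $doc->getElementsByTagName('body')->item(0)->textContent, "\n";
```

One caveat with this approach: entities are not decoded inside raw-text elements such as script and style, so non-ASCII characters there would need separate handling.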

Requiring the charset to be specified as a meta http-equiv content-type
inside the raw HTML data is clumsy, especially since HTML is often so
poorly formed. Generally I try to know my charset a priori, a good practice
usually, but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site
serving proper UTF-8 data, with the *HTTP* header Content-Type: text/html;
charset=utf-8, but with a redundant meta http-equiv reporting charset
iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case.
Ideally, there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php

header("Content-Type: text/html; charset=utf-8");

$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;"
/>
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;

// loadHTML is an instance method; calling it statically raises E_STRICT.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;

?>



Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate
my text)

Actual result:
--------------
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a circumflex on top, followed by the illegal characters
\u0080 and \u0099. That is, loadHTML read the UTF-8 bytes of \u2019
(e2 80 99) as three ISO-8859-1 characters and re-encoded them as
\u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99).)
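
The byte arithmetic above can be checked independently of DOMDocument. This is a small sketch, assuming the mbstring extension is available: it takes the three UTF-8 bytes of \u2019, treats them as ISO-8859-1 text, and re-encodes them as UTF-8, which is what the observed output suggests loadHTML did.

```php
<?php
// The right single quotation mark U+2019 as UTF-8 bytes: e2 80 99.
$utf8 = "\xE2\x80\x99";

// Misread those three bytes as ISO-8859-1 characters and re-encode each
// one as UTF-8 (U+00E2, U+0080, U+0099).
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // e28099
echo bin2hex($mangled), "\n"; // c3a2c280c299
```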

-- 
Edit bug report at http://bugs.php.net/?id=49705&edit=1