From:             lyngvi at gmail dot com
Operating system: linux
PHP version:      5.3.0
PHP Bug Type:     DOM XML related
Bug description:  DOMDocument::loadHTML should have a way to override charset

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to
DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to
use the same feature, though fixing the XML charset would be easier than
HTML's.

Requiring the charset to be specified as a meta http-equiv content-type
inside the raw HTML data is clumsy, especially since HTML is often so poorly
formed. Generally I try to know my charset a priori, a good practice usually,
but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site
serving proper UTF-8 data, with the *HTTP* Content-Type "text/html;
charset=utf-8", but with the redundant meta http-equiv reporting charset
iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally,
there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php
header("Content-Type: text/html; charset=utf-8");
$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’
</body>
</html>
HTMLDATA;
$doc = DOMDocument::loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;
?>

Expected result:
----------------
this is a utf8 apostrophe: ’

(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my
text)

Actual result:
--------------
this is a utf8 apostrophe: ’

(I get an a with a ^ on top, and the illegal characters \u0080 and \u0099 -
that is, loadHTML treated the UTF-8 bytes of \u2019 (e2 80 99) as three
iso-8859-1 characters and re-encoded them to get \u00e2 \u0080 \u0099
(c3 a2 c2 80 c2 99))

Edit bug report at http://bugs.php.net/?id=49705&edit=1
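
Until loadHTML accepts a charset override, one possible workaround for the situation described above is to make the input pure ASCII before parsing, so that whatever charset the meta tag claims no longer matters. The sketch below is only an illustration under stated assumptions: the helper name loadHtmlAsUtf8 is hypothetical (it is not part of DOMDocument), the input is assumed to really be UTF-8, and the mbstring extension is assumed to be available. It rewrites every non-ASCII character as a numeric character reference with mb_encode_numericentity and then calls the ordinary loadHTML.

```php
<?php
// Hypothetical helper, not part of DOMDocument: parse HTML whose bytes are
// known to be UTF-8, whatever charset a <meta> declaration claims.
function loadHtmlAsUtf8($html)
{
    // Rewrite every character from U+0080 upwards as a decimal numeric
    // character reference (U+2019 becomes &#8217;). The result is pure
    // ASCII, so the parser's charset guess can no longer corrupt it.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    // Collect parser complaints about sloppy markup instead of emitting
    // PHP warnings, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// UTF-8 data with a misleading iso-8859-1 meta tag, as in the report.
// "\xE2\x80\x99" is the UTF-8 encoding of U+2019.
$html = '<html><head><title>test</title>'
      . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />'
      . '</head><body>this is a utf8 apostrophe: ' . "\xE2\x80\x99" . '</body></html>';

$doc = loadHtmlAsUtf8($html);
$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text, "\n";
```

With this approach the body text should come back with the apostrophe intact as the bytes e2 80 99. One limitation to keep in mind: character references are not decoded inside script and style elements, so non-ASCII text there would remain as literal "&#...;" sequences.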