Edit report at https://bugs.php.net/bug.php?id=47875&edit=1
 ID:                 47875
 Comment by:         crmalibu at gmail dot com
 Reported by:        thomas dot koch at ymc dot ch
 Summary:            No option to set HTML input encoding
 Status:             Open
 Type:               Feature/Change Request
 Package:            DOM XML related
 Operating System:   Debian Lenny
 PHP Version:        5.2.9
 Block user comment: N
 Private report:     N

 New Comment:

I also stumbled upon libxml2's htmlSetMetaEncoding() here:
http://www.xmlsoft.org/encoding.html#implemente
and
http://www.xmlsoft.org/html/libxml-HTMLtree.html

This would be a very welcome feature addition. Currently, hacky php code like this festers in the wild due to the lack of being able to specify the encoding:

$encodingHint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($encodingHint . $html); // lol make it utf8

or maybe some str_replace() or use of html tidy if the developer was feeling robust that day...

This really sucks, because to me it looks like the functionality is totally there in libxml2.


Previous Comments:
------------------------------------------------------------------------
[2012-07-04 08:02:05] julien at go-on-web dot com

I have another test case for you, using HTML5 :

<?php

// -----
// FAIL CASE

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(40) "Test case with simple accent (Ã©) : é"

// -----

// -----
// SUCCESS CASE (but invalid html5)

$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(38) "Test case with simple accent (é) : é"

// -----

?>

Regards,
Julien

------------------------------------------------------------------------
[2009-04-02 09:07:32] thomas dot koch at ymc dot ch

Description:
------------
Enhancement request. I need a way to specify the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option.

libxml2 honors the content-type meta tag, but this may not always be present.

How should the input encoding be indicated? In DOMDocument::__construct() or in DOMDocument::encoding, or are the two the same?

One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)

There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;

------------------------------------------------------------------------
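
For reference, the meta-tag workaround quoted in the new comment can be wrapped in a small helper. This is only a sketch under stated assumptions: loadHtmlWithEncoding() is a hypothetical name, not a PHP or libxml2 function; the caller is assumed to already know the input encoding (for example from an HTTP Content-Type header); and libxml2 is assumed to honor a prepended content-type meta tag, as described in the report.

```php
<?php
// Hypothetical helper (not part of PHP or libxml2): parse an HTML string
// whose encoding is known from elsewhere, e.g. an HTTP header.
function loadHtmlWithEncoding(string $html, string $encoding = 'UTF-8'): DOMDocument
{
    // Prepend a content-type meta tag so the HTML parser sees an encoding
    // declaration before it reaches any non-ASCII bytes.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset='
          . $encoding . '">';

    $dom = new DOMDocument();

    // Collect parser messages internally instead of emitting PHP warnings,
    // then restore the previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($hint . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// "\xC3\xA9" is the UTF-8 byte sequence for "é", written out explicitly so
// the example does not depend on the encoding of this source file.
$dom = loadHtmlWithEncoding("<html><body><p id=\"accent\">\xC3\xA9</p></body></html>");
echo $dom->getElementById('accent')->textContent, "\n";
```

Prepending the hint leaves the original markup untouched, but the parser may complain about misplaced tags, which is why the sketch collects libxml errors internally rather than printing them. It remains a workaround; it does not replace the requested option to pass the input encoding to loadHTML directly.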