 ID:               47108
 User updated by:  terrafr...@php.net
 Reported By:      terrafr...@php.net
 Status:           Open
 Bug Type:         DOM XML related
 Operating System: Windows XP
 PHP Version:      5.2.8
 New Comment:

That makes sense. I updated the script to iterate through the problem
characters, and the ones you mentioned are included. Other problem
characters include 0x26, 0x3C, 0x3E, 0xA4, 0xA5 and 0xAA. The first
three make sense - they correspond to &, <, and >, respectively. The
latter three don't make as much sense to me.

Also, it seems to me that it ought to fail more gracefully than it
does - you wouldn't expect your browser to ignore all HTML after an
invalid character is encountered, and it seems to me that loadHTML()
shouldn't, either.

Per your suggestion, I've filed a bug report on libxml2 here:

http://bugzilla.gnome.org/show_activity.cgi?id=567885

Not sure if that's the appropriate bug tracker, though. Also, it would
seem prudent to reproduce the bug in the language libxml2 is intended
as a library for, but alas, I don't have any C/C++ compilers on this
computer.


Previous Comments:
------------------------------------------------------------------------

[2009-01-15 02:53:45] typoon at gmail dot com

The explanation for this might be the fact that ISO-8859-7 does not
have the character 0xAE. When libxml tries to convert it, an error is
thrown because of this.

References:
http://www.itscj.ipsj.or.jp/ISO-IR/227.pdf
http://en.wikipedia.org/wiki/ISO_8859-7

Checking the PDF, you will see 0xAE is not assigned.
Quoting Wikipedia:

"Code values 00-1F, 7F, 80-9F, AE, D2 and FF are not assigned to
characters by ISO/IEC 8859-7."

More information and other references can also be found on Google.

My 2 cents, then, are that this is not a bug at all. If you still think
it is, then we might need to open a bug report for the libxml team, as
this is an error generated inside libxml, not PHP.

Regards,

Henrique

------------------------------------------------------------------------

[2009-01-14 20:08:27] terrafr...@php.net

Description:
------------
All HTML after chr(0xAE) (if present) is ignored by DOMDocument's
loadHTML(), even if chr(0xAE) is a valid character per the HTML's
charset.
In the Reproduce code, replace chr(0xAE) with chr(0xAF) or chr(0xAD),
or just remove it altogether, and it works.

Further, if you echo out $str and copy / paste the HTML into
validator.w3.org, it's valid HTML, even with the chr(0xAE).

Reproduce code:
---------------
<?php
$str = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-7">
<title>test</title>
</head>
<body><p>aaaaa' . chr(0xAE) . 'zzzzz</p></body>
</html>';

$xml = new DOMDocument();
$xml->loadHTML($str);

echo $xml->saveHTML();

Expected result:
----------------
aaaaa�zzzzz

Actual result:
--------------
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlCheckEncoding: encoder error in Entity, line: 4 in C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in C:\htdocs\test.php on line 14

aaaaa

------------------------------------------------------------------------
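
For anyone who wants to repeat the byte-by-byte check mentioned in the new comment, here is a minimal sketch of how such a loop might look. It is not the reporter's actual script: the helper name byteSurvivesLoadHtml() is made up for illustration, the test page is a shortened version of the Reproduce code, and the scan is limited to 0x80-0xFF so that markup characters such as & and < don't muddy the result. Which bytes get flagged will likely depend on the libxml2 version PHP was built against, so no expected output is given.

```php
<?php
// Hypothetical helper (not from the original report): returns true if the
// text that follows $byte is still present after loadHTML()/saveHTML().
function byteSurvivesLoadHtml($byte, $charset = 'iso-8859-7')
{
    $html = '<html><head>'
          . '<meta http-equiv="content-type" content="text/html; charset=' . $charset . '">'
          . '<title>test</title></head>'
          . '<body><p>aaaaa' . chr($byte) . 'zzzzz</p></body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $out = $doc->saveHTML();

    // If the parser gave up at $byte, the trailing "zzzzz" is missing.
    return $out !== false && strpos($out, 'zzzzz') !== false;
}

// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$problemBytes = array();
for ($byte = 0x80; $byte <= 0xFF; $byte++) {
    if (!byteSurvivesLoadHtml($byte)) {
        $problemBytes[] = sprintf('0x%02X', $byte);
    }
    libxml_clear_errors(); // discard warnings from this iteration
}

// Restore whatever error handling was in effect before.
libxml_use_internal_errors($previous);

echo $problemBytes ? implode(', ', $problemBytes) : 'none', "\n";
```

Passing a different charset as the second argument (for example 'iso-8859-1') should show whether the flagged bytes are specific to ISO-8859-7.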