ID:               47108
 User updated by:  terrafr...@php.net
 Reported By:      terrafr...@php.net
 Status:           Open
 Bug Type:         DOM XML related
 Operating System: Windows XP
 PHP Version:      5.2.8
 New Comment:

That makes sense.  I updated the script to iterate through candidate
characters to find the problem ones, and the ones you mentioned are
included.  Other problem characters include 0x26, 0x3C, 0x3E, 0xA4, 0xA5
and 0xAA.  The first three make sense - they correspond to &, <, and >,
respectively.  The latter three don't make as much sense to me.
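
For anyone wanting to repeat that iteration, it could look roughly like
the sketch below.  This is not the script actually used for this report:
the helper name findProblemBytes() and the choice to detect a problem by
checking libxml's collected errors are assumptions, and the bytes it
reports will depend on the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: returns the byte values (as ints) for which
// DOMDocument::loadHTML() reports at least one libxml error.
function findProblemBytes(string $charset, int $from = 0x20, int $to = 0xFF): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $problems = [];

    for ($byte = $from; $byte <= $to; $byte++) {
        $html = '<html><head><meta http-equiv="content-type" '
              . 'content="text/html; charset=' . $charset . '">'
              . '<title>test</title></head>'
              . '<body><p>aaaaa' . chr($byte) . 'zzzzz</p></body></html>';

        libxml_clear_errors();
        $doc = new DOMDocument();
        $doc->loadHTML($html);

        if (count(libxml_get_errors()) > 0) {
            $problems[] = $byte;
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $problems;
}

foreach (findProblemBytes('iso-8859-7') as $byte) {
    printf("0x%02X\n", $byte);
}
```

Checking for any libxml error is a deliberately broad test, so it may
also flag bytes such as 0x26 and 0x3C that are problems for markup
reasons rather than for charset reasons.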

Also, it seems to me that it ought to fail more gracefully than it
does.  You wouldn't expect your browser to ignore all HTML after an
invalid character is encountered, and loadHTML() shouldn't either.
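
Until the parser fails more gracefully, one possible stopgap is to
replace the offending bytes before calling loadHTML().  The sketch below
is only a workaround under stated assumptions: the helper name
replaceUnassignedBytes() is hypothetical, it only applies when the
document really is ISO-8859-7, and the byte list is taken from the
unassigned values quoted further down in this thread (0x80-0x9F, 0xAE,
0xD2, 0xFF).  It leaves 0x00-0x1F and 0x7F alone so that newlines and
tabs survive.

```php
<?php
// Hypothetical workaround: swap bytes that ISO-8859-7 leaves unassigned
// for a numeric character reference to U+FFFD (replacement character).
function replaceUnassignedBytes(string $html): string
{
    // No /u modifier, so the pattern matches individual bytes.
    $result = preg_replace('/[\x80-\x9F\xAE\xD2\xFF]/', '&#65533;', $html);

    // preg_replace() returns null on failure; fall back to the input.
    return $result ?? $html;
}

$fragment = 'aaaaa' . chr(0xAE) . 'zzzzz';
echo replaceUnassignedBytes($fragment), "\n"; // aaaaa&#65533;zzzzz
```

The result could then be passed to loadHTML() in place of the original
string.  This does not address 0xA4, 0xA5 or 0xAA, which are assigned in
the current ISO-8859-7 table.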

Per your suggestion, I've filed a bug report on libxml2 here:

http://bugzilla.gnome.org/show_activity.cgi?id=567885

Not sure if that's the appropriate bug tracker, though.  Also, it would
be prudent to reproduce the bug in C, the language libxml2 is intended
to be used from, but alas, I don't have any C/C++ compilers on this
computer.


Previous Comments:
------------------------------------------------------------------------

[2009-01-15 02:53:45] typoon at gmail dot com

The explanation for this might be that ISO-8859-7 does not assign a
character to 0xAE.  When libxml tries to convert that byte, it throws an
error.
References:
http://www.itscj.ipsj.or.jp/ISO-IR/227.pdf
http://en.wikipedia.org/wiki/ISO_8859-7

Checking the PDF you will see 0xAE is not assigned.
Quoting Wikipedia:
"Code values 00–1F, 7F, 80–9F, AE, D2 and FF are not assigned to
characters by ISO/IEC 8859-7."
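
The quoted list can be turned into a quick check.  The following is a
minimal sketch, not anything libxml does internally: the function name
isAssignedInIso88597() is made up for illustration, and the unassigned
values are exactly those in the quote above.

```php
<?php
// Sketch: true if ISO/IEC 8859-7 assigns a character to the given byte,
// using only the unassigned values listed in the Wikipedia quote.
function isAssignedInIso88597(int $byte): bool
{
    if ($byte < 0x00 || $byte > 0xFF) {
        throw new InvalidArgumentException('Expected a byte value (0x00-0xFF).');
    }

    $unassigned = ($byte <= 0x1F)
        || $byte === 0x7F
        || ($byte >= 0x80 && $byte <= 0x9F)
        || in_array($byte, [0xAE, 0xD2, 0xFF], true);

    return !$unassigned;
}

foreach ([0xAD, 0xAE, 0xAF] as $byte) {
    printf("0x%02X: %s\n", $byte, isAssignedInIso88597($byte) ? 'assigned' : 'not assigned');
}
```

This reports 0xAD and 0xAF as assigned and 0xAE as not assigned, which
matches the behaviour in the original report.  The list reflects the
current (2003) edition of the standard; 0xA4, 0xA5 and 0xAA were only
added in that edition, so a converter built from an older table might
reject them as well, though that is a guess and not something confirmed
here.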

More information and other references can also be found on Google.
My 2 cents, then, are that this is not a bug at all.
If you still think it is, then we might need to open a bug report with
the libxml team, as this error is generated inside libxml, not PHP.

Regards,

Henrique

------------------------------------------------------------------------

[2009-01-14 20:08:27] terrafr...@php.net

Description:
------------
All HTML after chr(0xAE) (if present) is ignored by DOMDocument's
loadHTML(), even if chr(0xAE) is a valid character per the HTML's
charset.  In the Reproduce code, replace chr(0xAE) with chr(0xAF) or
chr(0xAD) or just remove it altogether, and it works.  Further, if you
echo out $str and copy / paste the HTML into validator.w3.org, it's
valid HTML, even with the chr(0xAE).

Reproduce code:
---------------
<?php
$str = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=iso-8859-7">
<title>test</title>
</head>
<body><p>aaaaa' . chr(0xAE) . 'zzzzz</p></body>
</html>';

$xml = new DOMDocument();
$xml->loadHTML($str);
echo $xml->saveHTML();

Expected result:
----------------
aaaaa&#65533;zzzzz

Actual result:
--------------
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]:
htmlCheckEncoding: encoder error in Entity, line: 4 in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

aaaaa


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=47108&edit=1
