From:             cataphract
Operating system: Irrelevant
PHP version:      trunk-SVN-2010-09-16 (SVN)
Package:          *General Issues
Bug Type:         Feature/Change Request
Bug description:  htmlspecialchars/htmlentities stripping invalid characters

Description:
------------
htmlspecialchars() and htmlentities() are commonly used to convert
user-supplied text into text that's safe to output in an HTML or XML
document.

However, they are insufficient for this purpose, because they do not
strip characters that are invalid in XML or XHTML.

In HTML, this results in an invalid document.

In XML, the result is worse because one will end up with malformed XML.
Therefore, sanitization with htmlspecialchars can result in corrupted data.
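
To make the problem concrete, here is a minimal sketch using the same
sample string as the test script. It only shows that the markup character
is escaped while the control character is passed through unchanged; it
assumes nothing beyond the standard htmlspecialchars() signature.

```php
<?php
// Same sample data as the test script: "<" followed by U+001F (unit separator).
$data = "My data: <\x1F";

$escaped = htmlspecialchars($data, ENT_NOQUOTES, "UTF-8");

// "<" has been converted to "&lt;" ...
var_dump(strpos($escaped, "&lt;") !== false);   // bool(true)

// ... but the byte 0x1F is still present in the escaped output.
var_dump(strpos($escaped, "\x1F") !== false);   // bool(true)
```

Since U+001F is not matched by the XML Char production, a document built
from $escaped is not well-formed XML.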



Additionally, when passed $double_encode == false, existing character
references are left untouched, so invalid ones (i.e. those which refer to
invalid characters) should also be stripped out.
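
As a rough userland illustration of what stripping invalid character
references could look like, consider the sketch below. The helper names
is_xml_char() and strip_invalid_char_refs() are hypothetical and not part
of PHP. The pattern follows the XML CharRef production (lowercase "x"
only), and named entities are deliberately left alone.

```php
<?php
// Hypothetical helper: true if $cp is matched by the XML 1.0 "Char" production.
function is_xml_char($cp)
{
    return $cp == 0x9 || $cp == 0xA || $cp == 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical helper: drop numeric character references (&#N; or &#xH;)
// whose code point is not a valid XML character.
function strip_invalid_char_refs($text)
{
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        function ($m) {
            // Group 1 is set for hexadecimal references, group 2 for decimal ones.
            $cp = ($m[1] !== '') ? hexdec($m[1]) : (float) $m[2];
            return is_xml_char($cp) ? $m[0] : '';
        },
        $text
    );
}

$input = "ok: &#x41; bad: &#x1F; bad: &#0; named: &amp;";
echo strip_invalid_char_refs($input), "\n";
// ok: &#x41; bad:  bad:  named: &amp;
```

Running this before htmlspecialchars(..., false) would keep the valid
references and remove the ones that can never be valid in XML.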



See

* http://www.w3.org/TR/REC-xml/#NT-Char

* http://www.w3.org/TR/REC-xml/#NT-CharRef

Test script:
---------------
<?php

// Select the document type from the query string (?mode=xhtml or ?mode=html).
$mode = isset($_GET["mode"]) ? $_GET["mode"] : "";

if ($mode == "xhtml") {
    header("Content-type: application/xhtml+xml; charset=utf-8");
    $templ = <<<XML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test</title>
<meta http-equiv="Content-type" content="application/xhtml+xml; charset=utf-8" />
</head>
<body>
%s
</body>
</html>
XML;
}
elseif ($mode == "html") {
    header("Content-type: text/html; charset=utf-8");
    $templ = <<<HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Test</title>
</head>
<body>
%s
</body>
</html>
HTML;
}
else {
    die("bad mode");
}

// "<" gets escaped, but the control character U+001F passes through unchanged.
$data = "My data: <\x1F";

printf($templ, htmlentities($data, ENT_NOQUOTES, "UTF-8"));

Expected result:
----------------
At minimum, this should be documented in the manual pages for
htmlspecialchars and htmlentities.

A better solution would be to change those two functions to strip
characters outside the range allowed by the XML Char production:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
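
Until something like this exists in the functions themselves, the range
above can be approximated in userland. The following is only a sketch:
strip_invalid_xml_chars() is a hypothetical name, and the input is assumed
to be valid UTF-8 (preg_replace() returns null for malformed input, and
that null is simply handed back to the caller).

```php
<?php
// Hypothetical helper: remove every character that is not matched by the
// XML 1.0 "Char" production. Assumes $utf8 is valid UTF-8.
function strip_invalid_xml_chars($utf8)
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
}

// Strip first, then escape, using the sample data from the test script.
$data  = "My data: <\x1F";
$clean = htmlspecialchars(strip_invalid_xml_chars($data), ENT_NOQUOTES, "UTF-8");

var_dump($clean === "My data: &lt;");   // bool(true)
```

Stripping before escaping keeps the two concerns separate, so the same
helper works with both htmlspecialchars() and htmlentities().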



Another alternative, which wouldn't break BC, would be to add another
function or another flag to htmlentities/htmlspecialchars (in addition to
ENT_NOQUOTES/ENT_QUOTES/ENT_COMPAT) that would strip out these characters,
possibly plus those that authors are "encouraged to avoid":

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
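
The list above is regular enough to be generated rather than typed out: the
last two code points of planes 1 to 16, plus three fixed ranges. A possible
userland sketch follows; both function names are hypothetical, and the
input is again assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: build a PCRE character class for the characters that
// XML authors are "encouraged to avoid".
function discouraged_xml_char_class()
{
    $ranges = array('\x{7F}-\x{84}', '\x{86}-\x{9F}', '\x{FDD0}-\x{FDEF}');

    // The last two code points of each of the planes 1 through 16.
    for ($plane = 1; $plane <= 0x10; $plane++) {
        $base     = $plane << 16;
        $ranges[] = sprintf('\x{%X}-\x{%X}', $base | 0xFFFE, $base | 0xFFFF);
    }

    return '[' . implode('', $ranges) . ']';
}

// Hypothetical helper: remove the discouraged characters from a UTF-8 string.
function strip_discouraged_xml_chars($utf8)
{
    return preg_replace('/' . discouraged_xml_char_class() . '/u', '', $utf8);
}

// U+0080 (bytes C2 80) is in the list; U+0085 (bytes C2 85) is not.
var_dump(strip_discouraged_xml_chars("a\xC2\x80b\xC2\x85c") === "ab\xC2\x85c");   // bool(true)
```

Keeping this separate from the stripping of invalid characters mirrors the
proposal: the discouraged set is optional, the invalid set is not.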





Actual result:
--------------
The W3C validator gives an error:



You have used an illegal character in your text. HTML uses the standard
UNICODE Consortium character repertoire, and it leaves undefined (among
others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive)
that are sometimes used for typographical quote marks and similar in
proprietary character sets. The validator has found one of these undefined
characters in your document. The character may appear on your browser as a
curly quote, or a trademark symbol, or some other fancy glyph; on a
different computer, however, it will likely appear as a completely
different character, or nothing at all.





-- 
Edit bug report at http://bugs.php.net/bug.php?id=52860&edit=1