From:             [EMAIL PROTECTED]
Operating system: All
PHP version:      4.3.0RC3
PHP Bug Type:     Scripting Engine problem
Bug description:  htmlspecialchars() misbehaviour

htmlspecialchars() handles the '&' character incorrectly: it does not check
whether the '&' is already part of an entity. This gives very "funny" results
when the function is called several times on the same string. For example:

echo htmlspecialchars(htmlspecialchars(htmlspecialchars(htmlspecialchars(htmlspecialchars('text & text')))));

will produce:
text &amp;amp;amp;amp;amp; text
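
The growth is easier to see one pass at a time. The following is only an illustrative sketch; the loop and the variable names are not part of the original report:

```php
<?php
// Apply htmlspecialchars() repeatedly; every pass re-escapes the '&'
// that the previous pass produced.
$s = 'text & text';
for ($i = 1; $i <= 5; $i++) {
    $s = htmlspecialchars($s);
    echo $i, ': ', $s, "\n";
}
// 1: text &amp; text
// 2: text &amp;amp; text
// ...
// 5: text &amp;amp;amp;amp;amp; text
```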

The most correct behaviour would be to check whether the '&' is followed by a
valid entity as described in the HTML specification. However, that can be quite
hard to do, because there are a lot of entities. So another way is also
possible (faster, but dirtier): just check whether the '&' starts something
that looks like an entity. Here are two regular expressions that implement
correct '&' handling:

1. This is the correct way to handle entities:
preg_replace('/\&(?!((#\d{1,5})|(#(x|X)[\dA-Fa-f]{1,4})|[aA]acute|[aA]circ|acute|(ae|AE)lig|
[aA]grave|alefsym|[aA]lpha|amp|an[dg]|[aA]ring|asymp|[aA]tilde|[aA]uml|
bdquo|[bB]eta|brvbar|bull|cap|[cC]cedil|cedil|cent|[cC]hi|circ|clubs|cong|
copy|crarr|cup|curren|[dD]agger|d[aA]rr|deg|[dD]elta|diams|divide|[eE]acute|
[eE]circ|[eE]grave|empty|e[mn]sp|[eE]psilon|equiv|[eE]ta|eth|ETH|[eE]uml|
euro|exist|fnof|forall|frac1[24]|frac34|frasl|[gG]amma|g[et]|h[aA]rr|hearts|
hellip|[iI]acute|[iI]circ|iexcl|[iI]grave|image|infin|int|[iI]ota|iquest|
isin|[iI]uml|[kK]appa|[lL]ambda|lang|laquo|l[aA]rr|lceil|ldquo|le|lfloor|
lowast|loz|lrm|lsa?quo|lt|macr|mdash|micro|middot|minus|[mM]u|nabla|nbsp|
ndash|n[ei]|not(in)?|nsub|[nN]tilde|[nN]u|[oO]acute|[oO]circ|(oe|OE)lig|
[oO]grave|oline|[oO]mega|[oO]micron|oplus|or|ord[fm]|[oO]slash|[oO]tilde|
otimes|[oO]uml|par[at]|permil|perp|[pP]hi|[pP]i|piv|plusmn|pound|[pP]rime|
pro[dp]|[pP]si|quot|radic|rang|raquo|r[aA]rr|rceil|rdquo|real|reg|rfloor|
[rR]ho|rlm|rsaquo|rsquo|sbquo|[sS]caron|sdot|sect|shy|[sS]igma|sigmaf|sim|
spades|sube?|sum|sup[123e]?|szlig|[tT]au|there4|[tT]heta|thetasym|thinsp|
thorn|THORN|tilde|times|trade|[uU]acute|u[aA]rr|[uU]circ|[uU]grave|uml|
upsih|[uU]psilon|[uU]uml|weierp|[xX]i|[yY]acute|yen|[yY]uml|[zZ]eta|zwn?j);)/','&amp;',$str);


2. This is less correct, but still a better way to handle them:
preg_replace('/&(?!(([A-Za-z_:][A-Za-z0-9\.\-_:]*)|(#\d+)|(#(x|X)[\dA-Fa-f]+));)/','&amp;',$str);
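
For the first, stricter variant, the list of names does not have to be maintained by hand. As a rough sketch, the alternation could be generated from PHP's own entity table via get_html_translation_table(). The helper name strict_amp_pattern() is hypothetical, and which entities the table contains depends on the PHP version, so this is an assumption-laden illustration rather than a drop-in replacement:

```php
<?php
// Hypothetical helper: builds the strict pattern from PHP's entity table
// instead of a hand-written alternation.
function strict_amp_pattern()
{
    $names = array();
    foreach (get_html_translation_table(HTML_ENTITIES) as $entity) {
        // Keep named references such as "&copy;"; skip numeric ones such as "&#039;".
        if (preg_match('/^&([A-Za-z][A-Za-z0-9]*);$/', $entity, $m)) {
            $names[] = preg_quote($m[1], '/');
        }
    }
    // Same numeric limits as in the first regexp of the report.
    return '/&(?!(?:#\d{1,5}|#[xX][\dA-Fa-f]{1,4}|' . implode('|', $names) . ');)/';
}

$pattern = strict_amp_pattern();
// A bare '&' and an unknown name are escaped; a known entity is left alone.
echo preg_replace($pattern, '&amp;', 'a & b &copy; &nosuch;'), "\n";
// prints: a &amp; b &copy; &amp;nosuch;
```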


The good thing about the second regexp is that, if htmlspecialchars() were
implemented this way, it could also be used to handle XML entities.
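
To illustrate the idea, the second regexp could be wrapped in a small userland function. This is a minimal sketch, not the proposed engine change: the name escape_once() is hypothetical, the regexp is a condensed form of the one above, and only '<', '>' and '"' are handled besides '&':

```php
<?php
// Hypothetical helper, not part of PHP: escapes like htmlspecialchars(),
// but leaves '&' alone when it already starts something shaped like an entity.
function escape_once($str)
{
    // Escape only ampersands that do not begin a named, decimal or hex reference.
    $str = preg_replace(
        '/&(?!(?:[A-Za-z_:][A-Za-z0-9._:-]*|#\d+|#[xX][\dA-Fa-f]+);)/',
        '&amp;',
        $str
    );
    // Escape the remaining special characters; the replacements themselves
    // are entity-shaped, so a later pass will not touch them.
    return str_replace(array('<', '>', '"'), array('&lt;', '&gt;', '&quot;'), $str);
}

$once  = escape_once('a & b &lt; <c> &#38; &bogus');
$twice = escape_once($once);
echo $once, "\n";
// prints: a &amp; b &lt; &lt;c&gt; &#38; &amp;bogus
```

With this approach a second call returns its input unchanged, so $twice equals $once. The trade-off is the one already noted above: anything that merely looks like an entity, such as '&foo;', is passed through unescaped.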
-- 
Edit bug report at http://bugs.php.net/?id=21027&edit=1
