William Heymann <[EMAIL PROTECTED]> wrote:
> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
>
> Things like &#21344;
>
Try something like this:
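
import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module's own table is left untouched.
name2codepoint = name2codepoint.copy()
# &apos; is defined by XML but not by HTML 4, so add it by hand.
name2codepoint['apos'] = ord("'")

# group(1): decimal reference, group(2): hex reference, group(3): named entity
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown entity name: leave the text as it was.
        return match.group(0)
    # s is a byte string; decode it first so the result is unicode.
    return EntityPattern.sub(unescape, s.decode(encoding))
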
If you really do only want numeric references, you can take out the
lines using name2codepoint and drop the named-entity alternative
(the third group) from the regex.
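
A rough sketch of what that numeric-only version might look like. The
names NumericPattern and decodeNumericEntities are made up for this
example, and the unichr/chr shim and the out-of-range handling are my own
additions so that it also runs on Python 3; treat it as a starting point
rather than a finished implementation.

```python
import re

try:
    unichr            # Python 2, as in the code above
except NameError:
    unichr = chr      # Python 3 has no unichr; chr does the same job

# Decimal (&#21344;) or hexadecimal (&#x5360;) references only.
NumericPattern = re.compile(r'&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));')

def decodeNumericEntities(s, encoding='utf-8'):
    def unescape(match):
        dec, hexa = match.groups()
        try:
            if dec:
                return unichr(int(dec, 10))
            return unichr(int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point out of range for this build: leave the text alone.
            return match.group(0)
    # Accept either a byte string or text that is already decoded.
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return NumericPattern.sub(unescape, s)
```

Named entities such as &amp; are deliberately left untouched here, since
nothing looks them up any more.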