Package: tidy
Version: 20091223cvs-1
Severity: normal

"tidy -asxhtml -utf8 --add-xml-decl yes" emits an XML declaration that doesn't specify the encoding. The consequence is that an XML processor reading the output cannot reliably determine its encoding from the declaration. For instance, libxml2 will assume that the output encoding should be US-ASCII (though it will be able to read UTF-8 sequences as required), so that
  echo é | tidy -asxhtml -utf8 --add-xml-decl yes | xmllint -

gives:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
&#xE9;
</body>
</html>

See the "é" that has been written as a character reference due to the absence of declared encoding.

Note that the behavior of xmllint won't change:

  https://bugzilla.gnome.org/show_bug.cgi?id=350208

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/8 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages tidy depends on:
ii  libc6           2.11.2-7       Embedded GNU C Library: Shared lib
ii  libtidy-0.99-0  20091223cvs-1  HTML syntax checker and reformatte

tidy recommends no packages.

Versions of packages tidy suggests:
ii  tidy-doc        20091223cvs-1  HTML syntax checker and reformatte

-- no debconf information
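
Until tidy writes the encoding itself, a possible workaround is to patch the declaration before the output reaches the XML processor. The following is only a minimal sketch, not anything tidy provides: the helper name add_xml_encoding is hypothetical, and it assumes that tidy puts the declaration on the first line in exactly the form <?xml version="1.0"?> and that the stream really is UTF-8.

```shell
# Hypothetical helper (not part of tidy): add encoding="UTF-8" to a bare
# XML declaration found on the first line of stdin. Any other input is
# passed through unchanged.
add_xml_encoding() {
    sed '1s/^<?xml version="1.0"?>/<?xml version="1.0" encoding="UTF-8"?>/'
}

# Intended use, assuming tidy and xmllint are installed:
#   echo é | tidy -asxhtml -utf8 --add-xml-decl yes | add_xml_encoding | xmllint -
```

With the encoding declared, xmllint should be able to keep the "é" as a literal UTF-8 character instead of falling back to a character reference.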