Hi Lucas,

On Sat, Oct 01, 2016 at 10:45:00AM +0200, Lucas Nussbaum wrote:
> During a rebuild of all packages in sid, your package failed to build
> on amd64. [...]
>> expected testauto/output.html to contain 'expand(char(ø))'

Thanks for reporting this. After a bit of digging, I've found that it's
caused by a non-backwards-compatible change in libtidy (which rawdog
uses via the python-libtidy bindings): in libtidy 0.99, the input and
output encodings defaulted to ASCII, whereas libtidy 5 defaults them to
UTF-8. The result is that libtidy takes the HTML that rawdog has already
converted to ASCII, and expands the character references into UTF-8
characters.
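
To illustrate the difference, here is a rough standard-library sketch in
Python 3. It is not rawdog or libtidy code, and the two helper names are made
up for this example; it only mimics the ASCII-with-references form that rawdog
produces and the expansion that libtidy 5 appears to perform by default:

```python
import html


def to_ascii_with_refs(text):
    """Encode text as ASCII, writing non-ASCII characters as numeric references."""
    return text.encode("ascii", "xmlcharrefreplace")


def expand_refs_to_utf8(ascii_html):
    """Expand character references and encode the result as UTF-8."""
    return html.unescape(ascii_html.decode("ascii")).encode("utf-8")


# Hypothetical stand-in for the HTML that rawdog hands to libtidy.
ascii_html = to_ascii_with_refs("\u00c8")

print(repr(ascii_html))                       # b'&#200;'
print(repr(expand_refs_to_utf8(ascii_html)))  # b'\xc3\x88'
```

The first form is what rawdog's test expects to see in its output; the second
matches the bytes that show up in the sid output.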

On jessie, with libtidy-0.99.0 20091223cvs-1.4+deb8u1:
$ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;", 
{"numeric_entities": 1, "output_html": 1})[0])'
'<html>\n  <head>\n    <title></title>\n  </head>\n  <body>\n    &#200;\n  
</body>\n</html>\n'

On sid, with libtidy5 5.2.0-2:
$ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;", 
{"numeric_entities": 1, "output_html": 1})[0])'
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html>\n  <head>\n    
<title></title>\n  </head>\n  <body>\n    \xc3\x88\n  </body>\n</html>\n'

Specifying the input and output encodings explicitly as ASCII fixes
this. I've made the following change in upstream rawdog, and it'll be
fixed in rawdog 2.22:

diff --git a/rawdoglib/rawdog.py b/rawdoglib/rawdog.py
index d1d4e4c..8a6702a 100644
--- a/rawdoglib/rawdog.py
+++ b/rawdoglib/rawdog.py
@@ -136,6 +136,8 @@ def sanitise_html(html, baseurl, inline, config):
        if config["tidyhtml"]:
                args = {
                        "numeric_entities": 1,
+                       "input_encoding": "ascii",
+                       "output_encoding": "ascii",
                        "output_html": 1,
                        "output_xhtml": 0,
                        "output_xml": 0,

Cheers,

-- 
Adam Sampson <a...@offog.org>                         <http://offog.org/>
