Hi Lucas,

On Sat, Oct 01, 2016 at 10:45:00AM +0200, Lucas Nussbaum wrote:
> During a rebuild of all packages in sid, your package failed to build
> on amd64.
[...]
>> expected testauto/output.html to contain 'expand(char(ø))'

Thanks for reporting this. After a bit of digging, I've found that it's
caused by a non-backwards-compatible change in libtidy (which rawdog
uses via the python-libtidy bindings): in libtidy 0.99, the input and
output encodings defaulted to ASCII, whereas libtidy 5 defaults them to
UTF-8. The result is that libtidy takes the HTML that rawdog has already
converted to ASCII, and expands the character references into UTF-8
characters.

On jessie, with libtidy-0.99.0 20091223cvs-1.4+deb8u1:

  $ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;", {"numeric_entities": 1, "output_html": 1})[0])'
  '<html>\n <head>\n <title></title>\n </head>\n <body>\n &#200;\n </body>\n</html>\n'

On sid, with libtidy5 5.2.0-2:

  $ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;", {"numeric_entities": 1, "output_html": 1})[0])'
  '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html>\n <head>\n <title></title>\n </head>\n <body>\n \xc3\x88\n </body>\n</html>\n'

Specifying the input and output encodings explicitly as ASCII fixes
this. I've made the following change in upstream rawdog, and it'll be
fixed in rawdog 2.22:

diff --git a/rawdoglib/rawdog.py b/rawdoglib/rawdog.py
index d1d4e4c..8a6702a 100644
--- a/rawdoglib/rawdog.py
+++ b/rawdoglib/rawdog.py
@@ -136,6 +136,8 @@ def sanitise_html(html, baseurl, inline, config):
     if config["tidyhtml"]:
         args = {
             "numeric_entities": 1,
+            "input_encoding": "ascii",
+            "output_encoding": "ascii",
             "output_html": 1,
             "output_xhtml": 0,
             "output_xml": 0,

Cheers,

-- 
Adam Sampson <a...@offog.org> <http://offog.org/>
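
For illustration only: the difference between the two outputs above can
be reproduced without libtidy, using just the Python 3 standard library.
This is a rough sketch, not rawdog's code; the helper names
to_ascii_html and expand_to_utf8 are made up for this example, and the
second one only mimics what an ASCII-in, UTF-8-out tidy pass appears to
do to a character reference.

```python
import html


def to_ascii_html(text):
    """Encode text as ASCII bytes, turning non-ASCII characters into
    numeric character references (hypothetical helper)."""
    return text.encode("ascii", "xmlcharrefreplace")


def expand_to_utf8(ascii_html):
    """Expand character references and re-encode as UTF-8, roughly what
    happens when the output encoding defaults to UTF-8 (hypothetical
    helper, not a libtidy call)."""
    return html.unescape(ascii_html.decode("ascii")).encode("utf-8")


if __name__ == "__main__":
    ascii_form = to_ascii_html("\u00c8")      # U+00C8, capital E with grave
    print(repr(ascii_form))                   # b'&#200;'
    print(repr(expand_to_utf8(ascii_form)))   # b'\xc3\x88'
```

The first value is pure ASCII; the second is the two-byte UTF-8 sequence
that a check for ASCII-only output would trip over.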