* Thomas Koch <tho...@koch.ro>, 2012-05-06, 20:56:
Tesseract tends to produce non-UTF-8 characters from time to time. I have
tried only German (deu) so far. Even if this turns out to be a bug in
Tesseract, it would be good if ocrodjvu could continue working.
First exception, without the --html5 option:
Traceback (most recent call last):
[snip]
  File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
    if node.text:
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte
And with the --html5 option:
Traceback (most recent call last):
[snip]
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
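
Both failures come from input that lxml refuses to hold: bytes that are not
valid UTF-8 in the first case, and characters that XML does not allow in the
second. One possible way to keep going is to clean the OCR output before it
reaches the parser. The following is only a rough sketch under stated
assumptions, not ocrodjvu's actual code: the name sanitize_hocr is made up for
illustration, the input is assumed to be the raw bytes of the hOCR file, and
the code is written for Python 3 although the tracebacks above are from
Python 2.7.

```python
import re

# Characters that XML 1.0 does not allow in text: the C0 control characters
# other than tab, line feed and carriage return, plus U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def sanitize_hocr(raw):
    """Hypothetical helper: decode OCR output leniently and drop
    characters that are not XML compatible."""
    # Invalid UTF-8 sequences become U+FFFD instead of raising
    # UnicodeDecodeError.
    text = raw.decode('utf-8', errors='replace')
    # Remove NUL bytes and other control characters that lxml rejects.
    return _XML_ILLEGAL.sub('', text)
```

With this, a stray 0xab byte would show up as a replacement character in the
recognized text rather than aborting the whole run. Whether replacing or
dropping the damaged characters is preferable is a design choice; replacing
at least keeps the damage visible in the output.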
I checked that the corresponding HTML files did indeed contain non-UTF-8
characters.
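
A check of this kind can be reproduced with a few lines of Python. This is a
minimal sketch, assuming the file contents have already been read as bytes;
the function name find_invalid_utf8 is hypothetical, and this is not
necessarily how the check above was done.

```python
def find_invalid_utf8(data):
    """Hypothetical helper: return (offset, byte) pairs for every byte
    at which UTF-8 decoding of `data` fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode the remainder; success means no more bad bytes.
            data[pos:].decode('utf-8')
            break
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so add the slice offset.
            offset = pos + exc.start
            bad.append((offset, data[offset]))
            # Resume right after the offending byte. Re-slicing is simple
            # but quadratic if the file has very many bad bytes.
            pos = offset + 1
    return bad
```

For example, find_invalid_utf8(open(path, 'rb').read()) returns an empty list
for a clean file, where path stands for one of the HTML files in question.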
Could you attach the HTML files to the bug report (or, alternatively,
send them to me in a private mail)?
--
Jakub Wilk