* Thomas Koch <tho...@koch.ro>, 2012-05-06, 20:56:
Tesseract tends to produce non-UTF-8 characters from time to time. I have only tried German (deu) so far. Even if this is an error in Tesseract, it would be good if ocrodjvu could continue working.

First exception, without the --html5 option:

Traceback (most recent call last):
[snip]
  File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
    if node.text:
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte

And with the --html5 option:

Traceback (most recent call last):
[snip]
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I checked that the corresponding HTML files did indeed contain non-UTF-8 characters.
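
For illustration, here is a rough sketch of how such a check could be done, and of one way the input might be cleaned up before it reaches lxml. This is not ocrodjvu code: the helper names find_invalid_utf8 and sanitize_hocr are made up for this example, and it is written for Python 3 rather than the Python 2.7 shown in the tracebacks. It assumes that replacing undecodable bytes with U+FFFD and dropping control characters is acceptable for OCR output.

```python
import re

# Characters that XML 1.0 does not allow: C0 controls other than tab, LF
# and CR, plus the noncharacters U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def find_invalid_utf8(data):
    """Return the byte offsets at which UTF-8 decoding of `data` fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the rest of the input is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice
            offsets.append(pos + exc.start)
            pos += exc.end  # resume after the offending bytes
    return offsets


def sanitize_hocr(data):
    """Decode `data` leniently and drop characters XML cannot hold."""
    # Hypothetical workaround: undecodable bytes become U+FFFD.
    text = data.decode('utf-8', errors='replace')
    return _XML_ILLEGAL.sub('', text)
```

With something along these lines, find_invalid_utf8(open(path, 'rb').read()) would list the offending positions in a Tesseract output file, and the result of sanitize_hocr() should be a string that lxml accepts as element text.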

Could you attach the HTML files to the bug report (or, alternatively, send them to me in a private mail)?

--
Jakub Wilk


