Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

Jakub Wilk Sun, 06 May 2012 13:54:15 -0700

* Thomas Koch <tho...@koch.ro>, 2012-05-06, 20:56:

Tesseract tends to produce non-UTF-8 characters from time to time. I tried only German (deu) so far. Even if that seems to be an error in Tesseract, it would be good if ocrodjvu could continue working.
First exception, without the --html5 option:

Traceback (most recent call last):

[snip]

 File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
   if node.text:
 File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ 
(src/lxml/lxml.etree.c:37022)
 File "apihelpers.pxi", line 691, in lxml.etree._collectText 
(src/lxml/lxml.etree.c:16626)
 File "apihelpers.pxi", line 1344, in lxml.etree.funicode 
(src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid 
start byte

And with the --html5 option:

Traceback (most recent call last):

[snip]

 File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, 
in insertText
   self._element.text += data
 File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ 
(src/lxml/lxml.etree.c:37110)
 File "apihelpers.pxi", line 721, in lxml.etree._setNodeText 
(src/lxml/lxml.etree.c:16855)
 File "apihelpers.pxi", line 1366, in lxml.etree._utf8 
(src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes 
or control characters

I checked that the corresponding HTML files did indeed contain non-UTF-8 
characters.
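
For illustration only, here is a minimal Python sketch of how OCR output with stray bytes could be decoded without raising either of the exceptions above. It is not taken from ocrodjvu; the helper name sanitize_ocr_output is hypothetical, and replacing bad input with U+FFFD is an assumption about what "continue working" might mean. It substitutes undecodable bytes (such as the 0xab from the first traceback) and the control characters that lxml rejects in the second one.

```python
import re

# C0 control characters other than tab, LF and CR; lxml refuses these
# (and NUL bytes) when setting element text.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_ocr_output(raw: bytes) -> str:
    """Hypothetical helper: decode OCR output without raising.

    Undecodable bytes and XML-incompatible control characters are both
    replaced with U+FFFD (the Unicode replacement character).
    """
    # errors="replace" turns invalid UTF-8 sequences into U+FFFD
    # instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # Replace remaining control characters so the text can be stored
    # in an XML tree.
    return _XML_ILLEGAL.sub("\ufffd", text)


if __name__ == "__main__":
    # Byte 0xab at position 1 is not a valid UTF-8 start byte.
    print(repr(sanitize_ocr_output(b"a\xabb")))
    # A NUL byte decodes fine but is not XML compatible.
    print(repr(sanitize_ocr_output(b"x\x00y")))
```

Valid UTF-8 input passes through unchanged, so a step like this would only affect pages where the OCR engine emitted broken bytes.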

Could you attach the HTML files to the bug report (or, alternatively, send them to me in a private mail)?


--
Jakub Wilk



