Package: ocrodjvu
Version: 0.7.9-1
Severity: normal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi,

I already reported the same problem against gscan2pdf. Tesseract
occasionally produces output that is not valid UTF-8. I have only tried
German (deu) so far. Even though this looks like a bug in Tesseract, it
would be good if ocrodjvu could continue working.

First exception, without the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 434, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 366, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 223, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
    if node.text:
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte

And with the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 416, in extract_text
    doc = html5_support.parse(stream)
  File "/usr/share/ocrodjvu/lib/html5_support.py", line 24, in parse
    namespaceHTMLElements=False
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 38, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 211, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 111, in _parse
    self.mainLoop()
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 174, in mainLoop
    self.phase.processCharacters(token)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 948, in processCharacters
    self.tree.insertText(token["data"])
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/_base.py", line 288, in insertText
    parent.insertText(data)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I checked that the corresponding HTML files did indeed contain bytes
that are not valid UTF-8.

Best regards,

Thomas Koch

- -- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages ocrodjvu depends on:
ii  djvulibre-bin                  3.5.25.2-4
ii  python                         2.7.2-10
ii  python-argparse                1.2.1-2
ii  python-djvu                    0.3.9-1
ii  python2.7 [python-argparse]    2.7.3~rc2-2.1

Versions of packages ocrodjvu recommends:
ii  ocropus        <none>
ii  python-lxml    2.3.2-1
ii  python-pyicu   1.3-1
ii  tesseract-ocr  3.02.01-4

Versions of packages ocrodjvu suggests:
pn  cuneiform  <none>
pn  gocr       <none>
pn  ocrad      <none>

- -- no debconf information

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQIcBAEBCAAGBQJPpslmAAoJEAf8SJEEK6Za6sgQALourAgqH2xqSPjFpvVoYeZ+
k65JAdaIC2rEYoOF4hSt5wt/fIyAWMLZlaZX3pFkroVazfcyqFK2lYbmiT0x+q/9
XDuKqoaxISmCZ7QF1YJqGNnT36s98HW5VP6aupSezETCCyZOqgLd+aXVkwFi7NDW
9ItKvzWySfIU3HzOF1xxjipJYu9/698rb+DBUUd0ilmIdLJx+x3wT+gFBWbA5xJr
24m3kdJrox86zKANBRzlznpSRUZNIGKjJS8Y5M/JDpKlK9NVYOgjRBwZoq0JAyLx
cgM4MYaUDxOw4BcTFCpwi5CjB9DDKcFhaCV2EeUtxur4pDrZs9sAGmDdqenfptap
Dt9fcaW9GFs8mNbnf9cjrOOtL1f9o2CDZBi2MQ5RdnwjIPpRH1jKxOYL0Mq44PUA
E2MqaVdRU1at009luvLVy/PQntzZrualByzcboOEkh7TUjjfQSjBc7k3piKwie1b
q+wRbNu0Ifz5jJqTzKRk2pdNOviJTWuV3LlbMdM0l0pHbpumFs0lunUaulvTS+zy
ttG52UpC2m+6ngJ81v+cbmv3uXF3N8Kp7LwlcyKxgROcKi8T+y30HfUjRUrhna2B
xxaUtaCysSmc84pS3ps29XHv7B6TwQQ9kyV7d/nf28r9CmDMevn6S+Wz8KDFAAeL
EvOww4b8vJ7IOS66394u
=WLq+
-----END PGP SIGNATURE-----
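
A minimal sketch of what "continue working" could look like, assuming the
hOCR output is available as a byte string before it is handed to the
parser. The function name sanitize_hocr is hypothetical and is not part
of ocrodjvu; it only illustrates replacing invalid UTF-8 sequences and
dropping the control characters that XML 1.0 forbids, which are the two
conditions the tracebacks above complain about:

```python
import re

# Characters that XML 1.0 does not allow in text: C0 controls other than
# tab, newline and carriage return, plus U+FFFE and U+FFFF.
_XML_INVALID = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def sanitize_hocr(raw):
    """Return text that an XML/HTML tree builder should accept.

    `raw` is the OCR engine's hOCR output as a byte string. This is an
    assumed input for illustration, not ocrodjvu's actual interface.
    """
    # Invalid UTF-8 sequences become U+FFFD instead of raising
    # UnicodeDecodeError.
    text = raw.decode('utf-8', 'replace')
    # Remove control characters that are not XML compatible.
    return _XML_INVALID.sub(u'', text)
```

With something like this in front of the parser, a stray 0xab byte would
show up as U+FFFD in the text layer rather than aborting the page.
Whether silent replacement is acceptable, or a warning should be printed
as well, is a separate question.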