Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

Thomas Koch Sun, 06 May 2012 11:57:17 -0700

Package: ocrodjvu
Version: 0.7.9-1
Severity: normal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256


Hi,

I already reported the same problem against gscan2pdf. Tesseract occasionally
emits bytes that are not valid UTF-8. So far I have only tried German (deu).
Even if this turns out to be a bug in Tesseract, it would be good if ocrodjvu
could continue working instead of crashing.
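
To illustrate what "continue working" could mean, here is a minimal sketch of
tolerant decoding. It is not taken from ocrodjvu: the function name
decode_ocr_output is hypothetical, and replacing bad input with U+FFFD (rather
than dropping it or aborting the page) is an assumed policy. It is written for
Python 3, whereas the tracebacks in this report come from Python 2.7.

```python
import re

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are fine).
_XML_INCOMPATIBLE = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')


def decode_ocr_output(raw):
    """Decode OCR output bytes without raising on invalid UTF-8.

    Hypothetical helper, not part of ocrodjvu. Invalid byte sequences and
    XML-incompatible control characters both become U+FFFD.
    """
    text = raw.decode('utf-8', errors='replace')
    return _XML_INCOMPATIBLE.sub('\ufffd', text)


if __name__ == '__main__':
    # 0xab on its own is not a valid UTF-8 start byte.
    print(repr(decode_ocr_output(b'Gr\xabn')))
```

The replacement character keeps the damaged position visible in the text
layer, so a bad page can still be found later.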

First exception, without the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 434, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 366, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 223, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
    if node.text:
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte

And with the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 416, in extract_text
    doc = html5_support.parse(stream)
  File "/usr/share/ocrodjvu/lib/html5_support.py", line 24, in parse
    namespaceHTMLElements=False
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 38, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 211, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 111, in _parse
    self.mainLoop()
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 174, in mainLoop
    self.phase.processCharacters(token)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 948, in processCharacters
    self.tree.insertText(token["data"])
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/_base.py", line 288, in insertText
    parent.insertText(data)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I checked that the corresponding HTML files did indeed contain bytes that are
not valid UTF-8.
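
For anyone who wants to repeat such a check, one possible way is sketched
below. It is not necessarily how the check was done for this report. The
function name find_invalid_utf8 is hypothetical, and the code is written for
Python 3. It reports the offset and value of each byte at which UTF-8 decoding
fails.

```python
import sys


def find_invalid_utf8(data):
    """Return (offset, byte) pairs where UTF-8 decoding of `data` fails.

    Hypothetical helper for illustration. It rescans the remaining bytes
    after each error, which is simple but slow for large inputs with many
    errors.
    """
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the rest of the input is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            bad.append((pos + exc.start, data[pos + exc.start]))
            pos += exc.end
    return bad


if __name__ == '__main__':
    # Usage: python check_utf8.py page.html
    with open(sys.argv[1], 'rb') as stream:
        for offset, byte in find_invalid_utf8(stream.read()):
            print('offset %d: byte 0x%02x' % (offset, byte))
```

For the input b'a\xabb' this reports byte 0xab at offset 1, the same byte and
position named in the UnicodeDecodeError above.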

Best regards,

Thomas Koch

- -- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages ocrodjvu depends on:
ii  djvulibre-bin                3.5.25.2-4
ii  python                       2.7.2-10
ii  python-argparse              1.2.1-2
ii  python-djvu                  0.3.9-1
ii  python2.7 [python-argparse]  2.7.3~rc2-2.1

Versions of packages ocrodjvu recommends:
ii  ocropus        <none>
ii  python-lxml    2.3.2-1
ii  python-pyicu   1.3-1
ii  tesseract-ocr  3.02.01-4

Versions of packages ocrodjvu suggests:
pn  cuneiform  <none>
pn  gocr       <none>
pn  ocrad      <none>

- -- no debconf information

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQIcBAEBCAAGBQJPpslmAAoJEAf8SJEEK6Za6sgQALourAgqH2xqSPjFpvVoYeZ+
k65JAdaIC2rEYoOF4hSt5wt/fIyAWMLZlaZX3pFkroVazfcyqFK2lYbmiT0x+q/9
XDuKqoaxISmCZ7QF1YJqGNnT36s98HW5VP6aupSezETCCyZOqgLd+aXVkwFi7NDW
9ItKvzWySfIU3HzOF1xxjipJYu9/698rb+DBUUd0ilmIdLJx+x3wT+gFBWbA5xJr
24m3kdJrox86zKANBRzlznpSRUZNIGKjJS8Y5M/JDpKlK9NVYOgjRBwZoq0JAyLx
cgM4MYaUDxOw4BcTFCpwi5CjB9DDKcFhaCV2EeUtxur4pDrZs9sAGmDdqenfptap
Dt9fcaW9GFs8mNbnf9cjrOOtL1f9o2CDZBi2MQ5RdnwjIPpRH1jKxOYL0Mq44PUA
E2MqaVdRU1at009luvLVy/PQntzZrualByzcboOEkh7TUjjfQSjBc7k3piKwie1b
q+wRbNu0Ifz5jJqTzKRk2pdNOviJTWuV3LlbMdM0l0pHbpumFs0lunUaulvTS+zy
ttG52UpC2m+6ngJ81v+cbmv3uXF3N8Kp7LwlcyKxgROcKi8T+y30HfUjRUrhna2B
xxaUtaCysSmc84pS3ps29XHv7B6TwQQ9kyV7d/nf28r9CmDMevn6S+Wz8KDFAAeL
EvOww4b8vJ7IOS66394u
=WLq+
-----END PGP SIGNATURE-----
