Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-11 Thomas Koch
Jakub Wilk:
> I believe this is now fixed in upstream VCS. Would you mind giving it a
> try? You can download the snapshot at:
> https://bitbucket.org/jwilk/ocrodjvu/get/6a8a22af7232.tar.gz
Hi Jakub, I tested the current HEAD (9bf208e1f372) from the Mercurial repo and the --fix-utf8 option works

Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-07 Jakub Wilk
I believe this is now fixed in upstream VCS. Would you mind giving it a try? You can download the snapshot at: https://bitbucket.org/jwilk/ocrodjvu/get/6a8a22af7232.tar.gz

Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-07 Jakub Wilk
* Thomas Koch, 2012-05-07, 08:35:
> attached the minimal hocr test file that Jeffrey Ratcliffe uses.
Thanks. Ideally the HTML parser would take care of handling such errors, but that is not the case: https://bugs.launchpad.net/lxml/+bug/690110 http://bugs.debian.org/671842 I'll probably implemen
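For readers following the thread, here is a minimal sketch of one way to make hOCR bytes safe before handing them to an HTML parser. It is not ocrodjvu's actual fix, which is not shown in this thread; the function name `fix_utf8` and the sample bytes are hypothetical and chosen only for illustration. It relies solely on Python's built-in `bytes.decode` with the `replace` error handler.

```python
def fix_utf8(data: bytes) -> bytes:
    """Return data as valid UTF-8 (hypothetical helper, not ocrodjvu code).

    Bytes that cannot be decoded as UTF-8 are replaced with U+FFFD,
    the Unicode replacement character; valid input is left unchanged.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Illustrative hOCR fragment containing Latin-1 bytes (0xFC, 0xDF),
# which are not valid UTF-8 in this position.
sample = b"<span class='ocr_line'>Gr\xfc\xdfe</span>"
cleaned = fix_utf8(sample)

# The cleaned bytes now decode without raising UnicodeDecodeError.
print(cleaned.decode("utf-8"))
```

Replacing the bad bytes loses the original characters but lets processing continue, which is the behaviour requested in the original report. Whether to replace, drop, or reinterpret such bytes is a design choice left open here.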

Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-06 Thomas Koch
Jakub Wilk:
> Could you attach the HTML files to the bug report (or, alternatively,
> send them to me in a private mail)?
Hi Jakub, thank you for responding so quickly. I reported the same issue to gscan2pdf and attached the minimal hocr test file that Jeffrey Ratcliffe uses. The tesseract utf-

Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-06 Jakub Wilk
* Thomas Koch, 2012-05-06, 20:56:
> Tesseract tends to produce non-UTF-8 characters from time to time. I tried only German (deu) so far. Even if that seems to be an error with tesseract, it would be good if ocrodjvu could continue working. first exception without --html5 option Traceback (mos

Bug#671764: ocrodjvu crashes on non-utf8 from tesseract

2012-05-06 Thomas Koch
Package: ocrodjvu
Version: 0.7.9-1
Severity: normal
Hi, I already reported the same problem on gscan2pdf. Tesseract tends to produce non-UTF-8 characters from time to time. I tried only German (deu) so far. Even if that seems to be an error with te