Great to see the most recent test run passed, even if it is with liberal
application of "expect failure". The Canonical powers that be should be
appeased for the moment. I appreciate the last-minute effort to get this,
and Tesseract, into the next Ubuntu.

Thinking about the failures, I suspect that the endian issues are now
within Tesseract, not Leptonica. test_deskew passes, and this test skips
Tesseract entirely. It uses Leptonica to deskew a monochrome image and
confirm it was deskewed. I think it's extremely unlikely all that bit
twiddling would work if Leptonica had the wrong endianness. (Although there
could be individual Leptonica APIs that do not work on big endian.)

The failure that surprised me is "test_tesseract_config_notfound". It
passes Tesseract a configuration file that doesn't exist, but it turns out
Tesseract proceeds with OCR rather than aborting in this case, so this isn't
informative.

Based on the failures I suspect the following command line will exit with
SIGSEGV:
  tesseract -l eng -c textonly_pdf=1 --user-words wordlist.txt \
      tests/resources/crom.png out pdf txt

where wordlist.txt is a file containing some words separated by newlines
and tests/resources/crom.png is distributed with OCRmyPDF. (If it does work,
the PDF will be a blank page containing text with no image.)
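
If it is easier to check this from a script than by hand, here is a rough
Python sketch that runs the same command and reports whether the process
died from a signal. The helper names describe_exit and run_repro are my own
inventions for illustration and are not part of OCRmyPDF; the file paths are
the ones mentioned above and are assumed to exist in the working directory.
It relies on the POSIX behaviour that subprocess reports a signal death as a
negative return code.

```python
import shutil
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


def run_repro(wordlist="wordlist.txt", image="tests/resources/crom.png"):
    """Run the suspected crashing command (hypothetical helper).

    Returns None when no tesseract binary is found on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    cmd = [
        "tesseract", "-l", "eng", "-c", "textonly_pdf=1",
        "--user-words", wordlist, image, "out", "pdf", "txt",
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return describe_exit(result.returncode)
```

Calling run_repro() on the affected architecture and printing the result
should be enough to confirm or rule out the segfault guess; a result of
"killed by SIGSEGV" would confirm it.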

Most of the failing tests have something to do with setting non-default
configuration variables for Tesseract.
