I'm the ocrmypdf upstream author.

First, be aware that the output of OCR and autorotate is cached in the test
suite, and the results are persisted between test cases and runs of the
test suite in the tests/cache folder. The cache hit/miss check is not smart
enough to pick up changes that aren't reflected in leptonica's version
number, that is, Debian changes. However, it looks to me like the test
suite is being run against a temporary folder, which should remove cache
effects. Nuke the tests/cache folder between test suite runs to be sure.

All the failing tests relate to "check_monochrome_correlation", a function
that checks for close but not identical visual output compared to a
reference. Because of a now-fixed leptonica bug in one of the underlying
functions, I actually have a separate test that validates this helper
function, and that test passes on big endian.
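
For illustration only, a "close but not identical" check on monochrome
images can be sketched as below. This assumes one common definition of
binary correlation, (|A and B|)^2 / (|A| * |B|) over foreground pixels; the
function name and the list-of-rows image format are hypothetical, and this
is not necessarily what check_monochrome_correlation does.

```python
def binary_correlation(a, b):
    """Correlation of two same-sized 1-bit images given as rows of 0/1.

    Hypothetical sketch: returns 1.0 for identical foregrounds and 0.0
    when the foregrounds do not overlap at all.
    """
    fg_a = sum(sum(row) for row in a)  # foreground pixel count in a
    fg_b = sum(sum(row) for row in b)  # foreground pixel count in b
    if fg_a == 0 or fg_b == 0:
        return 0.0
    # Pixels that are foreground in both images.
    both = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return (both * both) / (fg_a * fg_b)


reference = [[1, 1], [0, 0]]
candidate = [[1, 0], [0, 0]]
score = binary_correlation(reference, candidate)
```

A test would then assert that the score is above some threshold rather than
requiring pixel-exact equality.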

The log shows that tesseract failed to properly detect page orientation and
came back with a low confidence answer. I interpret that to mean there are
endian issues in either tesseract or leptonica; the test isn't able to
distinguish.

It seems that the problem is either a big endian issue in tesseract alone
(perhaps affecting multiple versions) or some leptonica API that tesseract
invokes while doing a page orientation check. Tesseract does not have much
of a test suite, so it probably doesn't check for consistency here.

It looks like the patch is safe to apply and would be a net improvement
even though it doesn't fix all of the issues my test suite finds.


You can check orientation (skipping full OCR) in tesseract 3.04.01 with:

$ tesseract -l eng -psm 0 test_image.png stdout

The output for LinnSequencer.jpg on my macOS-x64 machine is:

$ tesseract -l eng -psm 0 tests/resources/LinnSequencer.jpg stdout
Warning in pixReadMemJpeg: work-around: writing to a temp file
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95

From the logs, tesseract reports (orientation, confidence) = (0, 1.32) for
the same page on big endian, which means whatever data it is examining is
much noisier, i.e. probably corrupted by endian swizzling. Quite likely the
OCR output is garbage as well.
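
To compare platforms mechanically, the two orientation fields can be pulled
out of the output shown above. This is a minimal sketch that assumes the
output format printed by tesseract 3.04.01; parse_osd is a hypothetical
helper, not part of ocrmypdf or tesseract.

```python
import re


def parse_osd(text):
    """Return (orientation_degrees, confidence) from `-psm 0` output.

    Hypothetical helper; assumes the field labels shown above.
    """
    orientation = re.search(
        r"^Orientation in degrees:\s*(\d+)", text, re.MULTILINE)
    confidence = re.search(
        r"^Orientation confidence:\s*([\d.]+)", text, re.MULTILINE)
    if not orientation or not confidence:
        raise ValueError("no orientation data found in tesseract output")
    return int(orientation.group(1)), float(confidence.group(1))


# The macOS-x64 output quoted above, used as sample input.
SAMPLE = """\
Warning in pixReadMemJpeg: work-around: writing to a temp file
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95
"""

print(parse_osd(SAMPLE))  # (0, 31.48)
```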

It might be interesting to see what the behavior differences are for
leptonica 1.73-patched, 1.74 and tesseract 3.04.01 and 4.00alpha all on big
endian. The results matrix from those combinations would probably indicate
whether to blame tesseract or leptonica.
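
One way to keep track of those combinations is a small table keyed by
(leptonica, tesseract) version. The sketch below only enumerates the four
cells and prints them as untested; the names are hypothetical, and each
cell is meant to be filled with the (orientation, confidence) pair observed
on big endian.

```python
from itertools import product

# Versions named above; every result starts out unknown.
LEPTONICA = ["1.73-patched", "1.74"]
TESSERACT = ["3.04.01", "4.00alpha"]


def empty_matrix():
    """Map each (leptonica, tesseract) pair to None, meaning untested."""
    return {combo: None for combo in product(LEPTONICA, TESSERACT)}


def format_matrix(results):
    """Render one line per combination, in sorted version order."""
    lines = []
    for (lept, tess), value in sorted(results.items()):
        shown = "untested" if value is None else "(%d, %.2f)" % value
        lines.append(
            "leptonica %-13s tesseract %-10s %s" % (lept, tess, shown))
    return "\n".join(lines)


matrix = empty_matrix()
print(format_matrix(matrix))
```

If the confidence collapses in a whole row or column of the filled-in
table, that points at the corresponding library.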
