I'm the ocrmypdf upstream author. First, be aware that the test suite caches the output of OCR and autorotate in the tests/cache folder, and those results persist between test cases and between runs of the test suite. The cache hit/miss check is not smart enough to pick up changes that aren't reflected in leptonica's version number, such as Debian-specific patches. However, it looks to me like the test suite is being run against a temporary folder, which should remove cache effects. To be sure, delete the tests/cache folder between test suite runs.
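If it helps, clearing the cache can be scripted. This is only a minimal sketch: clear_test_cache is a hypothetical helper (not part of ocrmypdf), and the default path assumes it is run from the root of the source tree.

```python
import shutil
from pathlib import Path


def clear_test_cache(cache_dir="tests/cache"):
    """Delete the cache directory if it exists.

    Returns True if a directory was removed, False if there was nothing to do.
    The default path is an assumption: it only makes sense when the script is
    run from the root of the source tree.
    """
    path = Path(cache_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

Calling clear_test_cache() before each run of the test suite should rule out stale cached results.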
All the failing tests relate to "check_monochrome_correlation", a function that checks for close but not identical visual output compared to a reference. Because of a now-fixed leptonica bug in one of the underlying functions, I actually have a separate test that validates this helper function, and that test passes on big endian.

The log shows that tesseract failed to properly detect page orientation and came back with a low confidence answer. I interpret that to mean there are endian issues in either tesseract or leptonica; the test isn't able to distinguish between them. The problem may be a big endian issue in tesseract alone (perhaps affecting multiple versions), or it may be in some leptonica API that tesseract invokes while doing a page orientation check. Tesseract's test suite is very limited and probably doesn't check for consistency here.

It looks like the patch is safe to apply and would be a net improvement, even though it doesn't fix all of the issues my test suite finds.

You can check orientation (skipping full OCR) in tesseract 3.04.01 with:

    $ tesseract -l eng -psm 0 test_image.png stdout

The output for LinnSequencer.jpg on my macOS-x64 machine is:

    $ tesseract -l eng -psm 0 tests/resources/LinnSequencer.jpg stdout
    Warning in pixReadMemJpeg: work-around: writing to a temp file
    Page number: 0
    Orientation in degrees: 0
    Rotate: 0
    Orientation confidence: 31.48
    Script: Latin
    Script confidence: 100.95

From the logs, tesseract reports (orientation, confidence) = (0, 1.32) for the same page on big endian, compared with (0, 31.48) above. That means whatever data it is examining is much noisier, i.e. probably corrupted by endian swizzling. Quite likely the OCR output is garbage as well.

It might be interesting to see what the behavior differences are for leptonica 1.73-patched and 1.74, combined with tesseract 3.04.01 and 4.00alpha, all on big endian. The results matrix from those combinations would probably indicate whether to blame tesseract or leptonica.
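
To fill in such a matrix, the orientation and confidence values could be pulled out of the -psm 0 output with something like the following. This is a rough sketch, not code from ocrmypdf: parse_osd and run_osd are hypothetical helper names, the field labels are taken from the tesseract 3.04.01 output quoted above, and other tesseract versions may print different labels or expect a different spelling of the page segmentation flag.

```python
import subprocess


def parse_osd(text):
    """Extract orientation and confidence from 'tesseract -psm 0' output.

    Field labels are assumed to match the 3.04.01 output quoted in this
    report. Unrelated lines (such as leptonica warnings) are ignored.
    """
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Orientation in degrees":
            result["orientation"] = int(value)
        elif key == "Orientation confidence":
            result["confidence"] = float(value)
    return result


def run_osd(image, tesseract="tesseract"):
    """Run the orientation check on one image and parse the result.

    Uses the same command line as shown in this report.
    """
    proc = subprocess.run(
        [tesseract, "-l", "eng", "-psm", "0", image, "stdout"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_osd(proc.stdout)
```

Running run_osd("tests/resources/LinnSequencer.jpg") under each leptonica/tesseract combination and recording the confidence would give one cell of the matrix per run.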