Bug#1025856: tesseract-ocr: (regression) File descriptors no longer work for input (“Error in fopenReadStream: file not found”)

2022-12-10 Thread debbug . tesseract

Package: tesseract-ocr
Version: 4.1.1-2.1
Severity: normal
X-Debbugs-Cc: debbug.tesser...@sideload.33mail.com

I have this line in an old shell script:

  $ tesseract <(convert "$jpgFile" +dither -colors 2 -normalize -resize 1 pbm:-) - -l eng

Today that line fails with this output:

  =8<--
  Error in fopenReadStream: file not found
  Error in pixRead: image file not found: P4
  Image file P4 cannot be read!
  Error during processing.
  =8<--

Because the command comes from an old shell script, it presumably
worked at some point. Version 4.1.1-2.1 of tesseract-ocr, however,
cannot read input supplied through shell process substitution.

This report actually covers two bugs:

  1) tesseract-ocr fails to process shell-substituted files.

  2) tesseract-ocr does not tell the user what went wrong. It should
     give a graceful error message. That is, if there is no intent to
     support shell-substituted files, then the app should detect when
     such a file is specified and state in plain English that
     shell-substituted files are unsupported. The man page should
     also disclose this limitation, either in the paragraph that
     covers the input file spec or in a new section titled
     “LIMITATIONS”. Or, if there is intent to support substitution
     files, the man page should state that explicitly.

Workaround:

  If ImageMagick is executed separately to populate a regular file,
  tesseract has no problem with using that regular file as input.
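
The workaround can be sketched as a small shell function. This is only
an illustration under stated assumptions: the function name
ocr_via_tempfile and the file name page.pbm are made up for the
example, any ImageMagick options are passed through from the caller
unchanged, and the .pbm suffix is relied on to make convert write PBM.

```shell
# Hypothetical helper (not part of tesseract or ImageMagick):
# OCR an image by way of a regular temporary file instead of
# process substitution.
# Usage: ocr_via_tempfile IMAGE [convert options...]
ocr_via_tempfile() {
    [ "$#" -ge 1 ] || return 2
    jpg_file=$1
    shift

    tmp_dir=$(mktemp -d) || return 1

    # Write the converted image to a regular file rather than to
    # "pbm:-" (stdout); remaining arguments are convert options.
    # In tesseract, "-" as the output base prints the text on stdout.
    convert "$jpg_file" "$@" "$tmp_dir/page.pbm" &&
        tesseract "$tmp_dir/page.pbm" - -l eng
    status=$?

    # Remove the temporary directory whether or not OCR succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}
```

It would be called as, for example,
ocr_via_tempfile "$jpgFile" +dither -colors 2 -normalize
with whatever conversion options the original script used.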

-- System Information:
Debian Release: 11.5
  APT prefers stable-updates
  APT policy: (990, 'stable-updates'), (990, 'stable-security'), (990, 'testing'), (990, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages tesseract-ocr depends on:
ii  libarchive13         3.4.3-2+deb11u1
ii  libc6                2.31-13+deb11u5
ii  libcairo2            1.16.0-5
ii  libfontconfig1       2.13.1-4.2
ii  libgcc-s1            10.2.1-6
ii  libglib2.0-0         2.66.8-1
ii  libicu67             67.1-7
ii  liblept5             1.79.0-1.1
ii  libpango-1.0-0       1.46.2-3
ii  libpangocairo-1.0-0  1.46.2-3
ii  libpangoft2-1.0-0    1.46.2-3
ii  libstdc++6           10.2.1-6
ii  libtesseract4        4.1.1-2.1
ii  tesseract-ocr-eng    1:4.00~git30-7274cfa-1.1
ii  tesseract-ocr-osd    1:4.00~git30-7274cfa-1.1

tesseract-ocr recommends no packages.

tesseract-ocr suggests no packages.

-- no debconf information

Bug#1027985: tesseract-ocr: document gets rotated on its side when converting from jpg to pdf

2023-01-05 Thread debbug . tesseract

Package: tesseract-ocr
Version: 4.1.1-2.1
Severity: normal
X-Debbugs-Cc: debbug.tesser...@sideload.33mail.com

When tesseract is fed a JPG image of an upright document and
instructed to produce a searchable PDF, it rotates the image onto its
side. Judging from the text layer, the rotation apparently happens
before OCR is performed: pdf2txt shows the text as one character per
line. This is the command used:

  $ tesseract color_document.jpg sideways_doc -l eng+nld pdf

The workaround is quite ugly:

  $ pdftk sideways_doc.pdf cat 1-r1east output upright_doc.pdf
  $ ocrmypdf --force-ocr -l eng+nld upright_doc.pdf proper.pdf

I don’t think this bug affects every document. It’s perhaps trying to
be smart and detect the orientation of the doc & misjudging it. If
that’s true, it’s a shame that tesseract does this automatically and
beyond the control of the user. There is no option to force tesseract
to leave the orientation as-is.
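
One possible way to test the misdetection theory is to look at what
the image metadata and tesseract's own orientation detection say
about the file. The sketch below is only a diagnostic idea, not a
fix: the function name orientation_hints is made up, it assumes
ImageMagick's identify is installed, and it assumes the OSD data
(tesseract-ocr-osd) is available for the --psm 0 pass.

```shell
# Hypothetical diagnostic helper: print orientation hints for an image.
# Usage: orientation_hints IMAGE
orientation_hints() {
    img=$1

    # EXIF orientation tag as stored in the file. A value of 1 means
    # "upright"; an empty value means no tag is present.
    printf 'EXIF orientation: %s\n' \
        "$(identify -format '%[EXIF:Orientation]' "$img")"

    # --psm 0 runs orientation and script detection only, without OCR,
    # and "stdout" as the output base prints the result on stdout.
    tesseract "$img" stdout --psm 0
}
```

If the OSD pass reports a rotation for a page that is actually
upright, that would support the misdetection theory; a non-default
EXIF tag would instead point at the image metadata.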

-- System Information:
Debian Release: 11.5
  APT prefers stable-updates
  APT policy: (990, 'stable-updates'), (990, 'stable-security'), (990, 'testing'), (990, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages tesseract-ocr depends on:
ii  libarchive13         3.4.3-2+deb11u1
ii  libc6                2.31-13+deb11u5
ii  libcairo2            1.16.0-5
ii  libfontconfig1       2.13.1-4.2
ii  libgcc-s1            10.2.1-6
ii  libglib2.0-0         2.66.8-1
ii  libicu67             67.1-7
ii  liblept5             1.79.0-1.1
ii  libpango-1.0-0       1.46.2-3
ii  libpangocairo-1.0-0  1.46.2-3
ii  libpangoft2-1.0-0    1.46.2-3
ii  libstdc++6           10.2.1-6
ii  libtesseract4        4.1.1-2.1
ii  tesseract-ocr-eng    1:4.00~git30-7274cfa-1.1
ii  tesseract-ocr-osd    1:4.00~git30-7274cfa-1.1

tesseract-ocr recommends no packages.

tesseract-ocr suggests no packages.

-- no debconf information
