branch: externals/scanner
commit e49b4ae7fbf404937b5459661e307f388812ec79
Author: Raffael Stocker <r.stoc...@mnet-mail.de>
Commit: Raffael Stocker <r.stoc...@mnet-mail.de>

    add recommendations for improving scan and OCR quality
---
 scanner.texi | 134 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 108 insertions(+), 26 deletions(-)

diff --git a/scanner.texi b/scanner.texi
index d4b5a176aa..12c3c91b88 100644
--- a/scanner.texi
+++ b/scanner.texi
@@ -828,35 +828,117 @@ format as for @code{scanner-scanimage-switches}.  The default is nil.
 @cindex scan quality, improving
 @cindex quality, improving
 
-include tips for improving quality, ask for contributions
+This chapter comprises recommendations for improving the scan or OCR
+quality.  If you know of additional tips and tricks for improving
+quality, please let the author know.
 
-@itemize
-@item
-tesseract v4
-@item
-see @url{https://tesseract-ocr.github.io/tessdoc/} for tesseract documentation
-@item
-set left and top margin in scanimage
-@item
-use brightness and contrast for scanimage
-@item
-use at least 300 DPI for OCR, but more than 600 only increase document
-size without improving OCR
-@item
-use the right language, not multiple languages for a single-language document
-@item
-use unpaper to deskew the page
-@item
-use unpaper deskewing options
-@item
-use unpaper middle wipe etc.
-@item
-have a look at
+Besides checking the following sections, you might also want to consult
+the documentation for @command{tesseract},
+@url{https://tesseract-ocr.github.io/tessdoc/}, and @command{unpaper},
 @url{https://github.com/unpaper/unpaper/blob/main/doc/basic-concepts.md}
-and
+(basics) and
 @url{https://github.com/unpaper/unpaper/blob/main/doc/image-processing.md}
-for detailed information on how to use unpaper
-@end itemize
+(details).
+
+
+@menu
+* Improving General Scan Quality::
+* Improving OCR::
+@end menu
+
+@node Improving General Scan Quality
+@section Improving General Scan Quality
+@cindex improving general scan quality
+
+@table @asis
+@item Image format
+As a lossy format, JPEG is not a good basis for later OCR.  Therefore,
+use PNG, PNM, or TIFF for document scanning.  If you also use
+@command{unpaper}, the image format is forced to PNM, as required by
+this tool.
+
+@item Scan area
+Besides setting the size of the scan area, @command{scanimage} allows
+you to specify offsets to the top and left edges.  The device-specific
+switches @option{-l} and @option{-t}, if available, allow you to specify
+the top-left x and y positions, respectively.  Use these to crop away
+blacked-out corners that result from misalignment between the scan
+area and the scanned sheet.
+
+@item Resolution
+For document scans, use at least 300 DPI to achieve acceptable OCR
+results.  A resolution above 600 DPI will not enhance OCR quality any
+further and only leads to larger files.  For most documents, 300 DPI
+should be sufficient.
+
+@item Brightness and contrast
+Good document quality (and especially good OCR results) requires
+sufficient contrast and a good reproduction of the document's background
+color.  If the defaults of your device are inadequate, use the
+brightness and contrast settings of @command{scanimage} to provide
+sensible values.  See @ref{Configuring scanimage} and the options
+@code{scanner-brightness} and @code{scanner-contrast} there.  You may
+want to try a low brightness setting (for example, 20) and a medium
+contrast setting (for example, 50) as a start.
+
+Note that the underlying parameters to @command{scanimage} are
+device-specific.  If the two mentioned options are not supported by your
+device, you may be able to use @code{scanner-scanimage-switches} to
+supply the specific switches to @command{scanimage}.
+
+@item Dark areas and shadows
+If your scan shows dark (black/gray) areas or shadows, for example in
+the fold when scanning a book, use @command{unpaper} to remove them.  If
+it cannot remove these areas automatically, you can manually specify an
+area to be wiped out using the @option{--wipe} switch of
+@command{unpaper}.  If you scan a page with the ``double'' layout and
+want to remove the shadow of a book fold, use the @option{--middle-wipe}
+switch.  You can put these switches into the
+@code{scanner-unpaper-switches} option.  See also the @command{unpaper}
+documentation.
+
+@item Page borders
+If you use @command{unpaper}, it will try to remove dark areas around
+the edges of the page.  If this does not work automatically, use the
+@code{scanner-unpaper-border} option to specify a border (in pixels)
+around the edges of the page that is to be wiped.  See @ref{Configuring
+unpaper}.
+
+@end table
+
+@node Improving OCR
+@section Improving OCR
+@cindex improving ocr
+
+@table @asis
+@item Tesseract version
+Use version 4 or higher of @command{tesseract}.  This version includes a
+new OCR engine that delivers better results than the previous one.
+Also, @command{tesseract} is multithreaded starting from version 4 and
+is therefore faster on multi-core machines.
+
+@item Language setting
+Tesseract allows you to use multiple languages.  For single-language
+documents, however, this doesn't seem to be optimal.  It's best to
+choose a single language when possible.
+
+@item Page deskewing
+OCR is quite sensitive to any skew of a page.  Use @command{unpaper} to
+deskew the pages.  See @ref{Configuring unpaper}.
+
+In some cases, @command{unpaper} may not be able to deskew a page
+automatically.  If so, have a look at the deskewing switches of
+@command{unpaper}.  In particular, @option{--deskew-scan-step},
+@option{--deskew-scan-deviation}, and @option{--deskew-scan-range} can
+be helpful.  You can put those switches into
+@code{scanner-unpaper-switches}.  See the @command{unpaper}
+documentation for details.
+
+@end table
+
+
+
+
 
 
 @node News
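
For readers who want to try the recommendations from this patch, the following init-file fragment is a minimal sketch and not part of the commit. The option names are those mentioned in the patch text. The value formats (a list of strings for the two switch options, plain integers for brightness and contrast) and all concrete numbers are assumptions chosen for illustration; check the package's customization interface and the scanimage and unpaper documentation for what your device and version accept.

```elisp
;; Hypothetical starting values; none of these are tested defaults.

;; Low brightness and medium contrast, the starting point the patch suggests.
(setq scanner-brightness 20
      scanner-contrast 50)

;; Assumption: extra scanimage switches are given as a list of strings.
;; -l and -t are device-specific; here they offset the scan area from the
;; left and top edges to avoid blacked-out corners.
(setq scanner-scanimage-switches '("-l" "3" "-t" "3"))

;; Assumption: extra unpaper switches use the same list-of-strings format.
;; Widening the deskew scan range may help when automatic deskewing fails;
;; the value 10 is illustrative only.
(setq scanner-unpaper-switches '("--deskew-scan-range" "10"))
```

If the device does not support the brightness and contrast options, the patch text suggests passing the device-specific switches through scanner-scanimage-switches instead.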
