branch: externals/scanner
commit eaa6deba0fb9511c1c51050c37e9d4e55bdce97d
Author: Raffael Stocker <r.stoc...@mnet-mail.de>
Commit: Raffael Stocker <r.stoc...@mnet-mail.de>

    add more option and command documentation
---
 .gitignore   |   5 +
 scanner.texi | 407 ++++++++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 309 insertions(+), 103 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0381e802c3..6a7296b1ef 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,3 +12,8 @@ TAGS
 *.png
 *.pdf
 *.txt
+*.aux
+*.cp
+*.cps
+*.toc
+*.info
diff --git a/scanner.texi b/scanner.texi
index d99c9080ae..4c62296e67 100644
--- a/scanner.texi
+++ b/scanner.texi
@@ -63,7 +63,6 @@ The document was typeset with
 * User Options::
 * Improving Scan Quality::
 * Reporting Bugs::
-* Hacking::
 * GNU Free Documentation License::
 * Index::
 @end menu
@@ -76,10 +75,8 @@ The document was typeset with
 
 @menu
 * Introduction::
-* Principle of Operation::
 * Basic Setup::
-* Scanning Documents::
-* Scanning Images::
+* Scanning Documents and Images::
 @end menu
 
 @node Introduction
@@ -139,20 +136,6 @@ uses the customization system of GNU Emacs
 necessary settings and takes care of processing using the
 abovementioned programs.
 
-
-@node Principle of Operation
-@section Principle of Operation
-@cindex principle of operation
-@cindex operation, principle of
-
-@c sequence of program calls
-@c use of temporary files
-@c re-entrancy (but maybe mangled log)
-@c log buffer for diagnosis
-@c data and configuration files/directories needed by the programs
-@c use of more than one device: select new device between scans
-
-
 @node Basic Setup
 @section Basic Setup
 @cindex basic setup
@@ -160,41 +143,127 @@ abovementioned programs.
 @cindex configuration, basic
 @cindex installation
 
-Basic setup items:
-- program executables
-- paper/image size
-- nothing with unpaper
-- file formats and outputs
+To get started with Scanner, make sure the following programs are
+installed:
+@table @command
+@item scanimage
+Scanimage comes with the sane-backends distribution, see
+@url{http://sane-project.org/}. 
+
+@item tesseract
+Tesseract is used for OCR and PDF generation in document scans.  The
+source is available at @url{https://github.com/tesseract-ocr/tesseract}.
+
+@item unpaper
+Unpaper is used for post-processing the scans obtained from
+@command{scanimage} before feeding them into @command{tesseract}.  This
+is optional, but highly recommended.  The source is available at
+@url{https://github.com/unpaper/unpaper}. 
+@end table
 
-Just name the important items, refer to the detailed chapter for more
-information.
+Tesseract is usually provided without the language data files as they
+are very large.  The full set of language files is over 4@dmn{GB}.  Some
+GNU/Linux distributions offer individual language packages; if yours
+does not, you can download the language data files from
+@url{https://github.com/tesseract-ocr/tessdata}.
 
-@node Scanning Documents
-@section Scanning Documents
-@cindex scanning documents
-@cindex documents, scanning
+Make sure the options @code{scanner-scanimage-program},
+@code{scanner-tesseract-program}, and @code{scanner-unpaper-program} are
+set correctly.  Also, the options @code{scanner-tessdata-dir} and
+@code{scanner-tesseract-configdir} must be set correctly so
+@command{tesseract} can find the language data files and output
+configurations. 
 
-describe functions for document scanning
-refer also to menu entries
+Customize the basic options like @code{scanner-doc-papersize},
+@code{scanner-resolution}, @code{scanner-tesseract-languages}, and
+@code{scanner-tesseract-outputs}.  See @ref{User Options} for a detailed
+discussion of all the available options.
 
 
+@node Scanning Documents and Images
+@section Scanning Documents and Images
+@cindex scanning documents and images
+
+The Scanner package provides two commands for scanning documents and
+images.  These are described below.
+
 @table @kbd
 @item M-x scanner-scan-document
-Scan a document.
+@itemx C-u M-x scanner-scan-document
+@itemx C-u N M-x scanner-scan-document
+Scan a document.  When called without a prefix argument, this command
+will scan only one page.  When called with the default prefix argument
+(as @kbd{C-u M-x scanner-scan-document}), it will ask after each scanned
+page whether another page should be scanned.  With a numeric prefix
+argument, it will scan that many pages, waiting a number of seconds
+between each page, as configured in @code{scanner-scan-delay}.
+
+The scan will use the resolution configured in
+@code{scanner-resolution} with the @code{:doc} key.
+
+This command interactively reads a file name that will
+be used as the base name of the output file(s).  The extension of the
+file name is ignored as it is instead specified by the
+@command{tesseract} output formats as configured with the option
+@code{scanner-tesseract-outputs} or the command
+@code{scanner-select-outputs}.
+If the specified file already exists, @code{scanner-scan-document} will
+ask for confirmation to overwrite it.
+
+This command will trigger auto-detection if no device has been
+configured.  If more than one device is available, it will ask
+you to select one.
+
+If you configured Scanner to use @command{unpaper}, this command will
+post-process the scans obtained from @command{scanimage} using
+@command{unpaper} before feeding the results to @command{tesseract}.
+See @ref{User Options} to find out how to configure scan and
+post-processing.
+
+The scanning and conversion processes are run asynchronously.  If you
+want to monitor progress, bring up the @code{*Scanner*} buffer which
+collects the outputs of the backend programs.
+
+This command is also available from the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Scan a document}@*
+for a single-page scan and@*
+@clicksequence{Tools @click{} Scanner @click{} Scan a multi-page document}@*
+for a multi-page scan.
+
+@item M-x scanner-scan-image
+@itemx C-u M-x scanner-scan-image
+@itemx C-u N M-x scanner-scan-image
+Scan an image.  When called without a prefix argument, this command
+will scan only one image.  When called with the default prefix argument
+(as @kbd{C-u M-x scanner-scan-image}), it will ask after each scanned
+image whether another image should be scanned.  With a numeric prefix
+argument, it will scan that many images, waiting a number of seconds
+between each image, as configured in @code{scanner-scan-delay}.
+
+The scan will use the resolution configured in
+@code{scanner-resolution} with the @code{:image} key.
+
+This command interactively reads a file name.  The extension of the file
+name specifies the output file format.  If no extension is provided, the
+default image format, as configured in @code{scanner-image-format}, will
+be used.  In a multi-image scan, this command will extend the given file
+name base by @var{-number}, where @var{number} is the number of the
+scanned image.  For example, if the file name is @file{image.jpeg}, a
+multi-image scan of @var{n} images will produce the files
+@file{image-1.jpeg}, @file{image-2.jpeg} @dots{} @file{image-n.jpeg}.
+If one of these files already exists, @code{scanner-scan-image} will ask
+for confirmation to overwrite it.
+
+No post-processing with @command{unpaper} or @command{tesseract} is
+done.  See @ref{User Options} to find out how to configure scanning.
+
+This command is also available from the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Scan an image}@*
+for a single-image scan and@*
+@clicksequence{Tools @click{} Scanner @click{} Scan multiple images}@*
+for a multi-image scan.
 @end table
 
-@node Scanning Images
-@section Scanning Images
-@cindex scanning images
-@cindex images, scanning
-
-describe functions for image scanning
-
-@anchor{scanner-scan-image}
-@defun scanner-scan-image nscans filename
-@end defun
-
-
 
 @node User Options
 @chapter User Options
@@ -209,43 +278,85 @@ between Emacs sessions.  These functions are also available from the
 Scanner menu (@clicksequence{Tools @click{} Scanner}).
 
 @menu
+* Configuration Commands::
 * General Options::
-* Options for scanimage::
-* Options for unpaper::
-* Options for tesseract::
+* Configuring scanimage::
+* Configuring unpaper::
+* Configuring tesseract::
 @end menu
 
-@node General Options
-@section General Options
-@cindex general options
+@node Configuration Commands
+@section Configuration Commands
+@cindex configuration commands
+
+The following commands help you configure some of the frequently used
+options.  They only change the options for the running session; if you
+want to permanently set an option, so it will be remembered between
+Emacs sessions, use the customization interface.
 
 @table @kbd
 @item M-x scanner-set-image-resolution
 @item M-x scanner-set-document-resolution
-These commands interactively ask for a resolution (in @acronym{DPI,
+These commands interactively ask for a resolution (in @acronym{DPI,
 dots per inch}) to be used in subsequent image and document scans,
-respectively.  Note that these commands set @code{scanner-resolution}
-directly, but don't change the saved customization value, that is, a
-resolution set with these commands will not be remembered between
-Emacs sessions.
+respectively.  The corresponding user option is
+@code{scanner-resolution}.
 
 These commands are available in the Scanner menu as@*
 @clicksequence{Tools @click{} Scanner @click{} Select image
 resolution}@*
 and@*
 @clicksequence{Tools @click{} Scanner @click{} Select
-document resolution}
+document resolution}.
 
 @item M-x scanner-select-papersize
+Select a paper size from @code{scanner-paper-sizes} or
+@code{:whatever}.  See also @code{scanner-doc-papersize}.
+
+This command is available in the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Select paper size}.
+
+@item M-x scanner-select-image-size
+Select an image size.  This command interactively reads x and y
+dimensions in millimeters from the minibuffer and sets
+@code{scanner-image-size} accordingly.
+
+This command is also available in the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Select image size}.
+
+@item M-x scanner-select-outputs
+Select the document outputs.  This command reads a list of document
+output formats.  See also @code{scanner-tesseract-outputs}.
+
+This command is also available in the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Select document outputs}.
+
+@item M-x scanner-select-languages
+Select the languages assumed for OCR.  This command reads a list of
+languages used for OCR.  The necessary @command{tesseract} data files
+must be available.  See @code{scanner-tesseract-languages}.
+
+This command is also available in the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Select OCR languages}.
 
 @item M-x scanner-select-device
-Select a device, possibly triggering auto-detection.
+@itemx C-u M-x scanner-select-device
+Select a device, possibly triggering auto-detection.  Normally, manual
+device selection is not necessary as @command{scanimage} will
+auto-detect.  However, if you have multiple devices and want to change
+between them, you can use this command to do so.
+
+When called with a prefix argument, auto-detection is forced even when
+devices have already been detected before.
 
-@item C-u M-x scanner-select-device
-Select a device, forcing auto-detection.
+This command is also available in the Scanner menu as@*
+@clicksequence{Tools @click{} Scanner @click{} Select scanning device}.
 @end table
 
 
+@node General Options
+@section General Options
+@cindex general options
 
 @defopt scanner-resolution
 This option specifies the resolution in DPI used for image and
@@ -255,12 +366,44 @@ document scans as a property list with the keys @code{:image} and
 (:image 600 :doc 300)
 @end lisp
 The available resolutions depend on your device.
+
+This option can be set per-session with the commands
+@code{scanner-set-image-resolution} and
+@code{scanner-set-document-resolution}.
 @end defopt
 
 @defopt scanner-paper-sizes
+This option holds paper sizes for document scans as a property list with
+the name of the page format as the key (e.g. @code{:a4}) and a list of
+width/height pairs in millimeters as value.  The default is:
+@lisp
+(:a3
+ (297 420)
+ :a4
+ (210 297)
+ :a5
+ (148 210)
+ :a6
+ (105 148)
+ :tabloid
+ (279.4 431.8)
+ :legal
+ (215.9 355.6)
+ :letter
+ (215.9 279.4))
+@end lisp
 @end defopt
 
+@anchor{scanner-doc-papersize}
 @defopt scanner-doc-papersize
+Use this option to select the paper size for the document scans.  The
+value must be one of the keys from @code{scanner-paper-sizes}, or the
+special value @code{:whatever} that lets @command{scanimage} select the
+paper size (usually the available scan area).  The default is
+@code{:a4}.
+
+This option can be set per-session with the command
+@code{scanner-select-papersize}.
 @end defopt
 
 @defopt scanner-image-size
@@ -270,12 +413,12 @@ and height values in millimeters.  The default is
 (200 250)
 @end lisp
 for an image of 200@dmn{mm} width and 250@dmn{mm} height.  If set to
-nil, the size is determined by scanimage (usually the available scan
+nil, the size is determined by @command{scanimage} (usually the available scan
 area.)
-@end defopt
 
-@deffn {Interactive Command} scanner-select-image-size x y
-@end deffn
+This option can be set per-session with the command
+@code{scanner-select-image-size}.
+@end defopt
 
 @defopt scanner-scan-delay
 This option specifies the delay in seconds to wait between pages in a
@@ -284,43 +427,29 @@ the next sheet to your scanner before it starts scanning the next
 page.  The default is 3.
 @end defopt
 
-@defopt scanner-device-name
-The device name of the scanner as reported by scanimage.  The default
-is nil, which prompts Scanner to use scanimage for automatic
-detection.  The detected device will be stored in this variable and
-used for all subsequent scans, until a new detection is forced either
-by calling @code{scanner-select-device} with a prefix argument, or by
-this device becoming unavailable.
-
-Usually you need not customize this option as auto-detection should
-work just fine.
-@end defopt
-
 @defopt scanner-reverse-pages
 This option, when set to t, causes Scanner to reverse the order of the
 scanned pages in a document scan.  The default is nil.
 @end defopt
 
-@node Options for scanimage
-@section Options for scanimage
-@cindex options for scanimage
+@node Configuring scanimage
+@section Configuring scanimage
+@cindex configuring scanimage
 
-Some of the options scanimage accepts (and Scanner uses) are
-device-dependent.  These options are marked as @emph{device-dependent}
-in their definitions.  To find out which options your scanner hardware
+Some of the options @command{scanimage} accepts (and Scanner uses) are
+device-dependent.  To find out which options your scanner hardware
 offers, run @command{scanimage --help} with your scanner plugged in.
 This incantation should print a list of general and device-dependent
 options.
 
 @defopt scanner-scanimage-program
-This option specifies the path of scanimage.  The default is given by
+This option specifies the path of @command{scanimage}.  The default is given by
 @lisp
 (executable-find "scanimage")
 @end lisp
 @end defopt
 
 @defopt scanner-scan-mode
-@emph{Device-dependent}@*
 This option specifies the scan modes for document and image scans.  It
 is a property list with the keys @code{:image} and @code{:doc}, for
 images and documents, respectively, and strings naming the scan modes
@@ -339,9 +468,9 @@ choose either ``Gray'' or ``Color''.
 @end defopt
 
 @defopt scanner-image-format
-This option sets the default format used by scanimage for image and
+This option sets the default format used by @command{scanimage} for image and
 document scans.  It is a property list similar to
-@var{scanner-scan-mode}.  For example, the default
+@code{scanner-scan-mode}.  For example, the default
 @lisp
 (:image "jpeg" :doc "pnm")
 @end lisp
@@ -349,10 +478,10 @@ configures Scanner to use the JPEG format for image scans and the PNM
 format for document scans.  While document scans will always use the
 format specified with this option, you can override the format used in
 image scans with the appropriate file extension, see
-@ref{scanner-scan-image}.
+@ref{Scanning Documents and Images}.
 
-The supported formats are documented in the scanimage manual page.
-For example, version 1.0.31 of scanimage supports PNM, TIFF, PNG and
+The supported formats are documented in the @command{scanimage} manual page.
+For example, version 1.0.31 of @command{scanimage} supports PNM, TIFF, PNG and
 JPEG. 
 
 Note that the document scan format specified with this option is an
@@ -360,15 +489,27 @@ intermediate format, not the document format generated at the end of
 the whole process.  With the PNM format used in the example above, you
 can still have a PDF output, see @ref{scanner-tesseract-outputs}.
 
-If you use unpaper for post-processing before OCR in document scans,
-the format will silently be forced to PNM, as this is required by
-unpaper.
+If you use @command{unpaper} for post-processing before OCR in document
+scans (@pxref{Configuring unpaper}), the format will silently be forced
+to PNM, as this is required by @command{unpaper}.
+@end defopt
+
+@defopt scanner-device-name
+The device name of the scanner as reported by @command{scanimage}.  The default
+is nil, which prompts Scanner to use @command{scanimage} for automatic
+detection.  The detected device will be stored in this variable and
+used for all subsequent scans, until a new detection is forced either
+by calling @code{scanner-select-device} with a prefix argument, or by
+this device becoming unavailable.
+
+Usually you need not customize this option as auto-detection should
+work just fine.
 @end defopt
 
 @defopt scanner-scanimage-switches
-You may find that additional switches to scanimage not covered by any
+You may find that additional switches to @command{scanimage} not covered by any
 of the above user options are necessary.  You can use
-@var{scanner-scanimage-switches} for these.  Specify the switches as a
+@code{scanner-scanimage-switches} for these.  Specify the switches as a
 list of switch/value pairs, such as:
 @lisp
 ("--switch1" "value1" "-s" "2")
@@ -385,9 +526,9 @@ are device-dependent.
 @end defopt
 
 
-@node Options for unpaper
-@section Options for unpaper
-@cindex options for unpaper
+@node Configuring unpaper
+@section Configuring unpaper
+@cindex configuring unpaper
 
 @defopt scanner-unpaper-program
 @end defopt
@@ -422,24 +563,61 @@ are device-dependent.
 @defopt scanner-unpaper-switches
 @end defopt
 
-@node Options for tesseract
-@section Options for tesseract
-@cindex options for tesseract
+@node Configuring tesseract
+@section Configuring tesseract
+@cindex configuring tesseract
 
 @defopt scanner-tessdata-dir
+This option specifies the @file{tessdata} directory.  This directory is
+supposed to contain the language data files for @command{tesseract}.
+The default is @file{/usr/share/tessdata/}.
 @end defopt
 
 @defopt scanner-tesseract-configdir
+This option specifies the @command{tesseract} @file{configs} directory.
+This directory is supposed to contain the output configuration files for
+@command{tesseract}.  The default is
+@file{/usr/share/tessdata/configs/}.
 @end defopt
 
 @defopt scanner-tesseract-languages
+This option lists the languages passed to @command{tesseract} as a list
+of strings.  The default is:
+@lisp
+("eng")
+@end lisp
+It is possible to pass more than one language to @command{tesseract},
+which can be useful if you have a multi-language document.  For
+instance,
+@lisp
+("eng" "deu")
+@end lisp
+sets @command{tesseract} up for recognizing English and German.
+However, for single-language documents, the best results are usually
+obtained when setting only one language.
+
+This option can be set per-session with the command
+@code{scanner-select-languages}.
 @end defopt
 
 @anchor{scanner-tesseract-outputs}
 @defopt scanner-tesseract-outputs
+This option lists the output formats to produce.  The available output
+formats are provided as configuration files in the
+@file{/usr/share/tessdata/configs/} directory.  The default
+@lisp
+("pdf" "txt")
+@end lisp
+causes @command{tesseract} to output both a PDF and a text file.
+
+This option can be set per-session with the command
+@code{scanner-select-outputs}.
 @end defopt
 
 @defopt scanner-tesseract-switches
+You can use this option to specify any additional switches for
+@command{tesseract} not covered by the above options.  Use the same
+format as for @code{scanner-scanimage-switches}.  The default is nil.
 @end defopt
 
 @node Improving Scan Quality
@@ -450,6 +628,33 @@ are device-dependent.
 
 include tips for improving quality, ask for contributions
 
+@itemize
+@item
+tesseract v4
+@item
+see @url{https://tesseract-ocr.github.io/tessdoc/} for tesseract documentation
+@item
+set left and top margin in scanimage
+@item
+use brightness and contrast for scanimage
+@item
+use at least 300 DPI for OCR, but more than 600 only increases document
+size without improving OCR
+@item
+use the right language, not multiple languages for a single-language document
+@item
+use unpaper to deskew the page
+@item
+use unpaper deskewing options
+@item
+use unpaper middle wipe etc.
+@item
+have a look at
+@url{https://github.com/unpaper/unpaper/blob/main/doc/basic-concepts.md}
+and
+@url{https://github.com/unpaper/unpaper/blob/main/doc/image-processing.md}
+for detailed information on how to use unpaper
+@end itemize
 
 
 @node Reporting Bugs
@@ -458,10 +663,6 @@ include tips for improving quality, ask for contributions
 Refer to @uref{https://www.gitlab.com/rstocker/scanner/}
 mention *Scanner* log buffer
 
-@node Hacking
-@chapter Hacking
-
-document internal functions and how they are used?
 
 @node GNU Free Documentation License
 @chapter GNU Free Documentation License

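For readers who want to try the options documented in this patch, here is a minimal init-file sketch, not the package's recommended setup. It uses only option names and values that appear in the documentation above. Setting them with setq assumes the options have no custom setter, and the two directory values are simply the documented defaults, which may differ on your system.

```elisp
;; Sketch only: values are taken from the defaults and examples in the
;; manual text added by this patch; adjust them for your installation.
(setq scanner-resolution '(:image 600 :doc 300)   ; DPI per scan type
      scanner-doc-papersize :a4                   ; a key from scanner-paper-sizes
      scanner-scan-delay 3                        ; seconds between pages
      scanner-tessdata-dir "/usr/share/tessdata/"
      scanner-tesseract-configdir "/usr/share/tessdata/configs/"
      scanner-tesseract-languages '("eng" "deu")  ; language data must be installed
      scanner-tesseract-outputs '("pdf" "txt"))   ; PDF plus plain text
```

With these in place, M-x scanner-scan-document scans a single page, and C-u M-x scanner-scan-document asks after each page whether to continue. To have settings remembered between Emacs sessions, the manual recommends the customization interface instead.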