On Thu, 20 Jan 2022, Curt wrote:
> On 2022-01-20, Siard <shi...@mailbox.org> wrote:
> > Bob Bernstein wrote:
> > > Executing 'apt-cache search tesseract' brings up a multitude of
> > > packages.
> > >
> > > My need is simple enough, I think: I like to scan (using an
> > > Epson scanner) pages of printed books -- almost one hundred per
> > > cent text -- and then use OCR to produce pages from which I can
> > > copy 'n paste snippets of text for note-taking purposes.
> > >
> > > What do the assembled multitudes suggest for a tesseract package
> > > (that's the OCR I've been encouraged to use) on my bullseye
> > > system, ...
> >
> > Once you have a PDF containing the images (img2pdf may be used for
> > that), I think the cleverest way is to use ocrmypdf.
> > It adds an OCR text layer to the PDF file, so the PDF text becomes
> > selectable and can be copied.
> > It uses the Tesseract OCR engine.
> >
> > $ ocrmypdf -f inputfile.pdf outputfile.pdf
>
> ocrmypdf has quite a few dependencies on my machine.
>
> The multitude of packages corresponds more or less to the multiple
> languages of the human multitude. I guess the OP's working in English
> ('tesseract-ocr-eng', pulled in with all the others here when installing
> the above).

With tesseract and one tesseract language package already installed, installing ocrmypdf does not pull in more of them. At least, that's what I see on my machine.
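
For what it's worth, here is a minimal sketch of the two-step workflow Siard describes (img2pdf first, then ocrmypdf), assuming English text and that img2pdf, ocrmypdf and tesseract-ocr-eng are installed. The helper name print_ocr_steps, the intermediate file scans.pdf and the page and output file names are all made up for illustration. The function only prints the commands so they can be checked before anything is run.

```shell
#!/bin/sh
# Sketch only: the helper name and all file names are hypothetical.

# print_ocr_steps OUTPUT.pdf IMAGE...
# Prints the two commands of the workflow instead of running them.
print_ocr_steps() {
    out=$1
    shift
    # Step 1: img2pdf bundles the scanned images into a single PDF.
    echo "img2pdf $* -o scans.pdf"
    # Step 2: ocrmypdf adds a selectable text layer; -l eng selects
    # the English language data (Debian package tesseract-ocr-eng).
    echo "ocrmypdf -l eng scans.pdf $out"
}

print_ocr_steps book-ocr.pdf page-001.png page-002.png
```

If the printed commands look right, they can be pasted into a shell or piped to sh, provided the file names contain no spaces.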