On Thu, 20 Jan 2022, Curt wrote:
> On 2022-01-20, Siard <shi...@mailbox.org> wrote:
> > Bob Bernstein wrote:
> > > Executing 'apt-cache search tesseract' brings up a multitude of 
> > > packages.
> > >
> > > My need is simple enough, I think: I like to scan (using an 
> > > Epson scanner) pages of printed books -- almost one hundred per 
> > > cent text -- and then use OCR to produce pages from which I can 
> > > copy 'n paste snippets of text for note-taking purposes.
> > >
> > > What do the assembled multitudes suggest for a tesseract package 
> > > (that's the OCR I've been encouraged to use) on my bullseye 
> > > system, ...
> >
> > Once you have a PDF containing the images (img2pdf may be used for
> > that), I think the cleverest way is to use ocrmypdf.
> > It adds an OCR text layer to the PDF file, so the PDF text becomes
> > selectable and can be copied.
> > It uses the Tesseract OCR engine.
> >
> > $ ocrmypdf -f inputfile.pdf outputfile.pdf
>
> ocrmypdf has quite a few dependencies on my machine.
> 
> The multitude of packages corresponds more or less to the multiple
> languages of the human multitude. I guess the OP's working in English
> ('tesseract-ocr-eng', pulled in with all the others here when installing
> the above).

With tesseract and one tesseract language package already installed,
installing ocrmypdf does not pull in more of them. At least, that's what
I see on my machine.
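
For reference, the workflow suggested above (img2pdf, then ocrmypdf) could
be wrapped in a small shell function. This is only a sketch: the function
name, the file names, the intermediate ".images.pdf" file and the default
language code "eng" are my own assumptions, not something from the thread.

```shell
#!/bin/sh
# Sketch: combine scanned page images into one PDF, then add an OCR text layer.
# The function name, file names and the "eng" default are illustrative assumptions.

scans_to_searchable_pdf() {
    out="$1"; shift
    # Intermediate image-only PDF, named after the output file (assumption).
    tmp="${out%.pdf}.images.pdf"
    # img2pdf wraps the images in a PDF container; -o names the output file.
    img2pdf "$@" -o "$tmp" || return 1
    # ocrmypdf adds a selectable text layer; -l picks the Tesseract language.
    ocrmypdf -l "${OCR_LANG:-eng}" "$tmp" "$out"
}

# Example call (hypothetical file names):
#   scans_to_searchable_pdf book.pdf page-01.png page-02.png
```

Running 'tesseract --list-langs' beforehand shows which language data is
actually installed, which is what the -l option has to match.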
