In one message or another, Mark Ehle said something like this:
> I am using pdftotext to extract text from PDF files in a digital newspaper 
> archive I am creating for a local public library. So far, it's working great. 
> But - I am using up a fair amount of disk space and would like to figure out a 
> way to create an OCR'd pdf from an image and the bounding box data. That way 
> I would not have to store the PDF files as well as the images. Is there a way 
> to do that?


Seems like you would want to store the PDF instead of the images. Anyway, you 
should look at Tesseract:

https://code.google.com/p/tesseract-ocr/

I haven't used it myself, but my understanding is that it can embed the OCR'd 
text into the PDF itself, allowing searching, text selection, etc. from a PDF 
viewer.
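
For what it's worth, here is an untested sketch of how that could be scripted 
from Python. It assumes a reasonably recent tesseract binary on your PATH and 
uses its "pdf" output config (tesseract IMAGE OUTPUTBASE pdf). The function 
names and file names are made up for illustration. Note that this has Tesseract 
run its own OCR on the image; as far as I know it does not take your existing 
bounding box data as input.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base):
    """Return the argv list for `tesseract IMAGE OUTPUTBASE pdf`."""
    # The trailing "pdf" selects Tesseract's PDF output config.
    # Tesseract appends ".pdf" to output_base on its own.
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def image_to_searchable_pdf(image_path, output_base):
    """Run Tesseract and return the path of the PDF it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_pdf_command(image_path, output_base), check=True)
    return Path(f"{output_base}.pdf")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(image_to_searchable_pdf("page-001.tif", "page-001"))
```

If that works for your scans, you could keep only the resulting PDFs and run 
pdftotext on them as you do now.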

-e
--
Ed Porras
[email protected]

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
