There are many different ways to add OCR’d text to a PDF, though one of the 
most common is the use of “hidden text”, where the text is drawn using Text 
Render Mode 3 (invisible: neither filled nor stroked).  I don’t recall whether 
Poppler exposes this information in its public APIs, but it certainly has it 
in the graphics state internally.
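
As a rough illustration of the idea, here is a minimal sketch that does not use 
Poppler at all.  It assumes you already have a page's content stream in decoded 
(uncompressed) form, for example from a file rewritten in qpdf's QDF mode, and 
it simply counts how many text-showing operators run while the text render 
mode is 3.  The names tokenize, TextModeStats and count_text_modes are made up 
for this example.  It ignores inline images, comments and form XObjects, so 
treat the result as a hint rather than a verdict.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Split a decoded content stream into tokens. Literal strings "( ... )" are
// collapsed to a placeholder so text that happens to contain "3 Tr" is not
// mistaken for an operator. '[' and ']' are treated as separators.
std::vector<std::string> tokenize(const std::string& stream) {
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&]() {
        if (!current.empty()) { tokens.push_back(current); current.clear(); }
    };
    for (std::size_t i = 0; i < stream.size(); ++i) {
        const char c = stream[i];
        if (c == '(') {
            flush();
            int depth = 1;
            ++i;
            while (i < stream.size() && depth > 0) {
                if (stream[i] == '\\') { ++i; }            // skip escaped char
                else if (stream[i] == '(') { ++depth; }
                else if (stream[i] == ')') { --depth; }
                ++i;
            }
            --i;  // the for loop's ++i steps past the closing ')'
            tokens.push_back("(string)");
        } else if (std::isspace(static_cast<unsigned char>(c)) ||
                   c == '[' || c == ']') {
            flush();
        } else {
            current += c;
        }
    }
    flush();
    return tokens;
}

struct TextModeStats {
    int shown_visible = 0;    // text-showing operators with render mode != 3
    int shown_invisible = 0;  // text-showing operators with render mode == 3
};

// Track the text render mode ("Tr") through the stream, honouring q/Q
// save/restore, and count text-showing operators (Tj, TJ, ', ") per mode.
TextModeStats count_text_modes(const std::string& stream) {
    const std::vector<std::string> tokens = tokenize(stream);
    TextModeStats stats;
    std::vector<int> saved;  // render modes saved by "q"
    int mode = 0;            // the initial render mode is 0 (fill)
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const std::string& t = tokens[i];
        if (t == "Tr" && i > 0) {
            mode = std::atoi(tokens[i - 1].c_str());
        } else if (t == "q") {
            saved.push_back(mode);
        } else if (t == "Q" && !saved.empty()) {
            mode = saved.back();
            saved.pop_back();
        } else if (t == "Tj" || t == "TJ" || t == "'" || t == "\"") {
            if (mode == 3) { ++stats.shown_invisible; }
            else { ++stats.shown_visible; }
        }
    }
    return stats;
}
```

A page where shown_invisible is non-zero and shown_visible is zero, sitting on 
top of a full-page image, is a reasonable candidate for "OCR layer added by a 
scanner", but that threshold is a judgment call and not something the PDF 
format itself defines.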

Leonard

From: poppler <[email protected]> on behalf of Stéphane 
Charette <[email protected]>
Date: Friday, October 14, 2022 at 2:54 PM
To: [email protected] <[email protected]>
Subject: [poppler] getting the text from PDF files

Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files.  Works well.

doc->create_page(idx) to get the page, then page->text_list() to get all the 
boxes.  PDFs seem to either have text, or if it was a scan then I have an image 
with no text, and I fall back to other techniques to read what I need.

But...!  Some fax machines and business scanners try to do OCR and embed the 
text results into the PDF.  The quality of the OCR is poor, but when I attempt 
to extract the text, I do get back the expected text boxes, which leads me down 
the wrong path.

Is there anything in the way the text was added to the PDF that I can use as a 
hint that the text was added after the fact, and not as part of the original 
PDF creation process?  Something I can use to determine whether the text can 
be trusted?  I have been reading up on things like xref tables to understand 
the internals of PDF files, so I can try to find a pattern between my "good" 
and "problematic" PDF files.  I wondered whether there is a way to see if the 
text is part of the page itself or was tacked on afterwards.

Thanks,

Stéphane

--
Stéphane Charette
about.me/stephane.charette
