There are many different ways to add OCR’d text to a PDF, though one of the most common is the use of “hidden text”, where the text is drawn using Text Render Mode 3 (neither filled nor stroked, so it is invisible on the page). I don’t recall if Poppler exposes this information in its public APIs, but it certainly has it in the graphics state internally.
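As a rough illustration of the hidden-text idea, here is a minimal sketch that does not use Poppler at all. It assumes you have already obtained a page's content stream in decoded (uncompressed) form by some other tool, and it simply counts text-showing operators (`Tj`, `TJ`, `'`, `"`) drawn while the render mode set by `Tr` is 3. The names `RenderModeStats` and `scan_content_stream` are made up for this example. The tokenizer is deliberately simplistic: it tracks `q`/`Q` but does not handle inline images or Form XObjects, so treat the result as a hint rather than a verdict.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type: counts of text-showing operators by visibility.
struct RenderModeStats {
    int hidden_show_ops = 0;   // Tj/TJ/'/" seen while the render mode is 3
    int visible_show_ops = 0;  // Tj/TJ/'/" seen in any other render mode
};

// Scan one *decoded* content stream (assumption: already decompressed).
RenderModeStats scan_content_stream(const std::string& s) {
    RenderModeStats stats;
    int mode = 0;                 // text render mode defaults to 0 (fill)
    std::vector<int> saved;       // render modes saved by q, restored by Q
    std::string prev, tok;

    // Handle one complete token, remembering the previous one as the operand.
    auto flush = [&]() {
        if (tok.empty()) return;
        if (tok == "Tr") {
            if (prev.size() == 1 && prev[0] >= '0' && prev[0] <= '7')
                mode = prev[0] - '0';
        } else if (tok == "q") {
            saved.push_back(mode);
        } else if (tok == "Q") {
            if (!saved.empty()) { mode = saved.back(); saved.pop_back(); }
        } else if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") {
            if (mode == 3) ++stats.hidden_show_ops;
            else ++stats.visible_show_ops;
        }
        prev = tok;
        tok.clear();
    };

    for (std::size_t i = 0; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '(') {
            // Literal string: skip to the matching ')', honouring escapes.
            flush();
            std::size_t j = i + 1;
            for (int depth = 1; j < s.size() && depth > 0; ++j) {
                if (s[j] == '\\') ++j;
                else if (s[j] == '(') ++depth;
                else if (s[j] == ')') --depth;
            }
            i = j - 1;
        } else if (c == '<') {
            flush();
            if (i + 1 < s.size() && s[i + 1] == '<') ++i;       // "<<" opener
            else while (i < s.size() && s[i] != '>') ++i;       // hex string
        } else if (c == '%') {
            // Comment: skip to end of line.
            flush();
            while (i < s.size() && s[i] != '\n' && s[i] != '\r') ++i;
        } else if (std::isspace(static_cast<unsigned char>(c)) ||
                   c == '[' || c == ']' || c == '>' || c == '{' || c == '}') {
            flush();
        } else if (c == '/') {
            flush();              // a name such as /F1 starts a new token
            tok.push_back(c);
        } else {
            tok.push_back(c);
        }
    }
    flush();
    return stats;
}
```

If a page contains a full-page image and every text-showing operator falls in `hidden_show_ops`, that is at least consistent with a scan carrying an OCR layer. Not every OCR tool writes its text this way, so an empty `hidden_show_ops` count does not prove the text is original.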
Leonard

From: poppler <[email protected]> on behalf of Stéphane Charette <[email protected]>
Date: Friday, October 14, 2022 at 2:54 PM
To: [email protected] <[email protected]>
Subject: [poppler] getting the text from PDF files

Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files. Works well. I call doc->create_page(idx) to get the page, then page->text_list() to get all the boxes.

PDFs seem to either have text, or, if the PDF was a scan, an image with no text, in which case I fall back to other techniques to read what I need.

But...! Some fax machines and business scanners try to do OCR and embed the text results into the PDF. The quality of the OCR is poor, but when I attempt to extract the text, I do get back the expected text boxes, which leads me down the wrong path.

Is there anything in the way the text was added to the PDF that I can use as a hint that the text was added after the fact, and not as part of the original PDF creation process? Something I can use to determine whether the text can be trusted?

I'm reading up on things like xref tables to get an understanding of the internals of PDF files so I can attempt to find a pattern between my "good" and "problematic" PDF files. I wondered if there was a way to see if the text is part of the page itself, or if it was tacked on afterwards.

Thanks,
Stéphane

--
Stéphane Charette
about.me/stephane.charette
