On 11/20/24 3:39 AM, Lachezar Dobrev wrote:

  To the original poster 'achilles': there is no reliable way to detect whether a PDF file is the result of a document scan or has been crafted. However, Ulf Dittmer's suggestion to look for pages that contain just one big image each is (probably) the best option.

I do roughly the opposite: extract the text, and if there is more than a certain amount of it, treat the file as a true PDF rather than a scan. (Actually, I do what the text extractor does, but bail out as soon as I hit the threshold amount of text. You could also bail out once you see a certain number of pages with no text.)
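
Roughly, the idea looks like the sketch below. This is not my actual code, just an illustration in Python. The function name looks_scanned, the page_texts argument, and both threshold values are made up for the example; page_texts stands for whatever per-page text your extractor gives you, ideally as a lazy generator so extraction really stops early.

```python
from typing import Iterable


def looks_scanned(page_texts: Iterable[str],
                  min_chars: int = 200,
                  max_empty_pages: int = 5) -> bool:
    """Return True if the document is probably a scan, False if it has real text.

    page_texts yields the extracted text of one page at a time (hypothetical
    input; wrap your PDF library's per-page extraction in a generator).
    min_chars and max_empty_pages are illustrative thresholds, not tuned values.
    """
    total_chars = 0
    empty_pages = 0
    for text in page_texts:
        # Count only non-whitespace characters so blank layout is not text.
        chars = sum(1 for c in text if not c.isspace())
        if chars == 0:
            empty_pages += 1
            if empty_pages >= max_empty_pages:
                return True   # enough textless pages: treat as scanned
        total_chars += chars
        if total_chars >= min_chars:
            return False      # enough text: treat as a true PDF
    # Ran out of pages without ever reaching the text threshold.
    return True
```

Because the loop returns as soon as either threshold is reached, a long text-heavy document is usually decided after the first page or two. Note that a very short true PDF (less text than min_chars in total) would be classified as scanned here, so the thresholds need adjusting to your documents.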

Brian
