Hi,

On Fri, Aug 25, 2017 at 5:58 PM, Adrian Johnson <[email protected]> wrote:
> On 26/08/17 02:47, Alex wrote:
>> Hi,
>> I'm attempting to use pdftohtml and pdftotext on fedora25
>> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
>> extract the text from a particular PDF I need.
>>
>> I'm trying to use the poppler-utils to work with a spamassassin plugin
>> to extract text from PDFs that may be malicious. Here is one such
>> example:
>>
>> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
>>
>> It appears to extract the header information (author, date, etc) but
>> no text from within the PDF.
>
> That's because there is no text in the PDF.
>
> Here's the content stream ...
> The only text is a single space character. The rest is an image. There
> is also a link annotation. Maybe we could add an option to pdfinfo to
> list the annotations in the file and for link annotations show the URL.

Yes, I thought that might have been the problem, but the URL is what I
was specifically talking about. That signature alone might be very
helpful for identifying these malicious PDFs.

Does the presence of a single space character with an image sound like
a unique pattern, and if so, how can I encapsulate that into something
I can trigger on? In other words, even something like an exit code or
other indication that it's the only real content would be helpful.

Thanks,
Alex
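For what it's worth, here is a rough sketch of the kind of check I have in mind: a small Python wrapper that reports, through its exit status, whether a PDF has only whitespace text but at least one image. It is not an existing poppler feature. The function names are made up for illustration, and it assumes that `pdftotext FILE -` writes the text to stdout and that `pdfimages -list FILE` prints a header row and a dashed separator row before one row per image.

```python
import subprocess
import sys


def has_only_whitespace(text):
    """True when the extracted text has no visible characters.

    str.strip() also removes the form feeds that separate pages.
    """
    return text.strip() == ""


def count_listed_images(listing):
    """Count image rows in `pdfimages -list` output.

    Assumption: the first two non-empty lines are the column header
    and a dashed separator; every later line describes one image.
    """
    rows = [line for line in listing.splitlines() if line.strip()]
    return max(len(rows) - 2, 0)


def looks_image_only(text, listing):
    """True for the pattern discussed: no real text, at least one image."""
    return has_only_whitespace(text) and count_listed_images(listing) > 0


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=True,
    ).stdout


if __name__ == "__main__" and len(sys.argv) == 2:
    path = sys.argv[1]
    text = run(["pdftotext", path, "-"])
    listing = run(["pdfimages", "-list", path])
    # Exit 0 when the pattern matches, 1 otherwise (grep-style).
    sys.exit(0 if looks_image_only(text, listing) else 1)
```

This says nothing about the link annotation or its URL, which as far as I can tell the current tools do not report, so on its own it would probably match ordinary scanned documents as well.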
