On 26/08/17 02:47, Alex wrote:
> Hi,
> I'm attempting to use pdftohtml and pdftotext on fedora25
> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
> extract the text from a particular PDF I need.
>
> I'm trying to use the poppler-utils to work with a spamassassin plugin
> to extract text from PDFs that may be malicious. Here is one such
> example:
>
> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
>
> It appears to extract the header information (author, date, etc) but
> no text from within the PDF.

That's because there is no text in the PDF. Here's the content stream:

stream
/P <</MCID 0>> BDC
BT
/F1 11.04 Tf
1 0 0 1 72.024 63.48 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
ET
EMC
/Span <</MCID 1>> BDC
q
572.04 0 0 698.52 15.96 73.92 cm
/Image10 Do
Q
EMC
endstream

The only text is a single space character. The rest is an image.

There is also a link annotation. Maybe we could add an option to pdfinfo
to list the annotations in the file and for link annotations show the URL.

>
> Would someone be interested in trying to extract the URL from within
> this PDF for me? Is there a big difference between version 0.45 and
> the latest that may affect this? It would require compiling it here
> locally.
>
> podofopdfinfo is able to identify the URL within the PDF, but I'm not
> sure if that's helpful.
>
> Any ideas greatly appreciated.
> _______________________________________________
> poppler mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/poppler
>
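
Until pdfinfo has such an option, a rough stopgap for the link annotation case might look like the sketch below. It is not how poppler or podofopdfinfo do it: it simply scans the raw file bytes for /URI entries written as literal strings. The names URI_RE and find_link_uris are made up for this example, and it assumes the annotation dictionaries are stored uncompressed; URIs inside compressed object streams or written as hex strings will be missed.

```python
import re

# Matches a URI action entry such as "/URI (http://example.com)".
# Assumption: the value is a PDF literal string without nested,
# unescaped parentheses. Hex strings (<...>) are not handled.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_link_uris(pdf_bytes):
    """Return URI strings found in uncompressed /URI entries."""
    uris = []
    for match in URI_RE.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the backslash escapes for "(", ")" and "\".
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris


if __name__ == "__main__":
    import sys

    # Usage: python find_uris.py pdf-phish.pdf
    with open(sys.argv[1], "rb") as handle:
        for uri in find_link_uris(handle.read()):
            print(uri)
```

For a spam filter this has the advantage of never rendering or interpreting the file, but anything it fails to find should not be taken as proof that the PDF has no links.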
