[poppler] pdftotext and pdftohtml and extracting text

Alex Fri, 25 Aug 2017 10:18:01 -0700

Hi,
I'm trying to use pdftohtml and pdftotext on Fedora 25
(poppler-utils-0.45.0-5.fc25.x86_64), but I can't get either tool to
extract the text from a particular PDF I need.


I'm trying to use poppler-utils with a SpamAssassin plugin to extract
text from PDFs that may be malicious. Here is one such example:

https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0

The tools appear to extract the header information (author, date, etc.)
but no text from within the PDF.

Would someone be willing to try extracting the URL from this PDF for
me? Is there a significant difference between version 0.45 and the
latest release that might affect this? Testing the latest release would
require compiling it locally here.

podofopdfinfo is able to identify the URL within the PDF, but I'm not
sure if that's helpful.
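
For anyone trying to reproduce this, here is a rough Python sketch of
the two places a URL might live. It assumes (this is a guess, not a
diagnosis of the file above) that the link could be stored as a /URI
link action rather than as visible page text, which would explain why
a text extractor reports nothing. The function names are made up for
illustration, and the raw-byte scan only sees actions that are stored
uncompressed, so an empty result proves nothing.

```python
import re
import subprocess

# Plain http(s) URLs; stops at whitespace, quotes, angle brackets and ')'.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")
# Literal-string /URI entries as they appear in uncompressed PDF objects.
URI_ACTION_RE = re.compile(rb"/URI\s*\(([^)]*)\)")


def find_urls(text):
    """Return the http(s) URLs found in a string, in order, without duplicates."""
    seen = []
    for url in URL_RE.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


def urls_from_page_text(pdf_path):
    """Run pdftotext (output file '-' means stdout) and scan the page text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return find_urls(result.stdout.decode("utf-8", errors="replace"))


def urls_from_uri_actions(pdf_bytes):
    """Scan raw PDF bytes for /URI (...) link actions.

    Assumption: the action is not inside a compressed object stream.
    This is a crude byte search, not a PDF parser.
    """
    found = [m.decode("latin-1") for m in URI_ACTION_RE.findall(pdf_bytes)]
    return find_urls("\n".join(found))
```

If urls_from_page_text() returns nothing for a file while
urls_from_uri_actions() on the same file's bytes returns a URL, that
would suggest the link is an annotation rather than page text.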

Any ideas would be greatly appreciated.