Re: Can PDFTextStripper be configured to skip page footers, page numbers, etc?

Jeremias Maerki Mon, 22 Dec 2008 07:58:45 -0800

Actually, PDF does have a concept for marking content semantically. It's
called Tagged PDF. But for text extraction to profit from that, the input
PDF would have to be tagged in the first place (which most PDFs aren't).
AFAIK, PDFBox doesn't support Tagged PDF yet.
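
Without tags, about the only thing left is a heuristic applied after
extraction. Below is a minimal sketch, not a PDFBox feature: it assumes
you have already extracted the text of each page separately (for
instance by restricting the stripper to one page at a time with
setStartPage/setEndPage) and collected it in a list. The class
FooterFilter and its methods are hypothetical names made up for this
illustration. It drops the last non-blank line of a page if that line is
a bare number, or if the same line (ignoring digits) closes more than
half of the pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): heuristic footer removal on per-page text. */
final class FooterFilter {

    /** Replaces digit runs with '#' so "Page 3" and "Page 4" compare as equal. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /** Returns the index of the last non-blank line, or -1 if there is none. */
    static int lastNonBlank(List<String> lines) {
        for (int i = lines.size() - 1; i >= 0; i--) {
            if (!lines.get(i).trim().isEmpty()) {
                return i;
            }
        }
        return -1;
    }

    /** Takes one string per page and returns the pages with suspected footers removed. */
    static List<String> stripFooters(List<String> pages) {
        List<List<String>> split = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>();

        // First pass: count how often each normalized closing line occurs.
        for (String page : pages) {
            List<String> lines = new ArrayList<>(Arrays.asList(page.split("\\r?\\n")));
            split.add(lines);
            int idx = lastNonBlank(lines);
            if (idx >= 0) {
                counts.merge(normalize(lines.get(idx)), 1, Integer::sum);
            }
        }

        // Second pass: drop closing lines that look like page numbers or repeated footers.
        List<String> result = new ArrayList<>();
        for (List<String> lines : split) {
            int idx = lastNonBlank(lines);
            if (idx >= 0) {
                String key = normalize(lines.get(idx));
                int n = counts.get(key);
                boolean bareNumber = key.equals("#");
                boolean repeated = n >= 2 && n * 2 > pages.size();
                if (bareNumber || repeated) {
                    lines = lines.subList(0, idx);
                }
            }
            result.add(String.join("\n", lines));
        }
        return result;
    }
}
```

Being a heuristic, this will also eat a page whose body text happens to
end in a line containing only a number, and it only looks at a single
closing line, so multi-line footers would need more work.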


On 19.12.2008 17:31:26 Andreas Lehmkühler wrote:
> Hi Lars
> 
> > Is it possible to configure a PDFTextStripper instance so that it does
> > not include page footer text and page numbers in the extracted text?
> As far as I know, there are no special commands for the page footer,
> header or numbers. Consequently it is impossible to determine these parts
> of a page and of course impossible to exclude them.
> 
> BR
> Andreas




Jeremias Maerki
