Actually, PDF does have a concept for marking content semantically: Tagged PDF. But for the extraction to profit from it, the input PDF would have to contain tags in the first place (which most PDFs don't). AFAIK, PDFBox doesn't support Tagged PDF yet.
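
Since untagged PDFs carry no footer marker, one possible workaround is to post-process the extracted text yourself. The following is only a rough sketch of such a heuristic, not a PDFBox feature: it assumes you already have the text of each page as a separate string, and it drops the last non-blank line of a page when the same line (with digits masked) ends more than half of the pages. The class name `FooterFilter` and its methods are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: a plain-text heuristic only.
class FooterFilter {

    // Mask digit runs so "Page 3 of 10" and "Page 4 of 10" compare equal.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Last non-blank line of a page's text, or null if the page is blank.
    static String lastLine(String pageText) {
        String[] lines = pageText.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--) {
            if (!lines[i].trim().isEmpty()) {
                return lines[i];
            }
        }
        return null;
    }

    // Remove the last line of a page if its normalized form ends
    // more than half of all pages; other pages are returned unchanged.
    static List<String> stripRepeatedFooter(List<String> pages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String page : pages) {
            String last = lastLine(page);
            if (last != null) {
                counts.merge(normalize(last), 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (String page : pages) {
            String last = lastLine(page);
            if (last != null && counts.get(normalize(last)) * 2 > pages.size()) {
                // Cut at the start of the footer line, then trim trailing whitespace.
                String body = page.substring(0, page.lastIndexOf(last));
                result.add(body.replaceAll("\\s+$", ""));
            } else {
                result.add(page);
            }
        }
        return result;
    }
}
```

The per-page strings could come from calling PDFTextStripper once per page with setStartPage and setEndPage set to the same page number. Being a guess based on repetition, this will miss footers that vary from page to page and may remove a genuine last line that happens to repeat, so check the output against your documents.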

On 19.12.2008 17:31:26 Andreas Lehmkühler wrote:
> Hi Lars
>
> > Is it possible to configure a PDFTextStripper instance so that it does
> > not include page footer text and page numbers in the extracted text?
> As far as I know, there are no special commands for the page footer,
> header or numbers. Consequently it is impossible to determine these parts
> of a page and of course impossible to exclude them.
>
> BR
> Andreas

Jeremias Maerki
