Daniele Development-ML wrote:
Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, author, and the
bibliographic entries (the references) - the PDF file is generated with the
pdftex command.

Extracting the raw text doesn't help much, as it carries no structural
information. I was therefore trying to browse the document structure, access
the COS objects, and get the text values through them. This may only work
for the title and the authors, which might each be written in a separate
paragraph.

However, I'm getting a bit confused about the real feasibility of this approach
and about the use of the documentTreeStructure and the COSDictionary.

Has anybody ever faced/solved this problem?
Any comments or suggestions, or pointers to examples? The examples in the
distro seem not to cover this aspect fully, or perhaps I am wrong.

Many thanks,

Dan

Hi Dan!
I don't think you can extract the title, the author, or any other "specific" text from what the PDF actually displays, and the format is not meant to support that. The content of a PDF page does not record whether a piece of text is a title, an author, and so on. If I understand correctly, you want to take the text of the first paragraph as the title and the text of the next paragraph as the author. That is not very feasible either because, again, the page content has no notion of a paragraph. For a title "My Title", the page content may just say something like: display "My Title" at point x,y. For PDFs generated by pdftex, the situation is even worse. To achieve high-quality typesetting, TeX/LaTeX places text in a very complex way. For example, you could find that your title "My Title" is specified as follows in the PDF's content:
display "M" at position x1, y1
display "y" at position x2, y2
etc
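
To show what that means for anyone reading the content back, here is a minimal sketch in plain Java (no PDFBox involved). The Glyph class, the joinLine method and the spaceGap threshold are all hypothetical names made up for illustration, and the sketch assumes the glyphs already sit on the same line. Even then, word boundaries have to be guessed from the gaps between letters, and nothing in the data says that the result is a title.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GlyphLineSketch {
    /** Hypothetical stand-in for one 'display c at position x, y' instruction. */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the glyph
        final double y;      // baseline (not used here: one line is assumed)
        final double width;  // advance width of the glyph

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds one line of text from individually placed glyphs.
     * A space is guessed wherever the horizontal gap between two
     * neighbouring glyphs is larger than spaceGap.
     */
    static String joinLine(List<Glyph> glyphs, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x)); // left to right

        StringBuilder line = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                line.append(' '); // gap wide enough to count as a word break
            }
            line.append(g.ch);
            prev = g;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates; the glyphs are deliberately out of order.
        List<Glyph> glyphs = Arrays.asList(
            new Glyph('y', 109, 700, 5), new Glyph('M', 100, 700, 9),
            new Glyph('T', 118, 700, 6), new Glyph('i', 124, 700, 3),
            new Glyph('t', 127, 700, 3), new Glyph('l', 130, 700, 3),
            new Glyph('e', 133, 700, 5));
        System.out.println(joinLine(glyphs, 2.0));
    }
}
```

The threshold is a heuristic: with kerning or justified text a fixed gap can easily split or merge words, which is part of why this kind of reconstruction is fragile.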

Your best hope is to get hold of the PDDocumentInformation object (by calling getDocumentInformation() on a PDDocument object), which represents the Info dictionary referenced from the trailer of the PDF file. This can contain the title and author of the PDF file, and it is also the appropriate place to store such information in a PDF. However, I doubt that this information is included in the PDF you are working with: it is "meta information" that is not displayed when viewing the file, so people often don't bother to set it when producing the file. In the case of pdftex, one typically has to use the hyperref package and explicitly specify the title and author with \hypersetup in order to produce a PDF with that "meta information".
Sorry for the lengthy explanation, I'm just trying to make it clear :-)
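
Reading the Info dictionary could look roughly like the sketch below. It assumes the PDFBox 2.x API (the org.apache.pdfbox package and PDDocument.load(File)); other releases use different package names or loading calls, so adjust it to the version you have. The class name InfoDictionaryDump and the describe helper are made up for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class InfoDictionaryDump {
    /** Formats the title and author entries, which may both be absent. */
    static String describe(PDDocumentInformation info) {
        String title = info.getTitle();   // null if the Info dictionary has no Title
        String author = info.getAuthor(); // null if the Info dictionary has no Author
        return "title=" + (title == null ? "<not set>" : title)
             + ", author=" + (author == null ? "<not set>" : author);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: PDFBox 2.x, where PDDocument.load(File) opens the file.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            System.out.println(describe(doc.getDocumentInformation()));
        }
    }
}
```

If both fields come back as not set, the file simply carries no such metadata and there is nothing more to recover from this route.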

Cheers,
Thach

