Hello everybody, I'm using PDFBox to extract some specific text from a PDF file. In particular, I'm trying to detect the book title, the authors, and the bibliographic entries (the references). The PDF file was generated with pdftex.
Extracting the raw text doesn't help much, as it carries no structural information. I was therefore trying to browse the document structure, access the COS objects, and get the text values through them. This may only work for the title and the authors, each of which might be written in a separate paragraph. However, I'm getting a bit confused about how feasible this approach really is, and about how to use the documentTreeStructure and the COSDictionary.

Has anybody ever faced or solved this problem? Any comments, suggestions, or pointers to examples? The examples in the distribution don't seem to cover this aspect fully, though perhaps I am wrong.

Many thanks,
Dan
