Hi there,

As I wrote to this list a couple of weeks ago, this is exactly what I am doing in my Bachelor thesis.
I don't rely on any metadata, so this is a best-effort approach.

My approach uses a custom extension of PDFTextStripper for that purpose. To get the title of a document, I simply take the lines from page one with the biggest font size. The authors usually follow directly on the next lines. It was a bit of work to get the text in the right order, but you can use the data from TextPosition to sort the text. The font size can be retrieved from there as well. For some documents the font size won't give useful results; you can use height or yscale instead, but the results tend to be less accurate than with font size. Hopefully we will see an improvement here in later versions of PDFBox.
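
A minimal sketch of the biggest-font-size idea, not the actual thesis code. It deliberately does not call PDFBox: Fragment is a hypothetical holder for the values one would copy out of TextPosition (text, font size, x, y) inside a PDFTextStripper extension, and TitleGuesser and guessTitle are made-up names. It assumes y grows downwards on the page and that the title is the only text set in the largest size.

```java
import java.util.ArrayList;
import java.util.List;

class TitleGuesser {

    // Hypothetical stand-in for the data taken from a TextPosition.
    // This is not a PDFBox class.
    static class Fragment {
        final String text;
        final float fontSize;
        final float x;
        final float y;

        Fragment(String text, float fontSize, float x, float y) {
            this.text = text;
            this.fontSize = fontSize;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the text of all fragments on page one that use the largest
    // font size, joined in reading order (top to bottom, left to right).
    static String guessTitle(List<Fragment> pageOne) {
        float max = 0f;
        for (Fragment f : pageOne) {
            max = Math.max(max, f.fontSize);
        }

        List<Fragment> biggest = new ArrayList<>();
        for (Fragment f : pageOne) {
            // Small tolerance, since font sizes are floats
            if (Math.abs(f.fontSize - max) < 0.01f) {
                biggest.add(f);
            }
        }

        // Assumption: smaller y means higher up on the page
        biggest.sort((a, b) -> a.y != b.y
                ? Float.compare(a.y, b.y)
                : Float.compare(a.x, b.x));

        StringBuilder title = new StringBuilder();
        for (Fragment f : biggest) {
            if (title.length() > 0) {
                title.append(' ');
            }
            title.append(f.text.trim());
        }
        return title.toString();
    }
}
```

If the font size is not usable for a document, the same code could be fed the height or yscale value instead, with the loss of accuracy mentioned above.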

To get the references, I first use a keyword search to pull the whole section out of the text. I split it up, one line per reference, and then the fun part begins. I use pattern matching with regex and substrings to extract the title, authors and publication info from each line. I get pretty good results on some docs and worse ones on others, but the goal of my work is not to get all the information from every single document. It is to build a tool that lets users enter their own strategies for getting the information they want. There will also be future work on this topic to further improve my results.
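
The reference steps could be sketched roughly like this. Everything specific here is an assumption for illustration: the heading keywords, the "[n]" marker at the start of each entry, and the entry layout "[n] Authors: Title. Publication info". ReferenceSketch, splitEntries and parseEntry are made-up names, and real documents will need other patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReferenceSketch {

    // Keyword search for the section heading (assumed keywords)
    private static final Pattern HEADING =
            Pattern.compile("(?im)^\\s*(references|bibliography)\\s*$");

    // Assumption: every entry starts with a marker such as "[12]"
    private static final Pattern ENTRY_START = Pattern.compile("^\\[\\d+\\].*");

    // Assumed layout: "[n] Authors: Title. Publication info"
    private static final Pattern ENTRY =
            Pattern.compile("^\\[(\\d+)\\]\\s+(.+?):\\s+(.+?)\\.\\s+(.+)$");

    // Cuts out the text after the heading and returns one string per
    // reference, joining lines that were wrapped in the extracted text.
    static List<String> splitEntries(String fullText) {
        List<String> entries = new ArrayList<>();
        Matcher heading = HEADING.matcher(fullText);
        if (!heading.find()) {
            return entries; // no reference section found
        }
        StringBuilder current = null;
        for (String line : fullText.substring(heading.end()).split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (ENTRY_START.matcher(line).matches()) {
                if (current != null) {
                    entries.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append(' ').append(line); // continuation line
            }
        }
        if (current != null) {
            entries.add(current.toString());
        }
        return entries;
    }

    // Returns {authors, title, publication info}, or null if the entry
    // does not follow the assumed layout.
    static String[] parseEntry(String entry) {
        Matcher m = ENTRY.matcher(entry);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(2), m.group(3), m.group(4) };
    }
}
```

A title that itself contains ". " would be cut short by this pattern, which is the kind of case where per-document strategies become necessary.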

To sum it up, this is not done in a couple of lines of code. The PDFTextStripper extension has a good thousand lines, and that only prepares the text.

Torsten


Daniele Development-ML wrote:
Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, author, and the
bibliographic entries (the references) - the PDF file is printed through the
pdftex command.

Extracting the raw text doesn't help too much as no data is carried with
that. I was therefore trying to browse the document structure and access
the COS objects and get the text value through them. This may only
work for the title and the authors, which both might be written in a
different paragraph.

However, I'm getting a bit confused on the real feasibility of this approach
and on the use of the documentTreeStructure and the COSDictionary.

Has anybody ever faced/solved this problem?
Any comments or suggestions, or pointers to examples? The examples in the
distro seem not to cover this aspect fully, or perhaps I am wrong.

Many thanks,

Dan
