Re: Extracting paper/book title from a PDF

Torsten Petersdorf Mon, 02 Feb 2009 12:46:59 -0800

Hi there,

As I wrote to this list a couple of weeks ago, this is exactly what I am doing in my bachelor thesis.

I don't rely on any metadata, so this is a best-effort approach.

My approach uses a custom extension of PDFTextStripper for that purpose. To get the title of a document, I simply take the lines from page one with the biggest font size. The authors usually follow directly on the next lines. It was a bit of work to get the text in the right order, but you can use data from TextPosition to sort the text. The font size can be retrieved from there as well. For some documents the font size won't give useful results; you can use height or yscale then, but results tend to be less accurate than with the font size. Hopefully we will see an improvement here in later versions of PDFBox.
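
A minimal sketch of the "biggest font size on page one" idea, for illustration only; this is not the thesis code. It assumes the position and font size of each text run have already been collected (for example from TextPosition inside a PDFTextStripper subclass). The class TitleSketch, the Fragment holder and the sample data are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TitleSketch {

    /** Hypothetical stand-in for the values one would copy out of a TextPosition. */
    static final class Fragment {
        final String text;
        final float x;        // horizontal position on the page
        final float y;        // distance from the top of the page (smaller = higher up)
        final float fontSize;

        Fragment(String text, float x, float y, float fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Returns the text of all page-one fragments set in the largest font size. */
    static String extractTitle(List<Fragment> pageOne) {
        if (pageOne.isEmpty()) {
            return "";
        }
        float max = 0f;
        for (Fragment f : pageOne) {
            max = Math.max(max, f.fontSize);
        }

        // Keep only the fragments whose size matches the maximum (small tolerance).
        List<Fragment> title = new ArrayList<>();
        for (Fragment f : pageOne) {
            if (Math.abs(f.fontSize - max) < 0.01f) {
                title.add(f);
            }
        }

        // Restore reading order: top to bottom, then left to right.
        // Assumption: fragments on the same line share exactly the same y value.
        title.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));

        StringBuilder sb = new StringBuilder();
        for (Fragment f : title) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data, deliberately out of reading order.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("from PDF Documents", 90f, 124f, 17.2f));
        page.add(new Fragment("Jane Doe", 250f, 160f, 12f));
        page.add(new Fragment("Metadata", 320f, 100f, 17.2f));
        page.add(new Fragment("Extracting", 120f, 100f, 17.2f));
        page.add(new Fragment("Abstract", 72f, 220f, 10f));
        System.out.println(extractTitle(page)); // Extracting Metadata from PDF Documents
    }
}
```

Real documents will need a tolerance when comparing y values, and the height or yscale fallback mentioned above would replace fontSize in the same comparison.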

To get the references, I first use a keyword search to pull the whole section out of the text. I split it up, one line per reference, and then the fun part begins. I use pattern matching with regex and substrings to extract title, authors and publication info from each line. I get pretty good results on some docs and worse on others, but the goal of my work is not to get all information from every single document; it is to build a tool that allows users to enter their own strategies for getting the information they desire. There will also be future work on this topic to further improve my results.
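
The reference steps above could be sketched roughly as follows. This is only an illustration under strong assumptions, not the thesis code: the section is assumed to start at a heading line reading "References" or "Bibliography", each entry is assumed to start with a "[n]" marker, and each entry is assumed to have the layout "Authors. Title. Publication info". The class name, both patterns and the sample text are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceSketch {

    // Assumed heading keywords; other documents may use different ones.
    private static final Pattern HEADING = Pattern.compile(
            "^[ \\t]*(References|Bibliography)[ \\t]*$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // Assumed entry layout: "[n] Authors. Title. Publication info"
    private static final Pattern ENTRY = Pattern.compile(
            "^\\[(\\d+)\\]\\s+(.+?)\\.\\s+(.+?)\\.\\s+(.+)$");

    /** Finds the reference section by keyword and returns one line per reference. */
    static List<String> referenceLines(String text) {
        List<String> result = new ArrayList<>();
        Matcher heading = HEADING.matcher(text);
        if (!heading.find()) {
            return result;
        }
        String section = text.substring(heading.end());

        // A new entry starts at each "[n]" marker; other lines are continuations.
        StringBuilder current = null;
        for (String line : section.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.matches("\\[\\d+\\].*")) {
                if (current != null) {
                    result.add(current.toString());
                }
                current = new StringBuilder(trimmed);
            } else if (current != null) {
                current.append(' ').append(trimmed);
            }
        }
        if (current != null) {
            result.add(current.toString());
        }
        return result;
    }

    /** Returns {authors, title, publication info}, or null if the line does not fit. */
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Made-up sample text.
        String text = "Some body text\n"
                + "References\n"
                + "[1] Doe and Roe. A Sample Title. Journal of Examples, 2008.\n"
                + "[2] Poe. Another Title.\n"
                + "    Some Press, 2007.\n";
        for (String line : referenceLines(text)) {
            String[] fields = parse(line);
            if (fields != null) {
                System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
            }
        }
    }
}
```

A pattern this simple already fails on author initials such as "J. Doe", because the first period ends the author field too early. That is one reason a single fixed strategy does not work for every document.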

To sum it up, this is not done in a couple of lines of code. The PDFTextStripper extension has a good thousand lines, and that is only preparing the text.


Torsten


Daniele Development-ML wrote:

Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, author, and the
bibliographic entries (the references) - the PDF file is printed through the
pdftex command.

Extracting the raw text doesn't help much, as no additional data is carried
with it. I was therefore trying to browse the document structure, access
the COS objects, and get the text values through them. This might only
work for the title and the authors, which might both be written in
separate paragraphs.

However, I'm getting a bit confused about the real feasibility of this approach
and about the use of the documentTreeStructure and the COSDictionary.

Has anybody ever faced/solved this problem?
Any comments or suggestions, or pointers to examples? The examples in the
distro seem not to cover this aspect fully, or perhaps I am wrong.

Many thanks,

Dan
