Hi there,
I've been using PDFBox for a couple of weeks now (I just switched to
0.8, so some of this might be true only for 0.7.3). What I'm trying to
do is analyze papers and extract the document title and authors as well
as the list of references, in order to establish relationships between
several documents: who references whom, and which is the one paper you
have to read.
The problems I stumbled upon:
Columns - quite often those docs use a two-column layout. Often it is
recognized and the text is extracted one column after the other, which
is good. But there are documents which apparently do not contain what
you call beads, even though they use two columns. There, text is
extracted line by line, ignoring the columns. I realized that turning
off sortByPosition resolves part of the problem, but only if the text is
already stored in the correct order in the document. I don't know if
this is due to invalid documents or an error in the code.
I'm using a custom extension of PDFTextStripper. As a workaround for the
sorting problem, I wrote a method to analyze and sort the text (List of
TextPosition) while respecting the two column layout, which is called in
flushText() instead of Collections.sort(). I also changed the
TextPositionComparator to use a larger value (2) for the tolerance
comparison, so superscripts are on the correct line.
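For illustration, the column-aware sorting boils down to something like
the sketch below. It is not my actual code: Fragment is a made-up
stand-in for TextPosition, and the sketch assumes exactly two columns
split at the middle of the page, y growing downwards, and no full-width
lines (title, abstract). The tolerance plays the same role as in
TextPositionComparator.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSort {
    /** Hypothetical stand-in for TextPosition: just a position and the text. */
    static final class Fragment {
        final float x, y;
        final String text;
        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Orders one column: lines top to bottom, fragments left to right within a line. */
    static List<Fragment> sortColumn(List<Fragment> column, float tolerance) {
        List<Fragment> byY = new ArrayList<>(column);
        byY.sort((a, b) -> Float.compare(a.y, b.y));

        List<Fragment> result = new ArrayList<>();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : byY) {
            // A fragment further than 'tolerance' below the first fragment of
            // the current line starts a new line.
            if (!line.isEmpty() && Math.abs(f.y - line.get(0).y) > tolerance) {
                line.sort((a, b) -> Float.compare(a.x, b.x));
                result.addAll(line);
                line.clear();
            }
            line.add(f);
        }
        line.sort((a, b) -> Float.compare(a.x, b.x));
        result.addAll(line);
        return result;
    }

    /** Assumption: everything left of the page middle belongs to the first column. */
    static List<Fragment> sortTwoColumns(List<Fragment> fragments, float pageWidth, float tolerance) {
        float middle = pageWidth / 2f;
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            (f.x < middle ? left : right).add(f);
        }
        List<Fragment> result = sortColumn(left, tolerance);
        result.addAll(sortColumn(right, tolerance));
        return result;
    }
}
```

Grouping into lines first, instead of putting the tolerance directly
into a comparator, avoids a comparator that is not transitive.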
Next - Font size
TextPosition comes with several attributes, like height, yScale and
FontSize. So far I couldn't figure out which one to use to determine the
font size. Most of the time, getFontSize() returns 1, which is not
really useful. I also came across large areas of text with height set to
0. So I went for yScale, but for some documents this returns 1 for the
whole text as well.
I don't need absolute values, just interested in the biggest font, which
usually is used for the title of the paper.
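To make clear what I'm after, here is a rough sketch. Glyph is a made-up
stand-in for TextPosition, and using fontSize * yScale as a relative
size (with height as a fallback) is only a guess on my part, not
something I have verified against PDFBox.

```java
import java.util.List;

public class TitleGuess {
    /** Hypothetical stand-in for TextPosition with the three attributes above. */
    static final class Glyph {
        final String text;
        final float fontSize, yScale, height;
        Glyph(String text, float fontSize, float yScale, float height) {
            this.text = text;
            this.fontSize = fontSize;
            this.yScale = yScale;
            this.height = height;
        }
    }

    /**
     * Relative size of a glyph. Assumption (unverified): when one of
     * fontSize / yScale is 1, the other one carries the real size, so the
     * product is comparable across documents. Falls back to height.
     */
    static float relativeSize(Glyph g) {
        float scaled = g.fontSize * g.yScale;
        return scaled > 0 ? scaled : g.height;
    }

    /** Concatenates the text of all glyphs whose relative size is (almost) the maximum. */
    static String largestText(List<Glyph> glyphs) {
        float max = 0;
        for (Glyph g : glyphs) {
            max = Math.max(max, relativeSize(g));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            // 1% slack so that rounding differences do not split the title.
            if (max > 0 && relativeSize(g) >= max * 0.99f) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

Only the ordering of the values matters here, not the absolute numbers.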
Torsten