Hi Andreas,
All the issues I mentioned remain the same with 0.8.
The only change I noticed is in the results of
TextPosition.getHeight(), which no longer returns 0 for certain
documents. It still isn't very useful for determining font sizes,
though, so I'm a little stuck here. My column detection isn't great,
but it works well for pages that contain only text. Since I am only
interested in the references section, that is enough for me.
I have moved on to parsing the references and will come back to fonts
once that is done.
Torsten
Andreas Lehmkühler wrote:
Hi Torsten,
I've been using PDFBox (just switched to 0.8, so some of this might be
true only for 0.7.3) for a couple of weeks now. What I'm trying to do is
analyze papers and extract the document title and authors as well as the
list of references in order to establish relationships between several
documents: who references whom, and which is the one paper you have to
read.
Do you still have these problems with 0.8? Brian made some improvements
to the TextStripper concerning positioning and especially sorting.
Perhaps your problems are already gone ...
BR
Andreas