Text Extraction, Layout, Sorting, Font Sizes

Torsten Petersdorf Tue, 16 Dec 2008 01:42:53 -0800

Hi there,

I've been using pdfbox for a couple of weeks now (I just switched to 0.8, so some of this might be true only for 0.7.3). What I'm trying to do is analyze papers and extract the document title and authors as well as the list of references, in order to establish relationships between several documents: who references whom, and which is the one paper you have to read.


The problems I stumbled upon:

Columns - quite often those docs use a two-column layout. Often it is recognized and the text is extracted one column after the other, which is good. But there are documents which apparently do not contain what you call beads, even though they use two columns. For those, text is extracted line by line, ignoring the columns. I realized that turning off sortByPosition resolves part of the problem, but only if the order of the text in the document is already correct. I don't know if this is due to invalid documents or an error in the code.

I'm using a custom extension of PDFTextStripper. As a workaround for the sorting problem, I wrote a method to analyze and sort the text (a List of TextPosition) while respecting the two-column layout; it is called in flushText() instead of Collections.sort(). I also changed the TextPositionComparator to use a larger value (2) for the tolerance comparison, so superscripts end up on the correct line.
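A minimal sketch of such a column-aware sort, for illustration only. It does not use the PDFBox classes: Fragment is a hypothetical stand-in for TextPosition, splitX (the x coordinate separating the two columns) is assumed to be known, and y is assumed to grow downward. Lines are grouped with a tolerance rather than compared pairwise, which avoids a non-transitive comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TwoColumnSorter {

    /** Hypothetical stand-in for a text position: a piece of text plus its x/y. */
    public static final class Fragment {
        public final String text;
        public final float x;
        public final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragments in reading order: whole left column, then whole right column. */
    public static List<Fragment> sort(List<Fragment> fragments, float splitX, float lineTolerance) {
        List<Fragment> left = new ArrayList<>();
        List<Fragment> right = new ArrayList<>();
        for (Fragment f : fragments) {
            (f.x < splitX ? left : right).add(f);
        }
        List<Fragment> result = new ArrayList<>(readingOrder(left, lineTolerance));
        result.addAll(readingOrder(right, lineTolerance));
        return result;
    }

    /** Sorts one column top to bottom; fragments within lineTolerance of a line's first y share that line. */
    private static List<Fragment> readingOrder(List<Fragment> column, float lineTolerance) {
        Comparator<Fragment> byX = Comparator.comparingDouble((Fragment f) -> f.x);
        List<Fragment> byY = new ArrayList<>(column);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<Fragment> result = new ArrayList<>();
        List<Fragment> line = new ArrayList<>();
        float lineY = 0f;
        for (Fragment f : byY) {
            if (!line.isEmpty() && f.y - lineY > lineTolerance) {
                // Current line is complete: order it left to right and emit it.
                line.sort(byX);
                result.addAll(line);
                line.clear();
            }
            if (line.isEmpty()) {
                lineY = f.y;
            }
            line.add(f);
        }
        line.sort(byX);
        result.addAll(line);
        return result;
    }

    /** Joins the fragment texts with single spaces, for inspection. */
    public static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

With a tolerance of 2, a superscript sitting 1.5 units above its baseline stays on the same line as the text it belongs to. Detecting splitX automatically (for example from a gap in the x distribution) is left out here.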


Next - Font size

TextPosition comes with several attributes, like height, yScale and fontSize. So far I couldn't figure out which one to use to determine the font size. Most of the time, getFontSize() returns 1, which is not really useful. I also came across large areas of text with height set to 0. So I went for yScale, but for some documents this returns 1 for the whole text as well.

I don't need absolute values; I'm just interested in the biggest font, which is usually used for the title of the paper.
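One possible approach, sketched under assumptions: when the font size is reported as 1, the real scaling may live in the text matrix, so the product of font size and yScale can serve as a relative size. The Glyph class below is a hypothetical stand-in for TextPosition carrying just those two values; the sketch does not address text whose height is reported as 0.

```java
import java.util.List;

public class TitleGuess {

    /** Hypothetical stand-in for a text position: text, reported font size, vertical scale. */
    public static final class Glyph {
        public final String text;
        public final float fontSize;
        public final float yScale;

        public Glyph(String text, float fontSize, float yScale) {
            this.text = text;
            this.fontSize = fontSize;
            this.yScale = yScale;
        }
    }

    /** Relative size: the product stays meaningful when either factor is reported as 1. */
    static float effectiveSize(Glyph g) {
        return g.fontSize * g.yScale;
    }

    /** Concatenates, in input order, the text of all glyphs set in the largest effective size. */
    public static String largestText(List<Glyph> glyphs) {
        float max = 0f;
        for (Glyph g : glyphs) {
            max = Math.max(max, effectiveSize(g));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            // Small epsilon so rounding differences do not split one title into pieces.
            if (max > 0f && effectiveSize(g) >= max - 0.01f) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

Since only the largest value matters, it does not matter whether the product is in points or some other unit, as long as it is consistent within one document.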



Torsten
