Re: parsing data from pdf

Ben Litchfield Mon, 29 Jun 2009 10:33:29 -0700

The attachment didn't come through; I'm not sure if the mailing list accepts attachments.


If this is a regular Adobe Form, then you should use the PDAcroForm classes.

I'll assume the PDF is just regular text, and the field values are just in certain locations.

If they are always in the same location, you can try using the PDFTextStripperByArea.

It extends the regular PDFTextStripper but allows you to set up rectangles to extract.


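pseudo code
import java.awt.geom.Rectangle2D;
PDFTextStripperByArea ts = new PDFTextStripperByArea();
// Rectangle2D.Float takes (x, y, width, height); these values are placeholders
ts.addRegion( "first name", new Rectangle2D.Float(100,100,50,10));
ts.addRegion( "last name", new Rectangle2D.Float(100,200,50,10));
// page is the PDPage that contains the fields
ts.extractRegions(page);
String firstName = ts.getTextForRegion( "first name" );
String lastName = ts.getTextForRegion( "last name" );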

For this to work you'll need to know the rectangle dimensions. You can try using o.a.p.PDFReader, which will show the coordinates. Some PDFs don't display correctly, so if that doesn't work you could also try the PrintTextLocations example, though it is not quite as friendly.
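
Once you have coordinates for the text, it can help to check your rectangles against them before running the stripper. Here is a minimal sketch of that idea using only java.awt.geom. Note that RegionAssigner, addFragment and the sample coordinates are all made up for illustration; this is not PDFBox code and not how PDFTextStripperByArea is implemented, just the general notion of dropping positioned text into named rectangles.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: groups positioned text fragments
// into named rectangles so you can sanity-check region coordinates.
class RegionAssigner {
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();

    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        text.put(name, new StringBuilder());
    }

    // Append the fragment to every region whose rectangle contains (x, y).
    void addFragment(double x, double y, String fragment) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                StringBuilder sb = text.get(e.getKey());
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(fragment);
            }
        }
    }

    // Returns the collected text, or an empty string for an unknown region.
    String getTextForRegion(String name) {
        StringBuilder sb = text.get(name);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        RegionAssigner ra = new RegionAssigner();
        ra.addRegion("first name", new Rectangle2D.Float(100, 100, 50, 10));
        ra.addRegion("last name", new Rectangle2D.Float(100, 200, 50, 10));
        // Made-up positions standing in for what a coordinate dump might report.
        ra.addFragment(105, 104, "Jane");
        ra.addFragment(110, 205, "Doe");
        ra.addFragment(300, 400, "outside every region");
        System.out.println(ra.getTextForRegion("first name"));
        System.out.println(ra.getTextForRegion("last name"));
    }
}
```

If a fragment you expect never shows up in its region, the rectangle is probably off, or the coordinates are measured from a different origin than you assumed.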


Ben



Quoting Paul Schorfheide <[email protected]>:

We're attempting to parse some data from a US government form (I've attached a sample page), and PDFBox seems to be the closest fit to what we need. I'm currently using a PDFTextStripper with SortByPosition true, but it checks the data in a very strict horizontal manner, which doesn't play nice with some of the fields; for instance, the data from fields 3 & 4 get intermingled line by line. I was wondering if there was some way to parse out the data in a better form by playing off of the lines which separate the fields. Any ideas?
Regards,
Paul Schorfheide
