The attachment didn't come through; I'm not sure whether the mailing list accepts attachments.

If this is a regular Adobe form, then you should use the PDAcroForm classes.

Otherwise, I'll assume the PDF is just regular text and the field values simply sit at certain locations on the page.

If they are always in the same location, you can try using PDFTextStripperByArea.

It extends the regular PDFTextStripper but lets you set up named rectangles to extract text from.

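pseudo code (Rectangle2D is java.awt.geom.Rectangle2D; page is the page you want to extract from)
PDFTextStripperByArea ts = new PDFTextStripperByArea();
// each region is a named rectangle: x, y, width, height
ts.addRegion( "first name", new Rectangle2D.Float(100,100,50,10));
ts.addRegion( "last name", new Rectangle2D.Float(100,200,50,10));
// run the extraction for one page, then read each region back by name
ts.extractRegions(page);
String firstName = ts.getTextForRegion( "first name" );
String lastName = ts.getTextForRegion( "last name" );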
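
To make the idea of named rectangles concrete, here is a minimal sketch in plain Java, using only java.awt.geom and no PDFBox classes. It is not PDFBox's implementation, just an illustration of the principle: each positioned piece of text is assigned to whichever named rectangle contains its position. The Glyph class, the textByRegion method, and all coordinates are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionSketch {
    // Hypothetical stand-in for one positioned piece of text on a page.
    static final class Glyph {
        final float x;
        final float y;
        final String text;

        Glyph(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Collects, per named region, the text of every glyph whose position
    // falls inside that region's rectangle. Glyphs outside all regions are dropped.
    static Map<String, String> textByRegion(Map<String, Rectangle2D> regions, List<Glyph> glyphs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : glyphs) {
                if (region.getValue().contains(g.x, g.y)) {
                    sb.append(g.text);
                }
            }
            result.put(region.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("first name", new Rectangle2D.Float(100, 100, 50, 10));
        regions.put("last name", new Rectangle2D.Float(100, 200, 50, 10));

        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(102, 105, "J"));
        glyphs.add(new Glyph(108, 105, "o"));
        glyphs.add(new Glyph(101, 205, "Doe"));
        glyphs.add(new Glyph(300, 105, "ignored")); // outside both regions

        System.out.println(textByRegion(regions, glyphs));
    }
}
```

Because the text is bucketed by rectangle rather than by line, two fields that sit side by side on the page no longer get interleaved.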


For this to work you'll need to know the rectangle coordinates. You can try using o.a.p.PDFReader (org.apache.pdfbox.PDFReader), which will show the coordinates. Some PDFs don't display correctly in it, so if that doesn't work you could also try the PrintTextLocations example, though it is not quite as friendly.
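
Once you have noted the coordinates of the text that belongs to one field, turning them into a region rectangle is simple arithmetic. The sketch below is plain Java and makes some assumptions: each box is given as x, y, width, height with x, y as its top-left corner, in the same coordinate space the stripper uses, and the numbers are made up rather than taken from a real PDF. The boundsOf method and the padding parameter are hypothetical names.

```java
import java.awt.geom.Rectangle2D;

public class RegionBounds {
    // Returns the smallest rectangle covering all boxes, grown by `padding`
    // on every side. Each box is {x, y, width, height}.
    static Rectangle2D.Float boundsOf(float[][] boxes, float padding) {
        Rectangle2D bounds = null;
        for (float[] b : boxes) {
            Rectangle2D box = new Rectangle2D.Float(b[0], b[1], b[2], b[3]);
            bounds = (bounds == null) ? box : bounds.createUnion(box);
        }
        if (bounds == null) {
            throw new IllegalArgumentException("at least one box is required");
        }
        return new Rectangle2D.Float(
                (float) bounds.getX() - padding,
                (float) bounds.getY() - padding,
                (float) bounds.getWidth() + 2 * padding,
                (float) bounds.getHeight() + 2 * padding);
    }

    public static void main(String[] args) {
        // Made-up boxes for three pieces of text in one field.
        float[][] boxes = {
            {100, 100, 6, 10},
            {106, 100, 6, 10},
            {112, 101, 5, 9},
        };
        Rectangle2D.Float region = boundsOf(boxes, 2f);
        System.out.println(region);
    }
}
```

A little padding helps when the values in a field shift slightly from one document to the next; too much padding will start to pull in text from neighbouring fields.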

Ben



Quoting Paul Schorfheide <[email protected]>:

We're attempting to parse some data from a US government form (I've attached a sample page), and PDFBox seems to be the closest fit to what we need. I'm currently using a PDFTextStripper with SortByPosition set to true, but it reads the data in a very strict horizontal manner, which doesn't play nice with some of the fields; for instance, the data from fields 3 & 4 gets intermingled line by line. I was wondering if there is some way to parse out the data in a better form by playing off of the lines that separate the fields. Any ideas?

Regards,
Paul Schorfheide




