The attachment didn't come through; I'm not sure whether the mailing list
accepts attachments.
If this is a regular Adobe Form, then you should use the PDAcroForm classes.
I'll assume the PDF is just regular text, and the field values are
just in certain locations.
If they are always in the same location, you can try using
PDFTextStripperByArea.
It extends the regular PDFTextStripper but lets you set up named
rectangles and extract only the text inside each one.
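Pseudo code (the coordinates are made up):
import java.awt.geom.Rectangle2D;
PDFTextStripperByArea ts = new PDFTextStripperByArea();
// Rectangle2D.Float takes x, y, width, height
ts.addRegion( "first name", new Rectangle2D.Float(100, 100, 50, 10) );
ts.addRegion( "last name", new Rectangle2D.Float(100, 200, 50, 10) );
ts.extractRegions( page ); // page is the PDPage that holds the fields
String firstName = ts.getTextForRegion( "first name" );
String lastName = ts.getTextForRegion( "last name" );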
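To show why regions help with the intermingling you describe, here is a rough sketch in plain Java with no PDFBox involved. Fragment, textByRegion and all the sample coordinates are invented for illustration. This is not how PDFTextStripperByArea works internally, just the general idea of grouping text by rectangle instead of reading strictly row by row.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RegionSketch {

    /** Hypothetical stand-in for a run of text and the point where it starts. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** For each named rectangle, joins the fragments whose start point lies inside it. */
    static Map<String, String> textByRegion(Map<String, Rectangle2D> regions,
                                            List<Fragment> fragments) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            StringBuilder text = new StringBuilder();
            for (Fragment f : fragments) {
                if (region.getValue().contains(f.x, f.y)) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(f.text);
                }
            }
            result.put(region.getKey(), text.toString());
        }
        return result;
    }

    /** Two side-by-side fields, two lines each, listed in strict row-by-row order. */
    static List<Fragment> sampleFragments() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("ACME Corp", 60, 110));   // field 3, line 1
        fragments.add(new Fragment("Jane Doe", 310, 110));   // field 4, line 1
        fragments.add(new Fragment("12 Main St", 60, 125));  // field 3, line 2
        fragments.add(new Fragment("Director", 310, 125));   // field 4, line 2
        return fragments;
    }

    /** Invented rectangles (x, y, width, height) drawn around the two fields. */
    static Map<String, Rectangle2D> sampleRegions() {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("field 3", new Rectangle2D.Double(50, 100, 200, 40));
        regions.put("field 4", new Rectangle2D.Double(300, 100, 200, 40));
        return regions;
    }

    public static void main(String[] args) {
        // Prints one line per field rather than one line per row of the page.
        textByRegion(sampleRegions(), sampleFragments())
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

Read row by row, the sample comes out as "ACME Corp Jane Doe" followed by "12 Main St Director". Grouped by rectangle, each field keeps its own two lines together.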
For the region approach to work you'll need to know the rectangle
coordinates and dimensions. You can try o.a.p.PDFReader, which will show
the coordinates. Some PDFs don't display correctly in it, so if that
doesn't work you could also try the PrintTextLocations example, though
it is not quite as friendly.
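If you end up working from per-character positions rather than a viewer, you still have to turn them into a rectangle. Below is a rough sketch of that step, assuming you have already collected one box (x, y, width, height) per character of a sample value. RegionFromGlyphs, boundingRegion, the margin and the sample numbers are all hypothetical helpers of mine, not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class RegionFromGlyphs {

    /** Smallest rectangle covering every box, grown by margin on all four sides. */
    static Rectangle2D boundingRegion(List<Rectangle2D> glyphBoxes, double margin) {
        if (glyphBoxes.isEmpty()) {
            throw new IllegalArgumentException("no glyph boxes given");
        }
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (Rectangle2D box : glyphBoxes) {
            minX = Math.min(minX, box.getMinX());
            minY = Math.min(minY, box.getMinY());
            maxX = Math.max(maxX, box.getMaxX());
            maxY = Math.max(maxY, box.getMaxY());
        }
        return new Rectangle2D.Double(minX - margin, minY - margin,
                (maxX - minX) + 2 * margin, (maxY - minY) + 2 * margin);
    }

    /** Invented boxes for three neighbouring characters of one field value. */
    static List<Rectangle2D> sampleBoxes() {
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(100, 100, 6, 10));
        boxes.add(new Rectangle2D.Double(106, 100, 6, 10));
        boxes.add(new Rectangle2D.Double(112, 101, 5, 9));
        return boxes;
    }

    public static void main(String[] args) {
        // The resulting rectangle is what you would hand to addRegion.
        System.out.println(boundingRegion(sampleBoxes(), 2));
    }
}
```

A small margin is probably worth keeping, since values of different lengths will not land on exactly the same coordinates from one form to the next.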
Ben
Quoting Paul Schorfheide <[email protected]>:
We're attempting to parse some data from a US government form (I've
attached a sample page), and PDFBox seems to be the closest fit to
what we need. I'm currently using a PDFTextStripper with
SortByPosition true, but it checks the data in a very strict
horizontal manner, which doesn't play nice with some of the fields,
for instance the data from fields 3 & 4 get intermingled line by
line. I was wondering if there was some way to parse out the data
in a better form by playing off of the lines which separate the
fields. Any ideas?
Regards,
Paul Schorfheide