parsing data from pdf

Paul Schorfheide Mon, 29 Jun 2009 09:27:40 -0700

We're attempting to parse some data from a US government form (I've attached a 
sample page), and PDFBox seems to be the closest fit to what we need. I'm 
currently using a PDFTextStripper with SortByPosition set to true, but it reads 
the data in a very strict horizontal manner, which doesn't play nicely with 
some of the fields. For instance, the data from fields 3 & 4 gets intermingled 
line by line. I was wondering if there is some way to parse out the data in a 
better form by using the lines that separate the fields. Any ideas?


Regards,
Paul Schorfheide
