We're trying to parse data from a US government form (I've attached a sample page), and PDFBox seems to be the closest fit for what we need. I'm currently using a PDFTextStripper with SortByPosition set to true, but it reads the page strictly line by line across its full width, which doesn't work well for some of the fields. For instance, the data from fields 3 & 4 gets intermingled line by line. Is there some way to extract the data field by field, using the lines that separate the fields as boundaries? Any ideas?
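To make the idea concrete, here is a rough sketch in plain Java of the grouping I have in mind. None of this is PDFBox API: FieldGrouper, Fragment, columnOf and textByColumn are hypothetical names, and I'm assuming that the position of each piece of text and the x coordinates of the vertical separator lines can be obtained from the page somehow. It only handles vertical separators, so it is an illustration of the approach rather than a solution.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox.
class FieldGrouper {

    // One piece of text and the position of its top-left corner.
    // Assumes top-down y coordinates (smaller y = higher on the page).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Column index for x: the number of separator lines at or to the left of x.
    static int columnOf(double x, double[] separators) {
        int column = 0;
        for (double s : separators) {
            if (x >= s) {
                column++;
            }
        }
        return column;
    }

    // Assign each fragment to a column first, then read each column
    // top to bottom (and left to right within a row).
    static Map<Integer, String> textByColumn(List<Fragment> fragments, double[] separators) {
        Map<Integer, List<Fragment>> columns = new TreeMap<>();
        for (Fragment f : fragments) {
            columns.computeIfAbsent(columnOf(f.x, separators), k -> new ArrayList<>()).add(f);
        }

        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Fragment>> entry : columns.entrySet()) {
            List<Fragment> cell = entry.getValue();
            cell.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                    .thenComparingDouble(f -> f.x));

            StringBuilder sb = new StringBuilder();
            for (Fragment f : cell) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.put(entry.getKey(), sb.toString());
        }
        return result;
    }
}
```

The point is the order of operations: the text is split by separator first and sorted by position second, so two fields that sit side by side are never merged into a single line.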
Regards, Paul Schorfheide
