Re: Extracting tables data from PDF file ?

Jeremias Maerki Sun, 22 Mar 2009 23:56:27 -0700

Actually, there is a way, but AFAIK that's not supported yet, and for it
to work the PDF has to be tagged (which most PDFs aren't).


Tagged PDF: http://www.planetpdf.com/enterprise/article.asp?ContentID=6067

On 22.03.2009 11:55:00 Dexter Mishra wrote:
> Hi Hanan,
> I don't think there is a way to tell that a piece of data is table data.
> One thing you can do is use the article/bead feature in the
> PDFTextStripper example. We have a similar requirement: we keep some
> metadata in the form of PDF comments, so I am modifying the PDFBox
> library to use the PDF comments as userspace metadata.
> Another approach you can try is working with the x,y coordinates of the
> text in the PDF.
> 
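To illustrate the x,y coordinate idea mentioned above, here is a minimal
sketch that groups positioned text fragments into rows. It is plain Java
and does not call PDFBox at all: RowGrouper, Fragment and groupIntoRows
are hypothetical names made up for this example, and the fragments are
assumed to have been collected beforehand together with their positions.
It also assumes that y grows downwards and that cells of one row share
roughly the same y value, which will not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: groups positioned text into rows. */
public class RowGrouper {

    /** A piece of text together with the x,y position where it starts. */
    public static class Fragment {
        final float x;
        final float y;
        final String text;

        public Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Sorts the fragments by y, starts a new row whenever a fragment lies
     * more than yTolerance away from the first fragment of the current row,
     * and finally orders the cells of each row from left to right.
     */
    public static List<List<String>> groupIntoRows(List<Fragment> fragments, float yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y; // y of the first fragment anchors the row
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

Each inner list is then one candidate table row that could be written to
a database. Deciding which rows actually belong to a table (and not to
ordinary paragraphs) is left open here and would need extra heuristics,
for example checking that consecutive rows share the same column x
positions.
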
> On Sat, Mar 21, 2009 at 9:01 PM, Hanan Harush <[email protected]> wrote:
> 
> > Hi
> >
> >
> >
> > My name is Hanan and I am developing an in-house application that needs
> > to read a PDF file and extract table text into a local database.
> >
> > Of course, the number of rows in the table may change from time to time.
> >
> >
> >
> > After reading a lot about PDF as well as PDFBox, I have succeeded in:
> >
> >     Loading a PDF document
> >
> >     Iterating through its pages
> >
> >
> >
> > My questions are:
> >
> > 1. Is there a way to identify a table in a PDF file?
> >
> > 2. What are the alternatives for extracting only table data using PDFBox?
> >
> > 3. How is it possible to step through a table?
> >
> >
> >
> > Best Regards,
> >
> > Hanan Harush
> >
> >
> >
> >




Jeremias Maerki
