Actually, there is a way, but AFAIK that's not supported yet, and for it to work the PDF has to be tagged (which most PDFs aren't).
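
For untagged PDFs, the x,y coordinate idea from the quoted message can be sketched roughly as follows. This is only an illustration under assumptions, not something PDFBox provides: the class TableRowSketch, the Fragment type, and the yTolerance parameter are all made up for this example. In practice the x, y, and text values would have to be collected from PDFBox's TextPosition objects (getX(), getY()), and which way y grows depends on the coordinate system in use.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups positioned text
// fragments into table-like rows using only their coordinates.
class TableRowSketch {

    // One piece of text with its position on the page (assumed to have
    // been collected beforehand, e.g. from PDFBox TextPosition objects).
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Sorts fragments by y, starts a new row whenever a fragment's y is
    // more than yTolerance away from the first fragment of the current
    // row, then orders each row from left to right by x.
    static List<List<String>> toRows(List<Fragment> fragments, float yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for text read from a PDF page.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(200f, 100.4f, "Price"));
        fragments.add(new Fragment(50f, 100f, "Item"));
        fragments.add(new Fragment(50f, 120f, "Apple"));
        fragments.add(new Fragment(200f, 119.8f, "1.20"));

        for (List<String> row : toRows(fragments, 2f)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

This only recovers rows. Finding column boundaries, and deciding which part of a page is a table at all, would still need heuristics of your own (for example, looking for x positions that repeat across many rows).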

Tagged PDF:
http://www.planetpdf.com/enterprise/article.asp?ContentID=6067

On 22.03.2009 11:55:00 Dexter Mishra wrote:
> Hi Hanan,
> I don't think there is a way to say that some data is table data. One thing
> you can do is use the article/bead feature in the PDFTextStripper example.
> We also have a similar requirement. We have some metadata in the form of PDF
> comments, so I am modifying the PDFBox library to use the PDF comments as
> userspace metadata.
> One approach you can try is manipulating the x,y coordinates of the PDF.
>
> On Sat, Mar 21, 2009 at 9:01 PM, Hanan Harush <[email protected]> wrote:
>
> > Hi
> >
> > My name is Hanan and I am developing an in-house application that requires
> > reading a PDF file and extracting table text to a local database.
> >
> > Of course, the number of rows in a table might change from time to time.
> >
> > After reading a lot about PDF as well as PDFBox, I have succeeded in:
> >
> > Loading a PDF document
> >
> > Iterating through its pages
> >
> > My questions are:
> >
> > 1. Is there a way to identify a table in a PDF file?
> >
> > 2. What are the alternatives for extracting table data only, using PDFBox?
> >
> > 3. How is it possible to step through a table?
> >
> > Best Regards,
> >
> > Hanan Harush

Jeremias Maerki
