Hi there Jeroen,

Quoting Jeroen Ooms (2016-02-26 12:40:14)
> We are using poppler for parsing and indexing scientific articles. For
> this purpose I wrote some bindings to poppler-cpp for the R
> programming language. A few questions:
>
> - Many of our pdf files give parsing errors, such as "Failed to get
> object num from hint tables" or "Expected the optional content group
> list, but wasn't able to find it" or "insufficient arguments for
> Marked Content". Examples of problematic pdf files are here:
> https://github.com/sckott/pdftoolspdfs. Are all of these pdf files
> corrupted or are these limitations in poppler? Each of these files
> seem to open just fine in any pdf reader.
>
> - Is there any sensible way to extract tabular data from pdf documents
> in a machine readable form (such as xml or csv or html)? I noticed
> that pdftotext with the -layout option does a really nice job
> positioning the table contents so I suppose poppler must have picked
> up on the table internally?
Unfortunately, it's not that easy. Tables in PDFs are just streams of commands that paint lines and text at certain positions, as is most of the content in PDFs. Your best chance of getting actual information about the structure of tables is to use Tagged-PDFs, which include additional semantic information about the contents of the pages.

We have support in Poppler for reading the Tagged-PDF bits, but none of the “pdfto*” conversion tools uses it. For a rough example of how to do this, you can check the code for “pdfstructtohtml” [1] which, unfortunately, is not included in official releases.

I hope that helps!

--
⌨ Adrian

---
[1] https://github.com/aperezdc/poppler/blob/tagged-pdf-utils/utils/pdfstructtohtml.cc
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
