In a message of Wed, 25 Nov 2015 12:43:51 -0500, Francois Dion writes:
>This is well beyond the scope of Tutor, but let me mention the following:
>
>The code to pdftables disappeared from github some time back. What is on
>sourceforge is old, same with pypi. I wouldn't create a project using
>pdftables based on that...
>
>As far as what you are trying to do, it looks like they might have the data
>in excel spreadsheets. That is totally trivial to load in pandas. if you
>have any choice at all, avoid PDF at all cost to get data. See some detail
>of the complexity here:
>http://ieg.ifs.tuwien.ac.at/pub/yildiz_iicai_2005.pdf
>
>For your two documents, if you cannot find the data in the excel sheets, I
>think the tabula (ruby based application) approach is the best bet.
>
>Francois

What he said.  Double.

However ... you can also see about using poppler.  It has a nice
pdftohtml utility.  Once you get your data in as HTML, if you are lucky
and the table information didn't get destroyed in the process, you can
then send it to pandas, which will happily read HTML tables.  Once you
have pandas reading it, you are pretty much home free and can do
whatever you like with the data.

If you happen to be on Ubuntu, then getting poppler and pdftohtml is easy:
http://www.ubuntugeek.com/howto-convert-pdf-files-to-html-files.html

It seems to be harder on Windows, but there are Stack Overflow questions
outlining how to do it ...

Laura
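
For what it's worth, the pdftohtml-then-pandas route might look roughly like the sketch below. It is only a sketch under assumptions: the function names are made up for illustration, "report.pdf" is a hypothetical file, and the pdftohtml flags (-stdout, -noframes, -i) are as I understand poppler-utils, so check `pdftohtml -h` on your system. pandas.read_html also needs lxml (or BeautifulSoup plus html5lib) installed, and it only finds real <table> elements, which pdftohtml may well not produce for a given PDF.

```python
import io
import subprocess


def pdftohtml_command(pdf_path):
    """Build a poppler command line that writes one HTML page to stdout."""
    # Assumed flags: -stdout sends the HTML to standard output,
    # -noframes produces a single page, -i leaves images out.
    return ["pdftohtml", "-stdout", "-noframes", "-i", str(pdf_path)]


def tables_from_html(html):
    """Return a list of DataFrames, one per <table> found in the HTML."""
    # Imported here so the rest of the module loads without pandas.
    import pandas as pd

    try:
        # StringIO wrapper: newer pandas versions warn on literal HTML strings.
        return pd.read_html(io.StringIO(html))
    except ValueError:
        # read_html raises ValueError when the document has no tables,
        # i.e. the table structure did not survive the conversion.
        return []


def tables_from_pdf(pdf_path):
    """Convert a PDF with pdftohtml and hand the result to pandas."""
    result = subprocess.run(
        pdftohtml_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftohtml fails
    )
    return tables_from_html(result.stdout)


if __name__ == "__main__":
    # "report.pdf" is a placeholder; substitute your own document.
    for number, table in enumerate(tables_from_pdf("report.pdf")):
        print("table", number)
        print(table.head())
```

If the list comes back empty, the table layout was lost in the conversion, and the spreadsheet or tabula suggestions above are the better bet.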