[copied to list for posterity...]
Sorry. I was completely wrong. I've been using iText to split PDFs, fill in
forms and recombine them, so I assumed (wrongly) that text extraction was
possible. In fact, reading the mailing lists is quite informative: clearly
PDF is not designed for this.

Try this instead:

http://pdfbox.apache.org/commandlineutilities/ExtractText

It can be run from the command line, so it could potentially be automated.

Mark

2010/1/10 Mark Wardle <m...@wardle.org>:
> If you can use a R <-> java interface, you could use itext to do this
> as long as the PDF is fairly sane.
>
> see http://itextpdf.com/
>
> It is what pdftk uses.
>
> b/w
>
> Mark
>
> 2010/1/9 David Kane <d...@kanecap.com>:
>> I have a pdf file that I would like to parse into R:
>>
>> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>>
>> For now, I open the file in Acrobat by hand, then save it "as text"
>> and then use readLines(). That works fine but a) I am concerned that
>> some information may be lost and b) I may be doing this a lot, so I
>> would rather have R grab the information from the pdf file directly.
>>
>> So: is there something like readPDF() for R?
>>
>> Thanks,
>>
>> Dave Kane
>>
>> PS. If you're curious, here is the sort of work that I want to do with
>> this data:
>> http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Dr. Mark Wardle
> Specialist registrar, Neurology
> Cardiff, UK

--
Dr. Mark Wardle
Specialist registrar, Neurology
Cardiff, UK