Re: [R] parsing pdf files

Mark Wardle Sun, 10 Jan 2010 09:07:30 -0800

[copied to list for posterity...]


Sorry, I was completely wrong. I've been using iText to split PDFs, fill in
forms and recombine them, so I assumed (wrongly) that text extraction was
also possible.

In fact, reading the mailing lists is quite informative - clearly PDF
is not designed for this.

Try PDFBox's ExtractText utility instead:

http://pdfbox.apache.org/commandlineutilities/ExtractText

It can be run from the command line, so it could potentially be automated.
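
For the archive, here is a minimal sketch of what that automation might
look like from R. It assumes Java is on the PATH, that the PDFBox app jar
has been downloaded somewhere (jar_path is a placeholder), and that the jar
accepts the 1.x/2.x-style "ExtractText <input> <output>" invocation; check
the PDFBox documentation for what your version supports. The function names
are made up for illustration.

```r
# Hypothetical helper: build the java arguments for PDFBox's ExtractText.
# Assumes the "ExtractText <input.pdf> <output.txt>" form of the tool.
extract_text_args <- function(jar_path, pdf_path, txt_path) {
  c("-jar", shQuote(jar_path),
    "ExtractText", shQuote(pdf_path), shQuote(txt_path))
}

# Hypothetical wrapper: run ExtractText and return the text as lines,
# much like saving "as text" by hand and then calling readLines().
read_pdf_text <- function(pdf_path, jar_path, java = "java") {
  txt_path <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_path))  # remove the temporary text file afterwards

  # system2() returns the exit status when output is not captured
  status <- system2(java, extract_text_args(jar_path, pdf_path, txt_path))
  if (status != 0 || !file.exists(txt_path)) {
    stop("ExtractText failed with exit status ", status)
  }
  readLines(txt_path, warn = FALSE)
}
```

Usage would then be something like
lines <- read_pdf_text("faculty.pdf", "path/to/pdfbox-app.jar"),
with both paths adjusted to your own setup.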

Mark

2010/1/10 Mark Wardle <m...@wardle.org>:
> If you can use an R <-> Java interface, you could use iText to do this
> as long as the PDF is fairly sane.
>
> see http://itextpdf.com/
>
> It is what pdftk uses.
>
> b/w
>
> Mark
>
> 2010/1/9 David Kane <d...@kanecap.com>:
>> I have a pdf file that I would like to parse into R:
>>
>> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>>
>> For now, I open the file in Acrobat by hand, then save it "as text"
>> and then use readLines(). That works fine but a) I am concerned that
>> some information may be lost and b) I may be doing this a lot, so I
>> would rather have R grab the information from the pdf file directly.
>>
>> So: is there something like readPDF() for R?
>>
>> Thanks,
>>
>> Dave Kane
>>
>> PS. If you're curious, here is the sort of work that I want to do with
>> this data:
>> http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>
> --
> Dr. Mark Wardle
> Specialist registrar, Neurology
> Cardiff, UK
>



-- 
Dr. Mark Wardle
Specialist registrar, Neurology
Cardiff, UK
