Re: [R] parsing pdf files

2010-01-10 Thread Mark Wardle
[copied to list for posterity...] Sorry. I am completely wrong. I've been using iText to split PDFs, fill in forms and recombine them, so I assumed (wrongly) that text extraction was possible. In fact, reading the mailing lists is quite informative: clearly PDF is not designed for this. Try this http:

Re: [R] parsing pdf files

2010-01-10 Thread John Maindonald
ep="")
  funentries <- paste("\\indexentry ", "{", entrymat[,1], "}{", entrymat[,2], "}", sep="")
  write(funentries, fdxfile)
  system(paste("makeindex -o", fndfile, fdxfile))
}
Jo
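The fragment above appears to build makeindex input lines of the form \indexentry {term}{page} and write them to a file. A minimal sketch of just that step follows. It assumes entrymat is a two-column character matrix (term, page); the sample entries and the temporary file are hypothetical, and the makeindex call is left out so the example runs without LaTeX installed.

```r
# Hypothetical index data: column 1 = index term, column 2 = page number.
entrymat <- matrix(c("lm",   "12",
                     "plot", "30"),
                   ncol = 2, byrow = TRUE)

# Build one \indexentry line per row, as in the fragment above.
funentries <- paste("\\indexentry ", "{", entrymat[, 1], "}{",
                    entrymat[, 2], "}", sep = "")

# Write the entries to a temporary file standing in for fdxfile.
fdxfile <- tempfile(fileext = ".fdx")
write(funentries, fdxfile)

# Read the file back to inspect what makeindex would receive.
written <- readLines(fdxfile)
print(written)
```

Because paste() is vectorised over the matrix columns, no explicit loop over rows is needed.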

Re: [R] parsing pdf files

2010-01-10 Thread Mark Wardle
If you can use an R <-> Java interface, you could use iText to do this, as long as the PDF is fairly sane. See http://itextpdf.com/ (it is what pdftk uses). b/w Mark 2010/1/9 David Kane : > I have a pdf file that I would like to parse into R: > > http://www.williams.edu/Registrar/geninfo/faculty.p

Re: [R] parsing pdf files

2010-01-09 Thread Laurent Rhelp
David Kane wrote: I have a pdf file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine but a) I am concerned that some information may be lost

Re: [R] parsing pdf files

2010-01-09 Thread Albert-Jan Roskam
t-Jan ~~ In the face of ambiguity, refuse the temptation to guess. ~~ --- On Sat, 1/9/10, Barry Rowlingson wrote: From: Barry Rowlingson Subject: Re: [R] pa

Re: [R] parsing pdf files

2010-01-09 Thread Barry Rowlingson
On Sat, Jan 9, 2010 at 1:11 PM, David Kane wrote: > I have a pdf file that I would like to parse into R: > > http://www.williams.edu/Registrar/geninfo/faculty.pdf > > For now, I open the file in Acrobat by hand, then save it "as text" > and then use readLines(). That works fine but a) I am concern

[R] parsing pdf files

2010-01-09 Thread David Kane
I have a pdf file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine but a) I am concerned that some information may be lost and b) I may be doing th
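The manual workflow described above (save the PDF "as text" from Acrobat, then call readLines()) can be sketched roughly as follows. The file contents are hypothetical stand-ins for the saved text, not the actual faculty listing, and the clean-up steps are only one plausible way to tidy the result.

```r
# Stand-in for the text file saved by hand from Acrobat.
# The contents here are hypothetical.
txtfile <- tempfile(fileext = ".txt")
writeLines(c("Faculty listing",
             "",
             "  Jane Doe, Professor of Biology  ",
             "John Roe, Lecturer in Art"),
           txtfile)

# Read the saved text into a character vector, one element per line.
lines <- readLines(txtfile)

# Strip padding left over from the page layout and drop blank lines.
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

print(lines)
```

Comparing length(lines) against the number of entries visible in the PDF is a cheap way to check whether the "save as text" step silently dropped anything.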