Hello, I was just wondering whether you had found a solution? I am having the same difficulty converting PDFs into plain-text documents in R. I originally thought I could use the readLines() function, but as you can see below, that did not work: it just returns the raw contents of the PDF file, not its text.

R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
R> download.file(url = my.url, destfile=my.destfile, mode='wb')
R> txt <- readLines(my.destfile)
R> txt
[1] "%PDF-1.4"
[2] "%ÐÔÅØ"
[3] "1 0 obj <<"
[4] "/Length 587 "
[5] "/Filter /FlateDecode"
[6] ">>"
[7] "stream"
[8] "[EMAIL PROTECTED]&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf \017’W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê®\027[vïÖæ6ïWÛ7ñÑTÙÖvb \030¯“uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G›"

Warm Regards,
Clair

On 13 Nov, 15:10, Tony Breyal <[EMAIL PROTECTED]> wrote:
> Dear R-Help,
>
> I need to convert a set of '.pdf' files into an equivalent set of
> '.txt' files. This is so that i can do some text mining on the
> content.
>
> In the latest R-News letter
> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
> 'tm' for text mining is mentioned. In that lovely package, there is a
> function called 'readPDF()'. In order to use this, ?readPDF says
>
> "Note that this PDF reader needs both the tools pdftotext and
> pdfinfo installed and accessable on your system."
>
> These tools are available from http://www.foolabs.com/xpdf/download.html
>
> I am able to download this and use it easily from a dos window to
> convert a pdf file into a txt file.
>
> Question: how do i make these tools available to R, so that i can use
> the readPDF() function?
>
> Thank you in advance for any help, and I hope the above made sense.
> Tony Breyal
>
> ### OS = Windows Vista Ultimate
>
> > sessionInfo()
>
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0
>
> loaded via a namespace (and not attached):
> [1] proxy_0.4-1
>
> ______________________________________________
> [EMAIL PROTECTED] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
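
One thing that might be worth trying, offered only as a minimal sketch and not a tested solution: since pdftotext already works from a DOS window, R could call it directly and then read the resulting .txt file. The sketch assumes the folder containing pdftotext.exe has been added to the Windows PATH so that Sys.which() can find it. The helper name pdf_to_text() and its arguments are made up for illustration; they are not part of 'tm' or base R. It also uses system2(), which as far as I know only exists in R versions newer than 2.8.0; on an older R, system() with a pasted command string would be the substitute.

```r
# Hypothetical helper (not from 'tm' or base R): convert one PDF to plain text
# by calling the xpdf command-line tool 'pdftotext', then read the result.
pdf_to_text <- function(pdf.file,
                        txt.file = sub("\\.pdf$", ".txt", pdf.file,
                                       ignore.case = TRUE)) {
  if (!file.exists(pdf.file)) {
    stop("no such file: ", pdf.file)
  }
  # Guard against overwriting the input when it does not end in ".pdf"
  if (identical(txt.file, pdf.file)) {
    stop("output file would overwrite the input file: ", pdf.file)
  }

  # Sys.which() returns "" when the program cannot be found on the PATH
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) {
    stop("pdftotext was not found on the PATH; add the xpdf folder to PATH")
  }

  # shQuote() protects paths with spaces, e.g. "Documents and Settings"
  status <- system2(exe, args = c(shQuote(pdf.file), shQuote(txt.file)))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }

  # The converted file is ordinary text, so readLines() is appropriate here
  readLines(txt.file, warn = FALSE)
}
```

With the file downloaded as above, txt <- pdf_to_text(my.destfile) should then give a character vector with one element per line of text, provided pdftotext is found. Whether readPDF() in 'tm' picks up the tools the same way (through the PATH) is an assumption I have not been able to confirm.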