That's exactly how I work. Here is a chunk of my script. In a nutshell, I'm already extracting, by means of grep and gsub from indweb (luckily an HTML file), web addresses like http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072 and the like, which point to PDF files (unfortunately for me). That's why I need to "translate" each PDF into a txt file.

Ciao
Vittorio
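For the PDF-to-txt step, something along these lines could work (a minimal sketch, assuming the pdftotext utility from the poppler/xpdf command-line tools is installed and on the PATH; the file names are just placeholders):

# download one of the extracted PDF links to a local file
pdfurl <- "http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072"
download.file(pdfurl, destfile = "report.pdf", mode = "wb")

# convert it to plain text; -layout tries to preserve the column layout
system("pdftotext -layout report.pdf report.txt")

# read the converted text back into R for grep/gsub processing
testopdf <- readLines("report.txt")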
==============================

indweb <- "http://www.terna.it/default/Home/SISTEMA_ELETTRICO/dispacciamento/dati_esercizio/dati_giornalieri/confronto.aspx"
testo  <- readLines(indweb)

# lines of the page that carry a dd/mm/201x date in the documents grid
k <- grep("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">(\\d\\d)/(\\d\\d)/201(\\d)", testo)
n <- length(k)

# Since the dates are in decreasing order, sort them into increasing order
k <- k[order(k, decreasing = TRUE)]

for (i in 1:length(k)) {
  # extract the dd/mm/yyyy date from the matched line and rewrite it as yyyy-mm-dd
  data <- gsub("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">", "", testo[k[i]])
  data <- paste(substr(data, 7, 10), substr(data, 4, 5), substr(data, 1, 2), sep = "-")

  # skip dates already recorded in the "richiesta" table ('con' is an existing DBI connection)
  mysel <- paste("select count(*) from richiesta where data=\"", data, "\";", sep = "")
  dataesiste <- as.integer(dbGetQuery(con, mysel))

  if (dataesiste == 0) {
    # recover the href of the "Confronto Giornaliero" link and fetch that page
    rif <- gsub("\">Confronto Giornaliero(.)+", "", testo[k[30]])
    rif <- gsub("^(.)+href=\"", "", rif)
    pag <- paste("http://www.terna.it", rif, sep = "")
    pagina <- readLines(pag)
    ....................
    ....................
    ....................

On 18 Sep 2011, at 18:25, Joshua Wiley wrote:

> On Sun, Sep 18, 2011 at 7:44 AM, Victor <vdem...@gmail.com> wrote:
>> Unfortunately pdf2text doesn't seem to exist either on Linux or Mac OS X.
>
> I think Jeff's main point was to search for software specific to your
> task (converting a PDF to text). Formatting will be lost, so once you get
> your text files, I would look at regular expressions to find the right
> part of the text to grab. Some general functions that seem relevant:
>
> ## for getting the text into R
> ?readLines
> ?scan
> ## for finding the part you need
> ?regexp
> ?grep
>
> Cheers,
>
> Josh
>
>> Ciao Vittorio
>>
>> On 17 Sep 2011, at 21:00, Jeff Newmiller wrote:
>>
>>> Doesn't seem like an R task, but see pdf2text? (From pdftools, UNIX command
>>> line tools)
>>>
>>> Victor <vdem...@gmail.com> wrote:
>>> In an R script I need to extract some figures from many web pages in PDF
>>> format. As an example, see
>>> http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072
>>> from which I would like to extract the figure "Totale: 1,025,823".
>>> Is there any solution?
>>> Ciao
>>> Vittorio
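Following Josh's pointers above, a rough sketch of the extraction step, assuming the PDF has already been converted to report.txt (the placeholder name from the earlier sketch) and that the converted text contains a line of the form "Totale: 1,025,823"; the pattern will likely need tuning against the real pdftotext output:

testopdf <- readLines("report.txt")

# first line mentioning the total
riga <- grep("Totale", testopdf, value = TRUE)[1]

# keep what follows "Totale:", then drop anything after the number itself
tot <- sub("^.*Totale:[[:space:]]*", "", riga)
tot <- sub("[^0-9,].*$", "", tot)

# "1,025,823" -> 1025823
totale <- as.numeric(gsub(",", "", tot))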