I think you are more likely to get a helpful answer if you give a minimal example of what your lines look like. I certainly don't have a clue, though maybe someone else will.
Cheers,
Bert

On Wed, Nov 20, 2019 at 12:21 PM Thomas Subia via R-help <r-help@r-project.org> wrote:

> Thanks all for the help. I appreciate the feedback.
> I've developed another method to extract my desired data from multiple
> pdfs in a directory.
>
> # Combine all pdfs to a combined pdf
> files <- list.files(pattern = "pdf$")
> pdf_combine(files, output = "joined.pdf")
>
> # creates a text file from joined.pdf
> pdf_text("joined.pdf")
> txt <- pdf_text("joined.pdf")
> write.table(txt,file="mydata.txt")
>
> # I need to extract the lines which match a line beginning with AMAT
> lines <- readLines("mydata.txt")
> date <- grep("AMAT",lines)
>
> # output for date looks like [1] 6 62 118 174 230 286 342 398
> # These are exactly the line positions I need.
>
> Now that I've got the desired lines, I don't know how to extract the data
> from those lines.
>
> Any advice would be appreciated.
>
> All the best,
>
> Thomas Subia
> Statistician / Quality Engineer
> IMG Precision Inc.
>
> On Wednesday, November 20, 2019, 07:58:08 AM PST, Eric Berger <ericjber...@gmail.com> wrote:
>
> Hi Thomas,
> As Jeff wrote, your HTML email is difficult to read. This is a "plain
> text" forum.
> As for "pointers", here is one suggestion.
> Since you write that you can do the necessary actions with a specific
> file, try to write a function that carries out those actions for that
> same file.
> When implementing the function, however, replace any specific data with
> the value of an argument passed into the function.
> e.g.
> txt <- pdf_text("10619.pdf")
> would be replaced by
> txt <- pdf_text(pdfFile)
>
> and your function would have pdfFile as an argument, as in
>
> myfunc <- function( pdfFile )
>
> Since you can accomplish the task for this file without a function,
> you should be able to accomplish the task with a function.
> Once you succeed in doing that, you can then try passing the function
> arguments that refer to the other files you need to process.
>
> HTH,
> Eric
>
> On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >
> > Please don't spam the mailing list. Especially with HTML format
> > messages. See the Posting Guide.
> >
> > PDF is designed to present data graphically. It is literally possible to
> > place every character in the page in random order and still achieve this
> > visual readability while practically making it nearly impossible to read. I
> > have encountered many PDF files with the same text placed on the page
> > multiple times... again scrambling your option to read it digitally. Tools
> > like "pdftools" can sometimes work when the program that generated the file
> > does so in a simple and extraction-friendly way... but there are no
> > guarantees, and your description suggests that it is likely that you won't
> > be able to accomplish your goal with this file.
> >
> > On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <r-help@r-project.org> wrote:
> > >Colleagues,
> > >
> > >I can extract specific data from lines in a pdf using:
> > >
> > >library(pdftools)
> > >pdf_text("10619.pdf")
> > >txt <- pdf_text("10619.pdf")
> > >write.table(txt,file="mydata.txt")
> > >con <- file('mydata.txt')
> > >open(con)
> > >serial <- read.table(con,skip=5,nrow=1) # Extract [3]
> > >flatness <- read.table(con,skip=11,nrow=1) # Extract [5]
> > >parallel1 <- read.table(con,skip=2,nrow=1) # Extract [5]
> > >parallel2 <- read.table(con,skip=4,nrow=1) # Extract [5]
> > >close(con)
> > >
> > ># note here that serial has 4 variables
> > ># flatness has 6 variables
> > ># parallel1 has 5 variables
> > ># parallel2 has 5 variables
> > >
> > ># this outputs the specific data I need
> > >serial[3]
> > >flatness[5]
> > >parallel1[5] # Note here that the txt format shows 0.0007, not
> > >scientific notation. Is there a way to format this to display the
> > >original data?
> > >parallel2[5] # Note here that the txt format shows 0.0006, not
> > >scientific notation. Is there a way to format this to display the
> > >original data?
> > >
> > >I'd like to extend this code to all of the pdf files in a directory and
> > >to generate a table of all the serial, flatness, parallel1 and parallel2
> > >data.
> > >
> > >I'm not having a lot of success trying to build the script for this.
> > >Some pointers would be appreciated.
> > >All the best.
> > >
> > >Thomas Subia
> > >Statistician / Senior Quality Engineer
> > >
> > > [[alternative HTML version deleted]]
> > >
> > >______________________________________________
> > >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Sent from my phone. Please excuse my brevity.
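On the open question in the thread (how to get values out of the lines that grep() finds), here is a minimal sketch in base R. It is not anyone's posted solution. The sample text, the helper name `extract_fields`, and the column positions are all hypothetical, because the thread never shows what an AMAT line actually contains; the sketch assumes whitespace-separated fields. It works on the character vector that pdf_text() returns (one string per page), so the intermediate write.table()/readLines() step is not needed.

```r
# Hypothetical page text standing in for the output of pdftools::pdf_text().
# The real layout of the AMAT lines is not shown in the thread.
page_text <- c(
  "Inspection report\nAMAT  10619  SN-001  7.0E-04\nOperator: n/a",
  "Inspection report\nAMAT  10620  SN-002  6.0E-04\nOperator: n/a"
)

# Illustrative helper (not part of pdftools): return the whitespace-separated
# fields of every line that starts with `prefix`, one matrix row per line.
# All values are kept as character strings.
extract_fields <- function(page_text, prefix = "AMAT") {
  # Split each page into its individual lines
  lines <- unlist(strsplit(page_text, "\n", fixed = TRUE))
  # Keep the lines themselves (value = TRUE), not their positions
  hits <- grep(paste0("^[[:space:]]*", prefix), lines, value = TRUE)
  # Break each matching line into fields
  parts <- strsplit(trimws(hits), "[[:space:]]+")
  # Pad short lines with NA so every row has the same number of fields
  n <- max(lengths(parts), 0L)
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))
  do.call(rbind, padded)
}

fields <- extract_fields(page_text)
serial <- fields[, 3]   # assumed position of the serial number
value  <- fields[, 4]   # assumed position of a measurement
print(data.frame(serial, value, stringsAsFactors = FALSE))

# With pdftools installed, the same helper could be applied to each file,
# assuming every file yields the same number of fields per AMAT line:
# files <- list.files(pattern = "pdf$")
# all_fields <- do.call(rbind, lapply(files, function(f) extract_fields(pdftools::pdf_text(f))))
```

Because the fields stay as character strings, a value such as 7.0E-04 is shown exactly as it appears in the text; convert with as.numeric() only where arithmetic is needed.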