Thank you all for your help, this has solved my problem. My main
problem with using gsubfn was that I was getting confused by the
square brackets in [^]]+[^] ] but I now have a much better
understanding of what this means.

Cheers!
Tony Breyal

On 6 May, 19:38, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:
> This is very similar to the solution in Jim's post
> except the regular expressions can be made
> slightly simpler due to the use of strapply and a
> few of the regular expressions have been made a
> bit different even apart from that. It's not
> always clear what the general case is based on an example
> so the regular expressions may need to be tweaked
> once the full data is available but this does work
> on the sample shown.
>
> Here:
>
> \\d+ means one or more digits
> [^]]+[^] ] means one or more non-] characters followed by a
> final character which is neither ] nor space
> \\S+ means one or more non-space characters
> \\S+ . (.*) means one or more non-space characters followed by
> space followed by any character followed by space followed by any
> sequence of characters
>
> In each case the portion of the regular expression
> in parentheses is captured and returned by
> strapply.
>
> library(gsubfn)
>
> # input is input data as in Jim's post
> data.frame(ID = strapply(input, "ID: (\\d+)", c, simplify = TRUE),
>   Writer = strapply(input, "Writer: ([^]]+[^] ])", c, simplify = TRUE),
>   Rating = strapply(input, "Rating: (\\S+)", c, simplify = TRUE),
>   Text = strapply(input, "Rating: \\S+ . (.*)", c, simplify = TRUE),
>   stringsAsFactors = FALSE)
>
> On Thu, May 6, 2010 at 12:24 PM, jim holtman <jholt...@gmail.com> wrote:
> > Try this:
> >
> > > cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> > + "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> > + "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon"),
> > + sep = "\n", file = "tmp.txt")
> > >
> > > # read in the data and parse it assuming it has the same structure
> > > input <- readLines('tmp.txt')
> > > # parse it item by item
> > > x.id <- sub(".*\\[ID: ([[:digit:]]+).*", '\\1', input)
> > > x.writer <- sub(".*\\[Writer:([^]]+).*", '\\1', input)
> > > x.rating <- sub(".*\\[Rating: ([0-9.]+).*", '\\1', input)
> > > x.prog <- sub(".*\\](.*)", '\\1', input)
> > > # create dataframe
> > > data.frame(id=x.id, writer=x.writer, rating=x.rating, prog=x.prog)
> >    id                 writer rating       prog
> > 1 001          Steven Moffat    8.9 Doctor Who
> > 2 002            Joss Whedon    8.8      Buffy
> > 3 003 J. Michael Straczynski    7.4    Babylon
> >
> > On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.bre...@googlemail.com> wrote:
> >
> >> Dear all
> >>
> >> Let's say I have a plain text file as follows:
> >>
> >> > cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> >> + "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> >> + "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
> >> + sep = "\n", file = "tmp.txt")
> >>
> >> I would somehow like to read this file into R and convert it into a
> >> data frame like this:
> >>
> >> > DF <- data.frame(ID = c("001", "002", "003"),
> >> + Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
> >> + Rating = c("8.9", "8.8", "7.4"),
> >> + Text = c("Doctor Who", "Buffy", "Babylon [5]"),
> >> + stringsAsFactors = FALSE)
> >>
> >> My initial thoughts were to use readLines on the text file and maybe
> >> do some regular expressions and also use strsplit(..); but having
> >> confused myself after several attempts I was wondering if there is a
> >> way, perhaps using read.table instead? My end goal is to
> >> hopefully convert DF into an XML structure.
> >>
> >> Thank you kindly in advance for your time,
> >> Tony Breyal
> >>
> >> # Windows Vista
> >> > sessionInfo()
> >> R version 2.11.0 (2010-04-22)
> >> i386-pc-mingw32
> >>
> >> locale:
> >> [1] LC_COLLATE=English_United Kingdom.1252
> >>     LC_CTYPE=English_United Kingdom.1252
> >>     LC_MONETARY=English_United Kingdom.1252
> >>     LC_NUMERIC=C
> >>     LC_TIME=English_United Kingdom.1252
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
> >>
> >> other attached packages:
> >> [1] XML_2.8-1
> >>
> >> loaded via a namespace (and not attached):
> >> [1] tools_2.11.0
> >>
> >> ______________________________________________
> >> r-h...@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem that you are trying to solve?
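
P.S. For anyone reading this thread later: the two bracket expressions discussed above can be checked in base R, without gsubfn. What follows is only a minimal sketch under some assumptions. capture1 is a hypothetical helper written for this note (it is not a gsubfn function), it relies on base R's regexec() and regmatches() rather than strapply(), and the input vector is the sample from the original question.

```r
# Sample lines as posted in the original question.
input <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

# Hypothetical helper (not part of gsubfn): return the first captured
# group of `pattern` for each element of `x`, or NA when nothing matches.
capture1 <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# "[^]]" is a bracket expression: the leading "^" negates it, and a "]"
# placed first inside the brackets is taken literally, so it reads as
# "any character except ]".
# "[^] ]" is the same idea with a space added: "neither ] nor space".
writer <- capture1("Writer: ([^]]+[^] ])", input)

# One or more non-space characters, a space, any one character (here the
# closing "]"), a space, then everything that remains on the line.
text <- capture1("Rating: \\S+ . (.*)", input)

print(writer)
print(text)
```

The trailing [^] ] is what drops the space before the closing bracket: [^]]+ on its own would capture "Steven Moffat " with the trailing space, whereas requiring the last captured character to be neither ] nor space forces the capture to end at the final letter of the name.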