Hi Tony,

On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.bre...@googlemail.com> wrote:
> Dear all
>
> Let's say I have a plain text file as follows:
>
>> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> +       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> +       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
> +     sep = "\n", file = "tmp.txt")
>
> I would somehow like to read this file into R and convert it into a
> data frame like this:
>
>> DF <- data.frame(ID = c("001", "002", "003"),
> +                  Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
> +                  Rating = c("8.9", "8.8", "7.4"),
> +                  Text = c("Doctor Who", "Buffy", "Babylon [5]"),
> +                  stringsAsFactors = FALSE)
>
> My initial thoughts were to use readLines on the text file and maybe
> do some regular expressions and also use strsplit(..); but having
> confused myself after several attempts I was wondering if there is a
> way, perhaps using read.table instead? My end goal is to
> hopefully convert DF into an XML structure.

I can't think of an easy way to do it with a simple read.table call.
As you suggested, I'd try to whip this into shape by loading the file
into a character vector with readLines, then working on it with
strsplit and regular expressions.

If your data is this well behaved, why not try splitting your lines
on "]", then do some mincing. For instance:

## Simulate a readLines on your file
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## Create an empty data.frame
df <- data.frame(id=character(length(lines)),
                 writer=character(length(lines)),
                 rating=numeric(length(lines)),
                 text=character(length(lines)))

pieces <- strsplit(lines, "]", fixed=TRUE)

## Store into their separate pieces for more processing
ids <- sapply(pieces, '[[', 1)
writers <- sapply(pieces, '[[', 2)
ratings <- sapply(pieces, '[[', 3)
## Note: a "]" inside the free text (as in "Babylon [5]") is also
## consumed by the split, so the text field needs extra care
texts <- sapply(pieces, '[[', 4)

## You can use regexes again, or strsplit judiciously
clean.ids <- sapply(strsplit(ids, ' '), '[', 2)
## The writers still carry leading/trailing spaces after this step
clean.writers <- sapply(strsplit(writers, ':', fixed=TRUE), '[', 2)
...

Honestly, if your data isn't all that well behaved, I'd probably do
this in another language like Python to whip it into a "cleaner"
tab-separated file that can easily be read into R. I tend to like
Python's matching behavior with regexes a bit better ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
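
For reference, here is a minimal sketch of the "regular expressions" route mentioned in the reply, pulling all four fields out with one pattern instead of several strsplit passes. It assumes every line has exactly the three tags in the order ID, Writer, Rating, followed by free text; the names pattern, m and fields are only illustrative, and this is one possible way to do it rather than the approach from the thread.

```r
## Sample lines, as readLines("tmp.txt") would return them
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## One capture group per tag; [^]]* stops at the first "]", and the
## final (.*) keeps everything after the third tag, brackets included
pattern <- "^\\[ID:([^]]*)\\] *\\[Writer:([^]]*)\\] *\\[Rating:([^]]*)\\] *(.*)$"
m <- regmatches(lines, regexec(pattern, lines))

## Each match should hold the full match plus four groups;
## stop early if a line does not follow the assumed layout
stopifnot(all(lengths(m) == 5))

## Drop the full match, trim surrounding spaces, stack into a matrix
fields <- do.call(rbind, lapply(m, function(x) trimws(x[-1])))

DF <- data.frame(ID = fields[, 1],
                 Writer = fields[, 2],
                 Rating = fields[, 3],
                 Text = fields[, 4],
                 stringsAsFactors = FALSE)
```

Because the last group is not split on "]", a title such as "Babylon [5]" should come through intact. Rating is left as character to match the DF in the question; as.numeric(DF$Rating) would convert it if numbers are wanted.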