[R] What is the best way to efficiently construct a data frame from multiple source files?

Mark Connolly Thu, 29 Oct 2009 11:46:10 -0700

I have an arbitrary number of spreadsheets that I want to consolidateinto a single data frame. The spreadsheets all have the same structure:

location, depth1Reading, depth2reading, depth3reading, depth4reading,depth5reading

The spreadsheets have their reading date in their file name. This getsparsed out and added to the data frame as a factor. The file name getsrecorded in the data frame as a reference factor. The depth readingsare put into a normal form in the data frame.


The target data frame structure is
str(df)
'data.frame':   23100 obs. of  5 variables:

$ measurement.date: Factor w/ 77 levels "2005/01/07","2003/01/08",..: 11 1 1 1 1 1 1 1 1 ...$ source : Factor w/ 77 levels "TDRREADINGS010705.xls",..: 1 11 1 1 1 1 1 1 1 ...

$ location        : int  1 1 1 1 1 2 2 2 2 2 ...
$ position        : num  1 2 3 4 5 1 2 3 4 5 ...
$ theta.percent   : num  24.8 23.5 30.7 26.6 NA 20.7 28.2 24.3 20.6 10 ...

I am successfully using the following (nested) looping-and-rbindingmethod, but it is very slow. I tried allocating the whole data frameand replacing rows, but ran into issues with new factors. I would liketo know if there is a general R approach for efficiently doing this sortof data frame construction (assuming R is generally consideredappropriate for data cleanup and restructuring).


code (works but slow):

   require("gdata") # for reading spreadsheet
   tdrs <- readLines(pipe("ls TDR*.xls"))

na.strings=c("n", " n", "n ", " ", "jn", "N", "bent", "bent pin","skip")) # ugly but not important


   # allocate empty data frame
   df <- data.frame(measurement.date=character(0),

source=character(0),location=numeric(0),position=numeric(0),theta.percent=numeric(0))


   # iterate over spreadsheets
   for (i in 1:length(tdrs)) {
       source <- tdrs[i]  # slightly optimistic
       tdr <- read.xls(source, na.strings=na.strings)

       # standardize column names
       names(tdr)<-c("probe", "X1", "X2", "X3", "X4", "X5")

       # create a date that is nicely sortable

measurement.date <- paste(paste(20,substr(source,16,17),sep=""),# year

                                 substr(source,12,13), # month
                                 substr(source,14,15), sep="/") # day

       for (j in 1:nrow(tdr)) {    # iterate over each spreadsheet row
           tdrrow <- tdr[j,]
           location <- tdrrow$probe

           for (pos in 2:6) {    # normalize the readings
               if (is.na(tdrrow[,pos])) {
                       theta.percent<-NA }
               else theta.percent<-as.numeric(tdrrow[,pos])
               position <- pos - 1

df <- rbind(df,data.frame(measurement.date=measurement.date,

                                          source=source,
                                          location=location,
                                          position=position,
                                          theta.percent=theta.percent))
           }
       }
   }

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] What is the best way to efficiently construct a data frame from multiple source files?

Reply via email to