I have an arbitrary number of spreadsheets that I want to consolidate into a single data frame. The spreadsheets all have the same structure:

location, depth1Reading, depth2reading, depth3reading, depth4reading, depth5reading

The spreadsheets have their reading date in their file name. This gets parsed out and added to the data frame as a factor. The file name gets recorded in the data frame as a reference factor. The depth readings are put into a normal form in the data frame.

The target data frame structure is
str(df)
'data.frame':   23100 obs. of  5 variables:
$ measurement.date: Factor w/ 77 levels "2005/01/07","2003/01/08",..: 1 1 1 1 1 1 1 1 1 1 ... $ source : Factor w/ 77 levels "TDRREADINGS010705.xls",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location        : int  1 1 1 1 1 2 2 2 2 2 ...
$ position        : num  1 2 3 4 5 1 2 3 4 5 ...
$ theta.percent   : num  24.8 23.5 30.7 26.6 NA 20.7 28.2 24.3 20.6 10 ...


I am successfully using the following (nested) looping-and-rbinding method, but it is very slow. I tried allocating the whole data frame and replacing rows, but ran into issues with new factors. I would like to know if there is a general R approach for efficiently doing this sort of data frame construction (assuming R is generally considered appropriate for data cleanup and restructuring).

code (works but slow):

   require("gdata") # for reading spreadsheet
   tdrs <- readLines(pipe("ls TDR*.xls"))
na.strings=c("n", " n", "n ", " ", "jn", "N", "bent", "bent pin", "skip")) # ugly but not important

   # allocate empty data frame
   df <- data.frame(measurement.date=character(0),
source=character(0), location=numeric(0), position=numeric(0), theta.percent=numeric(0))

   # iterate over spreadsheets
   for (i in 1:length(tdrs)) {
       source <- tdrs[i]  # slightly optimistic
       tdr <- read.xls(source, na.strings=na.strings)

       # standardize column names
       names(tdr)<-c("probe", "X1", "X2", "X3", "X4", "X5")

       # create a date that is nicely sortable
measurement.date <- paste(paste(20,substr(source,16,17),sep=""), # year
                                 substr(source,12,13), # month
                                 substr(source,14,15), sep="/") # day

       for (j in 1:nrow(tdr)) {    # iterate over each spreadsheet row
           tdrrow <- tdr[j,]
           location <- tdrrow$probe

           for (pos in 2:6) {    # normalize the readings
               if (is.na(tdrrow[,pos])) {
                       theta.percent<-NA }
               else theta.percent<-as.numeric(tdrrow[,pos])
               position <- pos - 1
df <- rbind(df, data.frame(measurement.date=measurement.date,
                                          source=source,
                                          location=location,
                                          position=position,
                                          theta.percent=theta.percent))
           }
       }
   }

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to