I'm reading a file and using it to populate a data frame. Because of how the
file is laid out, I need to fill in the data frame one row at a time.

When I start reading my file, I don't know how many rows I will need. It's
on the order of a million.

Being mindful of the time expense of reallocation, I decided on a strategy
of doubling the data frame's size every time I needed to expand it. That
way memory is never more than 50% wasted, and the whole thing should still
finish in O(N) time.
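
(The usual amortized argument: with doubling, the total copying over all
reallocations is at most N + N/2 + N/4 + ... < 2N row-copies, i.e. O(N).)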

But it still, somehow, has an O(N^2) performance characteristic. It seems
that merely setting a single element is slower in a large data frame than
in a small one. Here is a toy function that illustrates the problem: it
reallocates and fills in single rows of a data frame, and shows the
slowdown:

populate.data.frame.test <- function(n=1000000, chunk=1000) {
  df <- data.frame(a=numeric(0), b=numeric(0), c=numeric(0))
  t <- proc.time()[3]  # elapsed time
  for (i in 1:n) {
    ## print the fill rate once per chunk of rows
    if (i %% chunk == 0) {
      now <- proc.time()[3]
      cat(sprintf("%d rows: %g rows per sec, nrows = %d\n",
                  i, chunk/(now - t), nrow(df)))
      t <- now
    }

    ## double data frame size if necessary: assigning NA to a row
    ## past the end extends the data frame out to that row
    while (nrow(df) < i) {
      df[max(i, 2*nrow(df)), ] <- NA
      cat(sprintf("Doubled to %d rows\n", nrow(df)))
    }

    ## fill in one row
    df[i, c('a', 'b', 'c')] <- list(runif(1), i, runif(1))
  }
  invisible(df)
}

Is there a way to do this that avoids the slowdown? The data cannot be
represented as a matrix (different columns have different types).
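
For instance, would something like the following work: keep each column as
its own plain typed vector, double those the same way, and assemble the
data frame only once at the end? (A rough sketch; the function name is
made up:)

populate.by.columns <- function(n=1000000) {
  ## one plain typed vector per column; capacity doubles as needed
  a <- numeric(1); b <- integer(1); c <- numeric(1)
  for (i in 1:n) {
    if (i > length(a)) {
      ## growing a vector with length<- pads the new elements with NA
      length(a) <- 2*length(a)
      length(b) <- 2*length(b)
      length(c) <- 2*length(c)
    }
    a[i] <- runif(1)
    b[i] <- i
    c[i] <- runif(1)
  }
  ## build the data frame once, trimming the unused capacity
  data.frame(a=a[1:n], b=b[1:n], c=c[1:n])
}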

Peter
