Another approach that might be worth trying is to
create an empty data frame with lots and lots of
rows before looping, and then replace rather than
append. Of course, this requires knowing at least
approximately how many rows total you will have.
This suggestion comes from the help page for
read.table(), which says:
Using 'nrows', even as a mild over-estimate, will help memory usage.
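For example (just a sketch; the column names, types, and row count below are placeholders and would need to match what read.csv actually returns for your files; all.files is the vector of file names from the example further down):
## preallocate once, then fill blocks of rows by index instead of rbind-ing
n <- 350 * 850                         # a mild over-estimate of the total rows
big <- data.frame(a = integer(n), b = numeric(n), c = numeric(n),
                  d = character(n), stringsAsFactors = FALSE)
next.row <- 1
for (f in all.files) {
  in.data <- read.csv(f, stringsAsFactors = FALSE)
  big[next.row:(next.row + nrow(in.data) - 1), ] <- in.data   # replace, don't append
  next.row <- next.row + nrow(in.data)
}
big <- big[seq_len(next.row - 1), ]    # drop any unused rows at the end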
You may be doing a lot of unnecessary processing
if you are allowing your character variables to
be automatically converted to factors. This would
especially be the case if each data frame has new
character values not in the previous ones, since
more levels would be added to the factor
variables each time a data frame is appended.
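If that is what's happening, you can turn the conversion off when reading (a small sketch, using read.csv as in your loop; stringsAsFactors and as.is are standard read.csv/read.table arguments):
## read character columns as plain character rather than factor
in.data <- read.csv(i, stringsAsFactors = FALSE)
## or, equivalently
in.data <- read.csv(i, as.is = TRUE)
## supplying colClasses avoids the type guessing altogether, e.g.
## read.csv(i, colClasses = c("integer", "numeric", "numeric", "character"))
## (adjust that vector to the actual columns of your files)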
Another approach would be to concatenate the
files outside of R (in unix, this would be the
"cat" command) and then read the single large
file into R. This can be controlled from within
R, i.e., using the system() command. It can even
be done without actually writing the extra file,
with something like
read.csv( pipe( 'cat *.csv') )
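Or, if you prefer an intermediate file, something like this (only a sketch; the name all_files.csv is arbitrary, and note that each input file's header line gets concatenated along with the data, so those rows would need to be dropped or skipped):
## concatenate outside R via system(), then read the single big file
system("cat test/*.csv > all_files.csv")   # 'test' is the folder from the example below
big <- read.csv("all_files.csv")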
Despite those ideas, I like Greg Snow's approach;
I'd try it before any of these.
Finally, if you really want to find out where the
cpu time is being spent, look into profiling; see
?Rprof.
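A minimal sketch of that (the output file name is arbitrary):
Rprof("loop.out")          # start collecting profiling samples
## ... run the file-reading loop here ...
Rprof(NULL)                # stop profiling
summaryRprof("loop.out")   # shows which functions the time was spent in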
-Don
At 3:53 PM -0400 7/23/09, Denis Chabot wrote:
Hi,
I often have to do this:
- select a folder (directory) containing a few hundred data files in csv format (up to 1000 files, in fact)
- open each file and transform some character variables into date-time format
- make it into a dataframe (this involves getting rid of a few variables I don't need)
- concatenate it to the master dataframe that will eventually contain the data from all the files in the folder.
I use a loop going from 1 to the number of
files. I have added a command to print an
incrementing number to the R console each time
the loop completes one iteration, to judge the
speed of the process.
At the beginning, 3-4 files are processed each
second. After a few hundred iterations it slows
down to about 1 file per second. Before I reach
the last file (898 in the case at hand), it has
become much slower, about 1 file every 2-3
seconds.
This progressive slowing down suggests the
problem is linked to the size of the growing
"master" dataframe that rbind combines with each
new file.
In fact, the small script below confirms this, as nothing at all happens within the loop but rbind. You can cut the size of this example so as not to waste too much of your time:
# create a dummy data.frame and copy it into a large number of csv files
test <- file.path("test")
dir.create(test, showWarnings = FALSE)  # make sure the target folder exists
a <- 1:350
b <- rnorm(350, 100, 10)
c <- runif(350, 0, 100)
d <- month.name[runif(350, 1, 12)]
the.data <- data.frame(a, b, c, d)
for(i in 1:850){
  write.csv(the.data, file = paste(test, "/file_", i, ".csv", sep = ""))
}
# now let's make a single dataframe from all these csv files
all.files <- list.files(path = test, full.names = TRUE, pattern = "\\.csv$")
new.data <- NULL
system.time({
  for(i in all.files){
    in.data <- read.csv(i)
    if (is.null(new.data)) {
      new.data <- in.data
    } else {
      new.data <- rbind(new.data, in.data)
    }
    cat(paste(i, ", ", sep = ""))
  } # end for
}) # end system.time
   user  system elapsed
156.206  44.859 202.150
This is with
sessionInfo()
R version 2.9.1 Patched (2009-07-16 r48939)
x86_64-apple-darwin9.7.0
locale:
fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doBy_3.7 chron_2.3-30 timeDate_290.84
loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1 Hmisc_3.5-2
lattice_0.17-25 tools_2.9.1
Would it be better to somehow read each of the 850 files into its own dataframe first, and then rbind them all in a single operation?
Can I combine all my files without using a loop? I've never quite mastered the "apply" family of functions, but I have not seen examples that read files.
Thanks in advance,
Denis Chabot
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.