Another approach that might be worth trying is to
create an empty data frame with lots and lots of
rows before looping, and then replace rather than
append. Of course, this requires knowing at least
approximately how many rows total you will have.
This suggestion comes from the help page for
read.table(), which says:
Using 'nrows', even as a mild over-estimate, will help memory usage.
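For example (just a sketch; the column names, types, and row count below are placeholders and would need to match what read.csv actually returns for your files; all.files is the vector of file names from the example further down):
## preallocate once, then fill blocks of rows by index instead of rbind-ing
n <- 350 * 850                         # a mild over-estimate of the total rows
big <- data.frame(a = integer(n), b = numeric(n), c = numeric(n),
                  d = character(n), stringsAsFactors = FALSE)
next.row <- 1
for (f in all.files) {
  in.data <- read.csv(f, stringsAsFactors = FALSE)
  big[next.row:(next.row + nrow(in.data) - 1), ] <- in.data   # replace, don't append
  next.row <- next.row + nrow(in.data)
}
big <- big[seq_len(next.row - 1), ]    # drop any unused rows at the end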
You may be doing a lot of unnecessary processing
if you are allowing your character variables to
be automatically converted to factors. This would
especially be the case if each data frame has new
character values not in the previous ones, since
more levels would be added to the factor
variables each time a data frame is appended.
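If that is what's happening, you can turn the conversion off when reading (a small sketch, using read.csv as in your loop; stringsAsFactors and as.is are standard read.csv/read.table arguments):
## read character columns as plain character rather than factor
in.data <- read.csv(i, stringsAsFactors = FALSE)
## or, equivalently
in.data <- read.csv(i, as.is = TRUE)
## supplying colClasses avoids the type guessing altogether, e.g.
## read.csv(i, colClasses = c("integer", "numeric", "numeric", "character"))
## (adjust that vector to the actual columns of your files)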
Another approach would be to concatenate the
files outside of R (in unix, this would be the
"cat" command) and then read the single large
file into R. This can be controlled from within
R, i.e., using the system() command. It can even
be done without actually writing the extra file,
with something like
read.csv( pipe( 'cat *.csv') )
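Or, if you prefer an intermediate file, something like this (only a sketch; the name all_files.csv is arbitrary, and note that each input file's header line gets concatenated along with the data, so those rows would need to be dropped or skipped):
## concatenate outside R via system(), then read the single big file
system("cat test/*.csv > all_files.csv")   # 'test' is the folder from the example below
big <- read.csv("all_files.csv")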
Despite those ideas, I like Greg Snow's approach;
I'd try it before any of these.
Finally, if you really want to find out where the
cpu time is being spent, look into profiling; see
?Rprof.
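A minimal sketch of that (the output file name is arbitrary):
Rprof("loop.out")          # start collecting profiling samples
## ... run the file-reading loop here ...
Rprof(NULL)                # stop profiling
summaryRprof("loop.out")   # shows which functions the time was spent in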
-Don
At 3:53 PM -0400 7/23/09, Denis Chabot wrote:
Hi,
I often have to do this:
- select a folder (directory) containing a few hundred data files in csv format (up to 1000 files, in fact)
- open each file and transform some character variables into date-time format
- make it into a dataframe (this involves getting rid of a few variables I don't need)
- concatenate it to the master dataframe that will eventually contain the data from all the files in the folder.
I use a loop going from 1 to the number of
files. I have added a command to print an
incrementing number to the R console each time
the loop completes one iteration, to judge the
speed of the process.
At the beginning, 3-4 files are processed each
second. After a few hundred iterations it slows
down to about 1 file per second. Before I reach
the last file (898 in the case at hand), it has
become much slower, about 1 file every 2-3
seconds.
This progressive slowing down suggests the
problem is linked to the size of the growing
"master" dataframe that rbind combines with each
new file.
In fact, the small script below confirms this, as nothing at all happens within the loop but rbind. You can cut the size of this example so as not to waste too much of your time:
# create a dummy data.frame and copy it into a large number of csv files
test <- file.path("test")
dir.create(test, showWarnings = FALSE)  # make sure the target folder exists
a <- 1:350
b <- rnorm(350, 100, 10)
c <- runif(350, 0, 100)
d <- month.name[runif(350, 1, 12)]
the.data <- data.frame(a, b, c, d)
for(i in 1:850){
  write.csv(the.data, file = paste(test, "/file_", i, ".csv", sep = ""))
}
# now let's make a single dataframe from all these csv files
all.files <- list.files(path = test, full.names = TRUE, pattern = "\\.csv$")
new.data <- NULL
system.time({
  for(i in all.files){
    in.data <- read.csv(i)
    if (is.null(new.data)) {
      new.data <- in.data
    } else {
      new.data <- rbind(new.data, in.data)
    }
    cat(paste(i, ", ", sep = ""))
  } # end for
}) # end system.time
   user  system elapsed
156.206  44.859 202.150
This is with
sessionInfo()
R version 2.9.1 Patched (2009-07-16 r48939)
x86_64-apple-darwin9.7.0
locale:
fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doBy_3.7 chron_2.3-30 timeDate_290.84
loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1 Hmisc_3.5-2
lattice_0.17-25 tools_2.9.1
Would it be better to somehow read each of the 850 files into its own dataframe first, and then rbind them all in a single operation?
Can I combine all my files without using a loop? I've never quite mastered the "apply" family of functions, but I have not seen examples that read files.
Thanks in advance,
Denis Chabot
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.