[R] merging a large number of large .csvs

Benjamin Caldwell Fri, 02 Nov 2012 16:03:45 -0700

Dear R help,
I'm currently trying to combine a large number (about 30 x 30) of large
.csv files (each at least 10,000 records). They are organized by plot,
hence 30 x 30, with each group of files in a folder that corresponds to
its plot. The unmerged files all have the same number of columns (5). The
fifth column has a different name in each file, and the number of rows
differs from file to file.


The combined files are of course quite large, and the code I'm running is
quite slow. I'm currently running it on a computer with 10 GB of RAM, an
SSD, and a quad-core 2.3 GHz processor; it has taken 8 hours and is only
75% of the way through (it has been hung up on one of the largest data
groupings for an hour now, using 3.5 GB of RAM).

I know that R isn't the most efficient way of doing this, but I'm not
familiar with SQL or C. I wonder if anyone has suggestions for a different
way to do this in the R environment. For instance, the key function now is
merge, but I haven't tried join from the plyr package or rbind from base.
I'm willing to provide a Dropbox link to a couple of these files if you'd
like to see the data. My code is as follows:


# multmerge is based on code by Tony Cookson,
# http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
# The function takes a path. This path should be the name of a folder that
# contains all of the files you would like to read and merge together, and
# only those files.

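multmerge <- function(mypath) {
  # Read every file in the folder into a list of data frames
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- try(lapply(filenames, function(x) read.csv(file = x, header = TRUE)))
  # Successively outer-join the data frames on their shared column names
  try(Reduce(function(x, y) merge(x, y, all = TRUE), datalist))
}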
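
To make the Reduce/merge step concrete, here is a minimal sketch on two toy data frames. The values and the fifth-column names amp_a and amp_b are made up for illustration and are not taken from the real files; the only assumption carried over is that the first four columns share names and the fifth does not.

```r
# Toy data (hypothetical values): two files from one plot that share the
# x, y, z and depth columns but each have their own fifth column.
a <- data.frame(x = c(1, 2), y = c(1, 2), z = c(0, 0), depth = c(5, 5),
                amp_a = c(0.1, 0.2))
b <- data.frame(x = c(2, 3), y = c(2, 3), z = c(0, 0), depth = c(5, 5),
                amp_b = c(0.3, 0.4))

# merge() joins on all column names the two frames have in common;
# all = TRUE keeps rows that have no match in the other frame (outer join).
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(a, b))

print(dim(merged))    # 3 rows, 6 columns
print(names(merged))  # shared key columns first, then amp_a, amp_b
```

Each additional file contributes one more column, so with k files in a folder the merged result has 4 + k columns, with NA wherever a file has no row for a given x, y, z, depth combination.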

#this function renames the columns using a fixed list and outputs a .csv

merepk <- function(path, nf.name) {
  output <- multmerge(mypath = path)
  name <- c("x", "y", "z", "depth", "amplitude")
  try(names(output) <- name)
  write.csv(output, nf.name)
}

#assumes all folders are in the same directory, with nothing else there

merge.by.folder <- function(folderpath) {
  foldernames <- list.files(path = folderpath)
  setwd(folderpath)  # output .csvs are written into folderpath
  for (i in seq_along(foldernames)) {
    path <- file.path(folderpath, foldernames[i])
    nf.name <- paste(foldernames[i], ".csv", sep = "")
    merepk(path, nf.name)
  }
}

folderpath <- "yourpath"

merge.by.folder(folderpath)
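
For comparison, the rbind route mentioned above might look roughly like the sketch below. It is a different operation from merge: it stacks the files in long format instead of joining them side by side, so it only makes sense if the fifth column holds the same kind of measurement in every file. The function name stack_csvs, the "source" column, and the demo files are hypothetical, and I have not timed this against the real data.

```r
# Sketch: stack the files of one folder instead of merging them.
# Assumes every file has 5 columns and that the fifth can share one name.
stack_csvs <- function(mypath) {
  filenames <- list.files(path = mypath, pattern = "\\.csv$", full.names = TRUE)
  datalist <- lapply(filenames, function(f) {
    d <- read.csv(f, header = TRUE)
    names(d)[5] <- "amplitude"   # give the fifth column one common name
    d$source <- basename(f)      # remember which file each row came from
    d
  })
  do.call(rbind, datalist)       # a single rbind call over the whole list
}

# Tiny demonstration with two temporary files (made-up values).
demo_dir <- tempfile("plot_demo")
dir.create(demo_dir)
write.csv(data.frame(x = 1:2, y = 1:2, z = 0, depth = 5, a1 = c(0.1, 0.2)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 2:4, y = 2:4, z = 0, depth = 5, b1 = c(0.3, 0.4, 0.5)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

stacked <- stack_csvs(demo_dir)
print(dim(stacked))   # 5 rows, 6 columns
```

Because nothing has to be matched on key columns, stacking is likely to be much cheaper than repeated outer joins, but whether the long layout is usable depends on what is done with the data afterwards.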


Thanks for looking, and happy Friday!



*Ben Caldwell*

PhD Candidate
University of California, Berkeley


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
