On 06.11.2012 05:24, Benjamin Caldwell wrote:
Thanks to you all. Modifying the code to use rbind instead of merge worked
like a charm - I can only speculate why.

Because merge is not intended for appending. See its help page.
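For a quick toy illustration of the difference (hypothetical data frames a and b):

a <- data.frame(x = 1:2, y = c("p", "q"))
b <- data.frame(x = 1:2, y = c("r", "s"))
rbind(a, b)            # appends: four rows, same two columns
merge(a, b, by = "x")  # joins on x: two rows with columns x, y.x, y.y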

Uwe Ligges

Best

*Ben Caldwell*

PhD Candidate
University of California, Berkeley
130 Mulford Hall #3114
Berkeley, CA 94720
Office 223 Mulford Hall
(510)859-3358



On Sat, Nov 3, 2012 at 2:19 PM, jim holtman <jholt...@gmail.com> wrote:

It's easier than that.  I forgot that I can do it entirely within R:

setwd("/temp/csv")
files <- Sys.glob("daily*csv")
output <- file('Rcombined.csv', 'w')
for (i in files){
     cat(i, '\n')  # print the file being processed
     input <- readLines(i)
     input <- input[-1L]  # delete header
     writeLines(input, output)
}
close(output)
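
If the combined file should keep a single header row, a small variant of
the same loop (same working-directory and file-pattern assumptions) would be:

setwd("/temp/csv")
files <- Sys.glob("daily*csv")
output <- file('Rcombined.csv', 'w')
for (i in seq_along(files)) {
     cat(files[i], '\n')              # print the file being processed
     input <- readLines(files[i])
     if (i > 1) input <- input[-1L]   # keep only the first file's header
     writeLines(input, output)
}
close(output)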



On Sat, Nov 3, 2012 at 4:56 PM, jim holtman <jholt...@gmail.com> wrote:
These are not commands, but programs you can use.  Here is a file-copy
program in Perl.  It will copy all the files that have "daily" in their
names, and it skips the first line of each file, assuming that line is
the header.

Perl can be found on most systems.  www.activestate.com has a version
that runs under Windows, and that is what I am using.


chdir "/temp/csv";  # my directory with files
@files = glob "daily*csv";  # get files to copy (daily......csv)
open OUTPUT, ">combined.csv"; # output file
# loop for each file
foreach $file (@files) {
     print $file, "\n";  # print file being processed
     open INPUT, "<" . $file;
     # assume that the first line is a header, so skip it
     $header = <INPUT>;
     @all = <INPUT>;  # read rest of the file
     close INPUT;
     print OUTPUT @all;  # append to the output
}
close OUTPUT;

Here is what was printed on the console:


C:\Users\Owner>perl copyFiles.pl
daily.BO.csv
daily.C.csv
daily.CL.csv
daily.CT.csv
daily.GC.csv
daily.HO.csv
daily.KC.csv
daily.LA.csv
daily.LN.csv
daily.LP.csv
daily.LX.csv
daily.NG.csv
daily.S.csv
daily.SB.csv
daily.SI.csv
daily.SM.csv

Which was a list of all the files copied.

On Sat, Nov 3, 2012 at 4:08 PM, Benjamin Caldwell
<btcaldw...@berkeley.edu> wrote:
Jim,

Where can I find documentation of the commands you mention?
Thanks





On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholt...@gmail.com>
wrote:

A faster way would be to use something like 'perl', 'awk' or 'sed'.
You can strip off the header line of each CSV (if it has one) and then
concatenate the files together.  This is a very efficient use of memory,
since you are just reading one file at a time and then writing it out.
It will probably also be a lot faster, since no conversions have to be
done.  Once you have the one large file, you can play with it (load it
if you have enough memory, or load it into a database).
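
For instance, awk alone can do it; a sketch, assuming each file has a
one-line header and the files sit in the current directory:

awk 'FNR > 1' daily*.csv > combined.csv

FNR is the line number within the current input file, so 'FNR > 1' drops
each file's header.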

On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller
<jdnew...@dcn.davis.ca.us> wrote:
In the absence of any data examples from you per the posting guidelines,
I will refer you to the help files for the melt function in the reshape2
package.  Note that there can be various mixtures of wide versus long...
such as a wide file with one date column and columns representing all
stock prices and all trade volumes.  The longest format would be what
melt gives (date, column name, and value), but an in-between format would
have one distinct column each for dollar values and volume values, with a
column indicating ticker label and, of course, another for date.
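
For example, a minimal sketch with made-up prices:

library(reshape2)
wide <- data.frame(date = as.Date("2012-11-01") + 0:1,
                   AAPL = c(576.8, 582.9),
                   MSFT = c(29.5, 29.9))
melt(wide, id.vars = "date",
     variable.name = "ticker", value.name = "price")
# result is long: one row per date/ticker, with columns date, ticker, price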

If your csv files can be grouped according to those with similar column
"types", then as you read them in you can use cbind(csvlabel = "somelabel",
csvdf) to distinguish it, and then rbind those data frames together to
create an intermediate-width data frame.  When dealing with large amounts
of data you will want to minimize the amount of reshaping you do, but it
would require knowledge of your data and algorithms to say any more.
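
A sketch of that approach, with hypothetical file names:

files <- c("plot01.csv", "plot02.csv")   # files with matching column types
datalist <- lapply(files, function(f)
    cbind(csvlabel = basename(f), read.csv(f)))
combined <- do.call(rbind, datalist)     # stack into one data frame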


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:

Jeff,
If you're willing to educate, I'd be happy to learn what wide vs long
format means. I'll give rbind a shot in the meantime.
Ben
On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us>
wrote:

I would first confirm that you need the data in wide format... many
algorithms are more efficient in long format anyway, and rbind is way
more efficient than merge.

If you feel this is not negotiable, you may want to consider sqldf.
Yes, you need to learn a bit of SQL, but it is very well integrated
into R.
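
A minimal sketch, assuming hypothetical data frames df1 and df2 that
share key columns x and y:

library(sqldf)
merged <- sqldf("SELECT * FROM df1 LEFT JOIN df2 USING (x, y)")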




Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:

Dear R help;

I'm currently trying to combine a large number (about 30 x 30) of large
.csvs together (each at least 10000 records). They are organized by
plots, hence 30 x 30, with each group of csvs in a folder which
corresponds to the plot. The unmerged csvs all have the same number of
columns (5). The fifth column has a different name for each csv. The
number of rows is different.

The combined csvs are of course quite large, and the code I'm running is
quite slow - I'm currently running it on a computer with 10 GB of RAM, an
SSD, and a quad-core 2.3 GHz processor; it's taken 8 hours and it's only
75% of the way through (it's been hung up on one of the largest data
groupings for an hour now, and is using 3.5 GB of RAM).

I know that R isn't the most efficient way of doing this, but I'm not
familiar with SQL or C. I wonder if anyone has suggestions for a
different way to do this in the R environment. For instance, the key
function now is merge, but I haven't tried join from the plyr package or
rbind from base. I'm willing to provide a dropbox link to a couple of
these files if you'd like to see the data. My code is as follows:


# multmerge is based on code by Tony Cookson:
# http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
# The function takes a path, which should be the name of a folder that
# contains all of the files you would like to read and merge together,
# and only those files.

multmerge <- function(mypath) {
    filenames <- list.files(path = mypath, full.names = TRUE)
    datalist <- try(lapply(filenames,
                           function(x) read.csv(file = x, header = TRUE)))
    try(Reduce(function(x, y) merge(x, y, all = TRUE), datalist))
}

# this function renames the merged columns using a fixed list and writes a .csv

merepk <- function(path, nf.name) {
    output <- multmerge(mypath = path)
    name <- c("x", "y", "z", "depth", "amplitude")
    try(names(output) <- name)
    # row.names = FALSE avoids writing an extra unnamed column
    write.csv(output, nf.name, row.names = FALSE)
}

# assumes all folders are in the same directory, with nothing else there

merge.by.folder <- function(folderpath) {
    foldernames <- list.files(path = folderpath)
    setwd(folderpath)
    for (folder in foldernames) {
        path <- file.path(folderpath, folder)  # portable path joining
        nf.name <- paste0(folder, ".csv")
        merepk(path, nf.name)
    }
}

folderpath <- "yourpath"

merge.by.folder(folderpath)
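
(For reference, the rbind-based variant that Ben later reported working
might look roughly like this; a sketch only, reusing merepk's fixed
column names:)

multbind <- function(mypath) {
    filenames <- list.files(path = mypath, full.names = TRUE)
    datalist <- lapply(filenames, function(f) {
        d <- read.csv(f, header = TRUE)
        names(d) <- c("x", "y", "z", "depth", "amplitude")  # normalize 5th column
        d
    })
    do.call(rbind, datalist)  # stack rows instead of repeatedly merging
}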


Thanks for looking, and happy friday!



*Ben Caldwell*

PhD Candidate
University of California, Berkeley




--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.









