> Any comments are very welcome,

So I'll give it a shot, although I don't have answers, only some ideas about which avenues I would explore, not being an expert at all:
1. I would try to be more restrictive with the columns used for the merge, trying something like

       m1 <- merge( x, y, by.x = "V1", by.y = "V1", all = TRUE )

2. It may be an option to use match() directly:

       indices <- match( y$V1, x$V1 )

   That should give you a vector of 300,000 indices mapping the y values to their corresponding x records. I assume that each record in y matches exactly one record in x. You would still need to write some code to add the corresponding y values to a new column in x.

3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.

Again, no expert advice, just a few ideas!

Rgds,
Rainer

On Tuesday 04 October 2011 01:01:45 Aurélien PHILIPPOT wrote:
> Dear all,
> I am new to R and I have been faced with the following problem, which slows
> me down a lot. I am short of ideas to circumvent it. So, any help would be
> highly appreciated:
>
> I have 2 dataframes x and y. x is very big (70 million observations),
> whereas y is smaller (300000 observations).
> All the observations of y are present in x. But y has one additional
> variable that I would like to incorporate into the dataframe x.
>
> For instance, imagine they have the following variable names:
> colnames(x)<- c("V1", "V2", "V3", "V4") and colnames(y)<- c("V1", "V2",
> "V5")
>
> -Since the observations of y are present in x, my strategy was to merge x
> and y so that the dataframe x would get the values of the variable V5 for
> the observations that are both in x and y.
>
> -So, I did the following:
> dat<- merge(x, y, all=TRUE).
>
> On a small example, it works fine. The only problem is that when I apply it
> to my big dataframe x, it really takes forever (several days and not done
> yet) and I have a very fast computer. So, I don't know whether I should
> stop now or keep on waiting.
>
> Does anyone have any idea how to perform this operation in a more efficient
> way (in terms of computation time)?
> In addition, does anyone know how to incorporate some sort of counter in a
> program to check how much work has been done at a given point in time?
>
> Any comments are very welcome,
> Thanks,
>
> Best,
> Aurelien
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
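
To make idea 2 and the counter question a bit more concrete, here is a rough sketch on toy data, not tested at anything like 70 million rows. The toy values, the chunk size and the names chunk, starts, v5 and pb are made up for illustration, and the code assumes that V1 alone identifies a record and is unique in x. If a record is really identified by V1 and V2 together, a combined key such as paste(x$V1, x$V2) could probably be matched instead.

```r
# Toy stand-ins for the real data frames (values are made up).
x <- data.frame(V1 = 1:10, V2 = letters[1:10], V3 = 0, V4 = 0,
                stringsAsFactors = FALSE)
y <- data.frame(V1 = c(3L, 7L, 9L), V2 = c("c", "g", "i"),
                V5 = c(30, 70, 90), stringsAsFactors = FALSE)

# Idea 2: positions in x of the records listed in y.
# Assumption: V1 alone identifies a record and is unique in x.
indices <- match(y$V1, x$V1)
stopifnot(!anyNA(indices))   # every record of y must be found in x

x$V5 <- NA_real_             # new column, NA where y has no value
x$V5[indices] <- y$V5        # copy V5 over to the matching rows of x

# Counter: the same lookup done the other way round (x against y),
# in chunks, with a text progress bar updated after each chunk.
chunk  <- 4L                                   # hypothetical chunk size
starts <- seq(1L, nrow(x), by = chunk)         # first row of each chunk
v5     <- rep(NA_real_, nrow(x))
pb     <- txtProgressBar(min = 0, max = length(starts), style = 3)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk - 1L, nrow(x))
  v5[rows] <- y$V5[match(x$V1[rows], y$V1)]    # NA where no match in y
  setTxtProgressBar(pb, i)
}
close(pb)

# Both routes should give the same column.
stopifnot(identical(v5, x$V5))
```

The chunked loop is only there to have something to count; on the real data a single vectorised y$V5[match(x$V1, y$V1)] may well be fast enough that no counter is needed.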