merge() is not optimized for large data frames. To do things on this scale you really want to be using a DBMS not R. See the `R Data Import/Export Manual'.
Sorting is not really relevant, especially as merge is not assuming that the match is unique. Hashing could be used, but is not. As R is open source, you have the source code and it would be kinder to read it yourself rather than expect this list to read it for you. A useful contribution to the R project would be to contribute a more efficient version, and we look forwards to seeing your contribution. On Mon, 8 May 2006, Charles Cheung wrote: > Hello all, > > A problem I encounter today is the speed which takes to sort two huge data > frames... > > I wish to sort by (X,Y) > > Dataframe One consists of variables: > X, Y, sequence, position > having ~700 000 records > > another dataframe consists of > X,Y, intensities > having ~900 000 records > > > Every (X,Y) pair in dataframe One is included in dataframe Two, > however, the reverse is not true. > Furthermore, (X,Y, position) in data frame One makes the record unique. > (That means there can be multiple records with the same (X,Y) records!) > > Added together, it makes it hard to just combine the two data frames > together by simply going > data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond > even in sorted records by X and Y. > > > Intuitive, it should only require very little time <O(n) complexity> after > the data records are sorted. > However, it takes so long (I haven't finished the process in 20 minutes.. it > should only take <1 min) to merge the list by X and Y using > > merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect > this process is not optimized for already sorted list. > > * assuming the two frames have been sorted, I would be able to do the > following: > > > X Y seq Pos > 1 1 AA 32 > 1 2 AG 44 > 1 3 GC 65 > > > X Y intensities > 1 1 0.4 > 1 3 0.552 > >>> Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor >>> moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) .. >>> merge that pair > > Is the merge function doing that already? > > > Is there an efficient way to merge the data frames? (What do you suggest I > should do?) > > > (to produce) > X Y seq pos intensities > 1 1 AA 32 0.4 > 1 3 GC 65 0.552 > > Thank you in advance! > > > Charles Cheung > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel