Hello all, A problem I encounter today is the speed which takes to sort two huge data frames...
I wish to sort by (X,Y) Dataframe One consists of variables: X, Y, sequence, position having ~700 000 records another dataframe consists of X,Y, intensities having ~900 000 records Every (X,Y) pair in dataframe One is included in dataframe Two, however, the reverse is not true. Furthermore, (X,Y, position) in data frame One makes the record unique. (That means there can be multiple records with the same (X,Y) records!) Added together, it makes it hard to just combine the two data frames together by simply going data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond even in sorted records by X and Y. Intuitive, it should only require very little time <O(n) complexity> after the data records are sorted. However, it takes so long (I haven't finished the process in 20 minutes.. it should only take <1 min) to merge the list by X and Y using merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect this process is not optimized for already sorted list. * assuming the two frames have been sorted, I would be able to do the following: X Y seq Pos 1 1 AA 32 1 2 AG 44 1 3 GC 65 X Y intensities 1 1 0.4 1 3 0.552 >>Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor >>moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) .. >>merge that pair Is the merge function doing that already? Is there an efficient way to merge the data frames? (What do you suggest I should do?) (to produce) X Y seq pos intensities 1 1 AA 32 0.4 1 3 GC 65 0.552 Thank you in advance! Charles Cheung ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel