How about an improvement to 16 seconds. The first thing to do is to convert you data to a matrix because accessing data in a dataframe is very expensive. If you run Rprof on your code you will see that all the time is spent in retrieving the information. Converting to a matrix and using matrix accessing is considerably faster. I did convert the POSIXct to Date. You were also paying a lot in the constant conversion of POSIXct to Date for your comparisons. I just replicated your CALL to 1 million rows for testing.
> CALL <- + structure(list(NAME = c("STK", "STK", "STK", "STK", "STK", + "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, + 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L, + 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30", + "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3, + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, + 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422, + 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct", + "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME", + "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM", + "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") > > VOL <- + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, + 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt" + ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, + 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA, + 6L), class = "data.frame") > > # convert to matrices for faster testing > mCALL <- cbind(CALL$DATE, CALL$EXPDATE) > mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL) # convert POSIXct to Date > > # create 1M rows in mCALL > mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),]) > > result <- numeric(nrow(mCALL)) > system.time({ + for (i in 1:nrow(mCALL)){ + result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L]) + & (mVOL[, 1L] <= mCALL[i, 2L]), 2L]) + } + }) user system elapsed 15.94 0.00 16.07 > > > > On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsi...@gmail.com> wrote: > Dear All, > > I need to sum a column from another dataframe based on the row values > of one dataframe. I am stuck in a loop trying to accomplish it and at > current speed it will take more than 80 hours to complete. Needless to > say I am looking for a more elegant/quicker solution. Really need some > help here. Here is the issue: > > I have a dataframe CALL (the dput of head is given below) which has > close to a million rows. There are 2 date columns which are of > importance, DATE and EXPDATE. There is another dataframe, VOL (dput of > head given), which has 2 columns, DATE and VOL. It has the volatility > corresponding to each day and it has a total of 124 records > (corresponding to 6 months). I want to add another column in the CALL > dataframe which would contain the sum of all the volatilities from the > VOL df for the period specified by the interval of DATE and EXPDATE in > each row of CALL df. > > For ex: In the first row, DATE is '03-01-2011' and EXPDATE is > '27-01-2011'. So I want the SUM column (A new column in CALL df) to > contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01 > from the VOL dataframe. > > I have to repeat this process for all the rows in the dataframe. Here > is the for-loop version of the solution: > > for (k in 1:nrow(CALL)){ > CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE > <= CALL$EXPDATE[k])) > } > > The loop will run for close to a million times, it has been running > for more than 10 hours and its just 12% complete. It would take more > than 80 hours to complete, not the mention the toll it would take on > my laptop. So is there a better way that I can accomplish this task? > Any input would be greatly appreciated. Below are the dput of the two > dataframes. > > One point of note is that there are only 124 DISTINCT values of DATE > and 6 DISTINCT values of EXPDATE, in case it can be used in some way. > >> dput(CALL) > structure(list(NAME = c("STK", "STK", "STK", "STK", "STK", > "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, > 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L, > 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30", > "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3, > 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, > 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, > 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422, > 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct", > "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME", > "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM", > "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") > > >> dput(VOL) > structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, > 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt" > ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, > 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, > 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA, > 6L), class = "data.frame") > > Please do let me know if any more information from my side would help > or if I need to explain the issue more clearly. > > Any minor improvement will be great help. > > Thanks in advance. > > -Shivam > > > -- > *Victoria Concordia Crescit* > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.