Just to add, I did not know that the speed of data access is so much different in matrix and dataframes. This is one for the future.
Thanks again Jim :) -Shivam On Mon, Sep 10, 2012 at 3:29 AM, Shivam <shivamsi...@gmail.com> wrote: > Thanks a lot Jim, it works a treat. Just had to change the date format > in the mCALL as well. But you saved me 80 hours of fretting and > frustration. Really thankful for it. > > Regards, > Shivam > > On Mon, Sep 10, 2012 at 1:33 AM, jim holtman <jholt...@gmail.com> wrote: >> How about an improvement to 16 seconds. The first thing to do is to >> convert you data to a matrix because accessing data in a dataframe is >> very expensive. If you run Rprof on your code you will see that all >> the time is spent in retrieving the information. Converting to a >> matrix and using matrix accessing is considerably faster. I did >> convert the POSIXct to Date. You were also paying a lot in the >> constant conversion of POSIXct to Date for your comparisons. I just >> replicated your CALL to 1 million rows for testing. >> >> >>> CALL <- >> + structure(list(NAME = c("STK", "STK", "STK", "STK", "STK", >> + "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, >> + 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L, >> + 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30", >> + "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3, >> + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, >> + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, >> + 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422, >> + 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct", >> + "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = >> c("NAME", >> + "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM", >> + "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") >>> >>> VOL <- >> + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, >> + 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt" >> + ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, >> + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, >> + 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA, >> + 6L), class = "data.frame") >>> >>> # convert to matrices for faster testing >>> mCALL <- cbind(CALL$DATE, CALL$EXPDATE) >>> mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL) # convert POSIXct to Date >>> >>> # create 1M rows in mCALL >>> mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),]) >>> >>> result <- numeric(nrow(mCALL)) >>> system.time({ >> + for (i in 1:nrow(mCALL)){ >> + result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L]) >> + & (mVOL[, 1L] <= mCALL[i, 2L]), 2L]) >> + } >> + }) >> user system elapsed >> 15.94 0.00 16.07 >>> >>> >>> >>> >> >> >> On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsi...@gmail.com> wrote: >>> Dear All, >>> >>> I need to sum a column from another dataframe based on the row values >>> of one dataframe. I am stuck in a loop trying to accomplish it and at >>> current speed it will take more than 80 hours to complete. Needless to >>> say I am looking for a more elegant/quicker solution. Really need some >>> help here. Here is the issue: >>> >>> I have a dataframe CALL (the dput of head is given below) which has >>> close to a million rows. There are 2 date columns which are of >>> importance, DATE and EXPDATE. There is another dataframe, VOL (dput of >>> head given), which has 2 columns, DATE and VOL. It has the volatility >>> corresponding to each day and it has a total of 124 records >>> (corresponding to 6 months). I want to add another column in the CALL >>> dataframe which would contain the sum of all the volatilities from the >>> VOL df for the period specified by the interval of DATE and EXPDATE in >>> each row of CALL df. >>> >>> For ex: In the first row, DATE is '03-01-2011' and EXPDATE is >>> '27-01-2011'. So I want the SUM column (A new column in CALL df) to >>> contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01 >>> from the VOL dataframe. >>> >>> I have to repeat this process for all the rows in the dataframe. Here >>> is the for-loop version of the solution: >>> >>> for (k in 1:nrow(CALL)){ >>> CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE >>> <= CALL$EXPDATE[k])) >>> } >>> >>> The loop will run for close to a million times, it has been running >>> for more than 10 hours and its just 12% complete. It would take more >>> than 80 hours to complete, not the mention the toll it would take on >>> my laptop. So is there a better way that I can accomplish this task? >>> Any input would be greatly appreciated. Below are the dput of the two >>> dataframes. >>> >>> One point of note is that there are only 124 DISTINCT values of DATE >>> and 6 DISTINCT values of EXPDATE, in case it can be used in some way. >>> >>>> dput(CALL) >>> structure(list(NAME = c("STK", "STK", "STK", "STK", "STK", >>> "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, >>> 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L, >>> 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30", >>> "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3, >>> 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, >>> 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, >>> 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422, >>> 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct", >>> "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME", >>> "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM", >>> "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") >>> >>> >>>> dput(VOL) >>> structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, >>> 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt" >>> ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, >>> 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, >>> 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA, >>> 6L), class = "data.frame") >>> >>> Please do let me know if any more information from my side would help >>> or if I need to explain the issue more clearly. >>> >>> Any minor improvement will be great help. >>> >>> Thanks in advance. >>> >>> -Shivam >>> >>> >>> -- >>> *Victoria Concordia Crescit* >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >> >> >> >> -- >> Jim Holtman >> Data Munger Guru >> >> What is the problem that you are trying to solve? >> Tell me what you want to do, not how you want to do it. > > > > -- > *Victoria Concordia Crescit* -- *Victoria Concordia Crescit* ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.