How about an improvement to 16 seconds.  The first thing to do is to
convert you data to a matrix because accessing data in a dataframe is
very expensive.  If you run Rprof on your code you will see that all
the time is spent in retrieving the information.  Converting to a
matrix and using matrix accessing is considerably faster.  I did
convert the POSIXct to Date.  You were also paying a lot in the
constant conversion of POSIXct to Date for your comparisons.  I just
replicated your CALL to 1 million rows for testing.


> CALL <-
+ structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
+ "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
+ 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
+ 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
+ "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
+ 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
+ 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
+ 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
+ 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
+ "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
+ "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
+ "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>
> VOL <-
+ structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
+ 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
+ ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
+ 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
+ 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
+ 6L), class = "data.frame")
>
> # convert to matrices for faster testing
> mCALL <- cbind(CALL$DATE, CALL$EXPDATE)
> mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL)  # convert POSIXct to Date
>
> # create 1M rows in mCALL
> mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),])
>
> result <- numeric(nrow(mCALL))
> system.time({
+     for (i in 1:nrow(mCALL)){
+         result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L])
+                             & (mVOL[, 1L] <= mCALL[i, 2L]), 2L])
+     }
+ })
   user  system elapsed
  15.94    0.00   16.07
>
>
>
>


On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsi...@gmail.com> wrote:
> Dear All,
>
> I need to sum a column from another dataframe based on the row values
> of one dataframe. I am stuck in a loop trying to accomplish it and at
> current speed it will take more than 80 hours to complete. Needless to
> say I am looking for a more elegant/quicker solution. Really need some
> help here. Here is the issue:
>
> I have a dataframe CALL (the dput of head is given below) which has
> close to a million rows. There are 2 date columns which are of
> importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
> head given), which has 2 columns, DATE and VOL. It has the volatility
> corresponding to each day and it has a total of 124 records
> (corresponding to 6 months). I want to add another column in the CALL
> dataframe which would contain the sum of all the volatilities from the
> VOL df for the period specified by the interval of DATE and EXPDATE in
> each row of CALL df.
>
> For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
> '27-01-2011'. So I want the SUM column (A new column in CALL df) to
> contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01
> from the VOL dataframe.
>
> I have to repeat this process for all the rows in the dataframe. Here
> is the for-loop version of the solution:
>
> for (k in 1:nrow(CALL)){
> CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE
> <= CALL$EXPDATE[k]))
> }
>
> The loop will run for close to a million times, it has been running
> for more than 10 hours and its just 12% complete. It would take more
> than 80 hours to complete, not the mention the toll it would take on
> my laptop. So is there a better way that I can accomplish this task?
> Any input would be greatly appreciated. Below are the dput of the two
> dataframes.
>
> One point of note is that there are only 124 DISTINCT values of DATE
> and 6 DISTINCT values of EXPDATE, in case it can be used in some way.
>
>> dput(CALL)
> structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
> "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
> 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
> 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
> "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
> 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
> 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
> 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
> 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
> "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
> "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
> "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>
>
>> dput(VOL)
> structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
> 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
> ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
> 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
> 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
> 6L), class = "data.frame")
>
> Please do let me know if any more information from my side would help
> or if I need to explain the issue more clearly.
>
> Any minor improvement will be great help.
>
> Thanks in advance.
>
> -Shivam
>
>
> --
> *Victoria Concordia Crescit*
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to