[R] index instead of loop?

Ben quant Mon, 05 Mar 2012 12:55:51 -0800

Hello,

Does anyone know of a way I can speed this up? Basically I'm attempting to
get the data item on the same row as the report date for each report date
available. In reality, I have over 11k of columns, not just A, B, C, D and
I have to do that over 100 times. My solution is slow, but it works. The
loop is slow because of merge.


# create sample data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 =
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",

"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################# change anything below.

# My first attempt at this
fix = function(x)
{
  year = substring(x, 1, 4);
  mo = substring(x, 5, 6);
  day = substring(x, 7, 8);
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
"day")
dt = sapply(dt1, as.character)

fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
  uac = ua[,cn]
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
  xx = as.character(xx1[,2])
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin  = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}

colnames(fin) = c('daily_dates',nms)

print("over all time for loop")
print(Sys.time()-start_t_all)

print(fin)


Thanks,

Ben

PS - the real/over-all issue is below, but it is probably too involved to
follow.

On Sat, Mar 3, 2012 at 2:30 PM, Ben quant <ccqu...@gmail.com> wrote:

> Hello,
>
> Thank you for your help/advice!
>
> The issue here is speed/efficiency. I can do what I want, but its really
> slow.
>
> The goal is to have the ability to do calculations on my data and have it
> adjusted for look-ahead. I see two ways to do this:
> (I'm open to more ideas. My terminology: Unadjusted = values not adjusted
> for look-ahead bias; adjusted = values adjusted for look-ahead bias.)
>
> 1) I could a) do calculations on unadjusted values then b) adjust the
> resulting values for look-ahead bias. Here is what I mean:
>  a) I could say the following using time series of val1: [(val1 - val1 4
> periods ago) / val1 4 periods ago] = resultval. ("Periods" correspond to
> the z.dates in my example below.)
> b) Then I would adjust the resultval for look-ahead based on val1's
> associated report date.
> Note: I don't think this will be the fastest.
>
> 2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4
> periods ago] = resultval, but my calculation function would get the 'right'
> values that would have no look-ahead bias. I'm not sure how I would do
> this, but maybe a query starting with the date that I want, indexed to
> appropriate report date indexed to the correct value to return. But how do
> I do this in R? I think I would have to put this in our database and do a
> query. The data comes to me in RData format. I could put it all in our
> database via PpgSQL which we already use.
> Note: I think this will be fastest.
>
> Anyway, my first attempt at this was to solve part b of #1 above. Here is
> how my data looks and my first attempt at solving part b of idea #1 above.
> It only takes 0.14 seconds for my mock data, but that is way too slow. The
> major things slowing it down A) the loop, B) the merge statement.
>
> # mock data: this is how it comes to me (raw)
> # in practice I have over 10,000 columns
>
> # the starting 'periods' for my data
> z.dates =
> c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
>
> nms = c("A","B","C","D")
> # these are the report dates that are the real days the data was available
> rd1 =
> matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
>
> "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
>               "20070426","--","--","--","--","--","--","20090319",
>               "--","--","--","--","--","--","--","--"),
>             nrow=8,ncol=4)
> dimnames(rd1) = list(z.dates,nms)
>
> # this is the unadjusted raw data, that always has the same dimensions,
> rownames, and colnames as the report dates
> ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
>
> 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
>               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
>               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
>             nrow=8,ncol=4)
> dimnames(ua) = list(z.dates,nms)
>
> ################################# change anything below. I can't change
> anything above this line.
>
> # My first attempt at this was to solve part b of #1 above.
> fix = function(x)
> {
>   year = substring(x, 1, 4);
>   mo = substring(x, 5, 6);
>   day = substring(x, 7, 8);
>   ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
> }
>
> rd = apply(rd1, 2, fix)
> dimnames(rd) = dimnames(rd)
>
> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
> "day")
> dt = sapply(dt1, as.character)
>
> fin = dt
> ck_rows = length(dt)
> bad = character(0)
> start_t_all = Sys.time()
> for(cn in 1:ncol(ua)){
>   uac = ua[,cn]
>   tkr = colnames(ua)[cn]
>   rdc = rd[,cn]
>   ua_rd = cbind(uac,rdc)
>   colnames(ua_rd) = c(tkr,'rt_date')
>   xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
>   xx = as.character(xx1[,2])
>   values <- c(NA, xx[!is.na(xx)])
>   ind = cumsum(!is.na(xx)) + 1
>   y <- values[ind]
>   if(ck_rows == length(y)){
>     fin  = data.frame(fin,y)
>   }else{
>     bad = c(bad,tkr)
>   }
> }
>
> colnames(fin) = c('daily_dates',nms)
>
> # after this I would slice and dice the data into weekly, monthly, etc.
> periodicity as needed, but this leaves it in daily format which is as
> granular as I will get.
>
> print("over all time for loop")
> print(Sys.time()-start_t_all)
>
> Regards,
>
> Ben
>
>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] index instead of loop?

Reply via email to