Hello, Does anyone know of a way I can speed this up? Basically I'm attempting to get the data item on the same row as the report date for each report date available. In reality, I have over 11k of columns, not just A, B, C, D and I have to do that over 100 times. My solution is slow, but it works. The loop is slow because of merge.
# create sample data z.dates = c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31") nms = c("A","B","C","D") # these are the report dates that are the real days the data was available rd1 = matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217", "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112", "20070426","--","--","--","--","--","--","20090319", "--","--","--","--","--","--","--","--"), nrow=8,ncol=4) dimnames(rd1) = list(z.dates,nms) # this is the unadjusted raw data, that always has the same dimensions, rownames, and colnames as the report dates ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65, 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000, NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138, NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN), nrow=8,ncol=4) dimnames(ua) = list(z.dates,nms) ################################# change anything below. # My first attempt at this fix = function(x) { year = substring(x, 1, 4); mo = substring(x, 5, 6); day = substring(x, 7, 8); ifelse(year=="--", "NA", paste(year, mo, day, sep = "-")) } rd = apply(rd1, 2, fix) dimnames(rd) = dimnames(rd) dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day") dt = sapply(dt1, as.character) fin = dt ck_rows = length(dt) bad = character(0) start_t_all = Sys.time() for(cn in 1:ncol(ua)){ uac = ua[,cn] tkr = colnames(ua)[cn] rdc = rd[,cn] ua_rd = cbind(uac,rdc) colnames(ua_rd) = c(tkr,'rt_date') xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T) xx = as.character(xx1[,2]) values <- c(NA, xx[!is.na(xx)]) ind = cumsum(!is.na(xx)) + 1 y <- values[ind] if(ck_rows == length(y)){ fin = data.frame(fin,y) }else{ bad = c(bad,tkr) } } colnames(fin) = c('daily_dates',nms) print("over all time for loop") print(Sys.time()-start_t_all) print(fin) Thanks, Ben PS - the real/over-all issue is below, but it is probably too involved to follow. On Sat, Mar 3, 2012 at 2:30 PM, Ben quant <ccqu...@gmail.com> wrote: > Hello, > > Thank you for your help/advice! > > The issue here is speed/efficiency. I can do what I want, but its really > slow. > > The goal is to have the ability to do calculations on my data and have it > adjusted for look-ahead. I see two ways to do this: > (I'm open to more ideas. My terminology: Unadjusted = values not adjusted > for look-ahead bias; adjusted = values adjusted for look-ahead bias.) > > 1) I could a) do calculations on unadjusted values then b) adjust the > resulting values for look-ahead bias. Here is what I mean: > a) I could say the following using time series of val1: [(val1 - val1 4 > periods ago) / val1 4 periods ago] = resultval. ("Periods" correspond to > the z.dates in my example below.) > b) Then I would adjust the resultval for look-ahead based on val1's > associated report date. > Note: I don't think this will be the fastest. > > 2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4 > periods ago] = resultval, but my calculation function would get the 'right' > values that would have no look-ahead bias. I'm not sure how I would do > this, but maybe a query starting with the date that I want, indexed to > appropriate report date indexed to the correct value to return. But how do > I do this in R? I think I would have to put this in our database and do a > query. The data comes to me in RData format. I could put it all in our > database via PpgSQL which we already use. > Note: I think this will be fastest. > > Anyway, my first attempt at this was to solve part b of #1 above. Here is > how my data looks and my first attempt at solving part b of idea #1 above. > It only takes 0.14 seconds for my mock data, but that is way too slow. The > major things slowing it down A) the loop, B) the merge statement. > > # mock data: this is how it comes to me (raw) > # in practice I have over 10,000 columns > > # the starting 'periods' for my data > z.dates = > c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31") > > nms = c("A","B","C","D") > # these are the report dates that are the real days the data was available > rd1 = > matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217", > > "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112", > "20070426","--","--","--","--","--","--","20090319", > "--","--","--","--","--","--","--","--"), > nrow=8,ncol=4) > dimnames(rd1) = list(z.dates,nms) > > # this is the unadjusted raw data, that always has the same dimensions, > rownames, and colnames as the report dates > ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65, > > 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000, > NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138, > NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN), > nrow=8,ncol=4) > dimnames(ua) = list(z.dates,nms) > > ################################# change anything below. I can't change > anything above this line. > > # My first attempt at this was to solve part b of #1 above. > fix = function(x) > { > year = substring(x, 1, 4); > mo = substring(x, 5, 6); > day = substring(x, 7, 8); > ifelse(year=="--", "NA", paste(year, mo, day, sep = "-")) > } > > rd = apply(rd1, 2, fix) > dimnames(rd) = dimnames(rd) > > dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = > "day") > dt = sapply(dt1, as.character) > > fin = dt > ck_rows = length(dt) > bad = character(0) > start_t_all = Sys.time() > for(cn in 1:ncol(ua)){ > uac = ua[,cn] > tkr = colnames(ua)[cn] > rdc = rd[,cn] > ua_rd = cbind(uac,rdc) > colnames(ua_rd) = c(tkr,'rt_date') > xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T) > xx = as.character(xx1[,2]) > values <- c(NA, xx[!is.na(xx)]) > ind = cumsum(!is.na(xx)) + 1 > y <- values[ind] > if(ck_rows == length(y)){ > fin = data.frame(fin,y) > }else{ > bad = c(bad,tkr) > } > } > > colnames(fin) = c('daily_dates',nms) > > # after this I would slice and dice the data into weekly, monthly, etc. > periodicity as needed, but this leaves it in daily format which is as > granular as I will get. > > print("over all time for loop") > print(Sys.time()-start_t_all) > > Regards, > > Ben > > > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.