The acf and ccf functions assume that time series are stationary, but yours are 
not.

I think that your alternative function is not well founded. You take a separate 
mean for each sub-series, which implicitly allows the mean of the series to 
vary arbitrarily with time. However, you only have one realization of each time 
series, so you cannot fit a non-parametric model for the means.

A parametric approach is to remove a linear trend from the series and then 
apply ccf:

e1 <- residuals(lm(y1 ~ x))  # remove the linear trend from y1
e2 <- residuals(lm(y2 ~ x))  # remove the linear trend from y2
ccf(e1, e2)

which does identify a lag of -8 as the best match, although the maximum 
correlation (0.87) is somewhat lower than the 0.99 you found.
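
For reference, the lag of the maximum can be read directly off the ccf object; 
a quick sketch using the e1 and e2 defined above (the name r12 is arbitrary):

r12 <- ccf(e1, e2, plot = FALSE)
r12$lag[which.max(r12$acf)]   # lag with the largest cross-correlation (-8 here)
max(r12$acf)                  # the 0.87 quoted above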

Martyn

________________________________________
From: r-devel-boun...@r-project.org [r-devel-boun...@r-project.org] on behalf 
of Sami Toppinen [sami.toppi...@kolumbus.fi]
Sent: 04 November 2014 09:44
To: r-devel@r-project.org
Subject: [Rd] [R] Calculation of cross-correlation in ccf

Dear All,

I am studying some process measurement time series in R and trying to identify 
time delays using the cross-correlation function ccf. The results have, however, 
been a bit confusing. I found a message about this issue from a couple of years 
ago but unfortunately was not able to find it again for reference.

For example, an obvious time shift is observed between the measurements y1 and 
y2 when the following test data is plotted:

x <- 1:121
y1 <- c(328.27, 328.27, 328.27, 328.27, 328.21, 328.14, 328.14, 328.01,
  328.07, 328.01, 327.87, 328.01, 328.07, 328.27, 328.27, 328.54, 328.61,
  328.74, 328.88, 329.01, 329.01, 329.21, 329.28, 329.35, 329.35, 329.42,
  329.35, 329.28, 329.28, 329.15, 329.08, 329.08, 328.95, 328.95, 328.95,
  328.95, 329.01, 329.15, 329.21, 329.28, 329.55, 329.62, 329.75, 329.82,
  329.89, 330.09, 330.09, 330.29, 330.29, 330.36, 330.42, 330.29, 330.15,
  330.22, 330.09, 329.95, 329.82, 329.75, 329.62, 329.55, 329.48, 329.55,
  329.68, 329.75, 329.82, 329.89, 330.09, 330.09, 330.15, 330.22, 330.42,
  330.42, 330.42, 330.36, 330.42, 330.22, 330.15, 330.09, 329.89, 329.75,
  329.55, 329.35, 329.35, 329.42, 329.48, 329.55, 329.75, 329.75, 329.82,
  330.09, 330.15, 330.42, 330.42, 330.62, 330.69, 330.69, 330.83, 330.83,
  330.76, 330.62, 330.62, 330.56, 330.42, 330.42, 330.29, 330.29, 330.29,
  330.29, 330.22, 330.49, 330.56, 330.62, 330.76, 331.03, 330.96, 331.16,
  331.23, 331.50, 331.63, 332.03, 332.03)
y2 <- c(329.68, 329.75, 329.82, 329.95, 330.02, 330.15, 330.22, 330.36,
  330.22, 330.29, 330.29, 330.29, 330.29, 330.15, 330.22, 330.22, 330.15,
  330.15, 330.15, 330.15, 330.15, 330.29, 330.49, 330.49, 330.62, 330.89,
  331.03, 331.09, 331.16, 331.30, 331.30, 331.36, 331.43, 331.43, 331.43,
  331.36, 331.36, 331.36, 331.36, 331.23, 331.23, 331.16, 331.16, 331.23,
  331.30, 331.23, 331.36, 331.56, 331.70, 331.83, 331.97, 331.97, 332.10,
  332.30, 332.44, 332.44, 332.51, 332.51, 332.57, 332.57, 332.51, 332.37,
  332.24, 332.24, 332.10, 331.97, 331.97, 331.90, 331.83, 331.97, 331.97,
  331.97, 332.03, 332.24, 332.30, 332.30, 332.37, 332.57, 332.57, 332.57,
  332.57, 332.57, 332.51, 332.37, 332.30, 332.17, 331.97, 331.83, 331.70,
  331.70, 331.63, 331.63, 331.70, 331.83, 331.90, 332.10, 332.24, 332.30,
  332.44, 332.57, 332.71, 332.84, 332.77, 332.91, 332.84, 332.84, 332.91,
  332.84, 332.77, 332.77, 332.64, 332.64, 332.57, 332.57, 332.57, 332.57,
  332.57, 332.71, 332.98, 333.24, 333.58)
matplot(cbind(y1, y2))

However, the cross-correlation function surprisingly gives its maximum 
correlation, 0.83, at zero lag:

ccf(y1, y2)

Plotting the variables against each other at zero lag

plot(y1, y2)

shows that the correlation is not that good. Instead, a very nice correlation 
is obtained when the variables are shifted:

plot(y1[1:113], y2[1:113 + 8])
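
The corresponding correlation coefficient can be checked directly; this is the 
same pairing that the cross.cor function below uses at lag -8:

cor(y1[1:113], y2[1:113 + 8])   # about 0.99, matching the cross.cor result below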

As a comparison, I defined my own cross-correlation function:

cross.cor <- function(x, y, k)
{
  # Correlate x[t + k] with y[t] using only the overlapping part of the
  # series; cor() then takes the means and standard deviations from the
  # truncated sub-series, unlike ccf.
  n <- length(x) - abs(k)
  if (k >= 0)
    cor(x[(1 + k):(n + k)], y[1:n])
  else
    cor(x[1:n], y[(1 - k):(n - k)])
}

The new function cross.cor gives a maximum correlation of 0.99 at lag -8, and 
the shape of the correlation function is very different from the one obtained 
with ccf (both methods give the same value at zero lag):

plot(-15:15, sapply(-15:15, function(lag) cross.cor(y1, y2, lag)),
  ylim = c(0.3, 1))
points(-15:15, ccf(y1, y2, 15, plot = FALSE)$acf, col = "red")

In order to understand the reason for the differences, I looked at the source 
code of the ccf function. When the variables are compared at a nonzero lag, some 
of the data points must be left out from the ends of the time series. It appears 
to me that ccf first calculates (partly in R code, partly in C code) the means 
and the standard deviations from the whole data, and then uses those values as 
constants in the further calculations. The time series are truncated only in the 
summations for the covariances.
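
The following small function is my attempt to reproduce that calculation in 
plain R (a sketch of my reading of the code, not the actual implementation): 
the means and variances come from the full series, only the cross-product sum 
is truncated, and the divisor is the full length n.

ccf.by.hand <- function(x, y, k)
{
  # Sketch: full-series means and variances, truncation only in the
  # cross-product sum, divisor n (as in acf/ccf, if I read the code right).
  n <- length(x)
  mx <- mean(x); my <- mean(y)
  ix <- if (k >= 0) (1 + k):n else 1:(n + k)   # indices of x[t + k]
  iy <- if (k >= 0) 1:(n - k) else (1 - k):n   # indices of y[t]
  cxy <- sum((x[ix] - mx) * (y[iy] - my)) / n
  cxy / sqrt(mean((x - mx)^2) * mean((y - my)^2))
}

# should reproduce the red points in the comparison plot above
sapply(-15:15, function(lag) ccf.by.hand(y1, y2, lag))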

That whole-series approach naturally speeds up the computations, but is it 
really correct? Is the approach used in ccf somehow statistically more correct? 
I suppose the strong increasing trend in my test data emphasizes this issue 
(leaving data points out from one end has a big impact on the average).
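
As a quick illustration of that last point, compare the full-series mean with 
the mean of the lag -8 sub-series used above:

mean(y2)             # mean of the whole series, as used by ccf
mean(y2[1:113 + 8])  # mean of the sub-series used by cross.cor at lag -8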

Best regards,
Sami Toppinen
sami.toppi...@kolumbus.fi
