When missing observations (NAs) are present in the data and the regular or
differencing filters are applied in
stats::arima, the number of used observations (nobs) reported by nobs() is
bigger than the actual number of non-NA observations. This value is used to
compute the covariance matrix of parameter estimates and by BIC(), so this is
relevant for model selection.
Example: the same number of observations are used in both fits below, but
nobs() reports different values:
x <- round(log(AirPassengers),2)
length(x)
#[1] 144
x[c(36,52)] <- NA
# fit ARIMA(0,1,0) to the original series
fit1 <- stats::arima(x, order=c(0,1,0), include.mean=FALSE, method="ML")
nobs(fit1)
#[1] 141
# fit ARIMA(0,0,0) to the differenced series
fit2 <- stats::arima(diff(x), order=c(0,0,0), include.mean=FALSE,
method="ML")
nobs(fit2)
#[1] 139
In "stats::arima", nobs is obtained as:
n.used <- sum(!is.na(x)) - length(Delta)
where "length(Delta)" is the number of observations missed due to differencing.
The original data in "x" are used instead of the differenced data. This ignores
the fact that a missing observation generates two NAs in the differenced series:
diff(c(1,2,3,NA,5,6,7,8))
#[1] 1 1 NA NA 1 1 1
"stats::arima" counts the NAs in the original data, but this does not always
match the number of NAs in the differenced series, which is the series that is
used to compute the log-likelihood. Although there is only *one* NA
observation, the contribution to the log-likelihood of *two* observations will
be missed due to this NA, hence the number of contributions to the likelihood
is not "sum(!is.na(x)) - length(Delta)" but "sum(!is.na(diff(x)))".
The more NAs are in the data, the bigger will be the effect of this issue.
If the argument "order[2]=1", "seasonal$order[2]=0" and the NA is the first
observation, then my comment above is not an issue. It is not either an issue
when "seasonal$order[2]=1" and the NAs are in the first "seasonal$period"
observations. But in general, I believe the way nobs is computed requires
adjusting.
I would propose (reusing code from stats::arima):
dx <- x
if(d > 0L) {
dx <- diff(dx, 1L, order[2L])
}
if(seasonal$period > 1L & seasonal$order[2L] > 0) {
dx <- diff(dx, seasonal$period, seasonal$order[2L]) }
n.used <- sum(!is.na(dx))
If the argument "xreg" is NULL, "dx" is not needed and the following may be
more efficient than the above option (I checked it with just a few examples, so
the above may be a safer option):
d <- order[2L]
D <- seasonal$order[2L]
S <- seasonal$period
n.used <- length(x) - sum(is.na(x)[seq_len(S)]) - 2*(d + D) *
sum(is.na(x)[-seq_len(S)]) - length(Delta)
If "xreg" is not NULL, "dx" is required in other part of the code, so it is
convenient to use it here as well; replacing "x" by "dx" and "xreg" by "dxreg":
isna <- is.na(dx) | apply(dxreg, 1L, anyNA)
n.used <- sum(!isna)
Are there any reasons for counting the number of available observations in the
original series rather than in the differenced series? Should this be fixed?
Thanks
Javier López-de-Lacalle
https://jalobe.com
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.