[R] Number of used observations in stats::arima with NAs

Javier López-de-Lacalle via R-help Wed, 02 Nov 2016 14:51:08 -0700

When missing observations (NAs) are present in the data and the regular or 
differencing filters are applied in 
stats::arima, the number of used observations (nobs) reported by nobs() is 
bigger than the actual number of non-NA observations. This value is used to 
compute the covariance matrix of parameter estimates and by BIC(), so this is 
relevant for model selection.


Example: the same number of observations are used in both fits below, but 
nobs() reports different values:

    x <- round(log(AirPassengers),2)
    length(x)
    #[1] 144
    x[c(36,52)] <- NA
    # fit ARIMA(0,1,0) to the original series
    fit1 <- stats::arima(x, order=c(0,1,0), include.mean=FALSE, method="ML") 
    nobs(fit1)
    #[1] 141
    # fit ARIMA(0,0,0) to the differenced series
    fit2 <- stats::arima(diff(x), order=c(0,0,0), include.mean=FALSE, 
method="ML") 
    nobs(fit2)

    #[1] 139

In "stats::arima", nobs is obtained as:

    n.used <- sum(!is.na(x)) - length(Delta)

where "length(Delta)" is the number of observations missed due to differencing. 
The original data in "x" are used instead of the differenced data. This ignores 
the fact that a missing observation generates two NAs in the differenced series:
    diff(c(1,2,3,NA,5,6,7,8))

    #[1]  1  1 NA NA  1  1  1

"stats::arima" counts the NAs in the original data, but this does not always 
match the number of NAs in the differenced series, which is the series that is 
used to compute the log-likelihood. Although there is only *one* NA 
observation, the contribution to the log-likelihood of *two* observations will 
be missed due to this NA, hence the number of contributions to the likelihood 
is not "sum(!is.na(x)) - length(Delta)" but "sum(!is.na(diff(x)))".

The more NAs are in the data, the bigger will be the effect of this issue.

If the argument "order[2]=1", "seasonal$order[2]=0" and the NA is the first 
observation, then my comment above is not an issue. It is not either an issue 
when "seasonal$order[2]=1" and the NAs are in the first "seasonal$period" 
observations. But in general, I believe the way nobs is computed requires 
adjusting.
I would propose (reusing code from stats::arima):

    dx <- x
    if(d > 0L) {
      dx <- diff(dx, 1L, order[2L])
    }

    if(seasonal$period > 1L & seasonal$order[2L] > 0) {
      dx <- diff(dx, seasonal$period, seasonal$order[2L])    }
    n.used <- sum(!is.na(dx))

If the argument "xreg" is NULL, "dx" is not needed and the following may be 
more efficient than the above option (I checked it with just a few examples, so 
the above may be a safer option):

    d <- order[2L]
    D <- seasonal$order[2L]

    S <- seasonal$period
    n.used <- length(x) - sum(is.na(x)[seq_len(S)]) -       2*(d + D) * 
sum(is.na(x)[-seq_len(S)]) - length(Delta)

If "xreg" is not NULL, "dx" is required in other part of the code, so it is 
convenient to use it here as well; replacing "x" by "dx" and "xreg" by "dxreg":

    isna <- is.na(dx) | apply(dxreg, 1L, anyNA)
    n.used <- sum(!isna)

Are there any reasons for counting the number of available observations in the 
original series rather than in the differenced series? Should this be fixed?

Thanks

Javier López-de-Lacalle
https://jalobe.com

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Number of used observations in stats::arima with NAs

Reply via email to