Re: [Rd] Fix for bug in arima function

2015-05-21 Thread Martin Maechler

> I noticed that the 3.2.1 release cycle is about to start.  Is there any
> chance that this fix will make it into the next version of R?
> 
> This bug is fairly serious: getting the wrong variance estimate leads to
> the wrong log-likelihood and the wrong AIC, BIC etc, which can and does
> lead to suboptimal model selection.  If it's not fixed, this issue will
> affect every student taking our time series course in Fall 2015 (and
> probably lots of other students in other time series courses).  When I
> taught time series in Spring 2015, I had to teach students how to work
> around the bug, which wasted class time and shook student confidence in R.
> It'd be great if we didn’t have to deal with this issue next semester.
> 
> Again, the fix is trivial:
> 
> --- a/src/library/stats/R/arima.R
> +++ b/src/library/stats/R/arima.R
> @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
>  if(fit$rank == 0L) {
>  ## Degenerate model. Proceed anyway so as not to break old code
>  fit <- lm(x ~ xreg - 1, na.action = na.omit)
> +n.used <- sum(!is.na(resid(fit))) - length(Delta)
> +} else {
> +n.used <- sum(!is.na(resid(fit)))
>  }
> -n.used <- sum(!is.na(resid(fit))) - length(Delta)
>  init0 <- c(init0, coef(fit))
>  ses <- summary(fit)$coefficients[, 2L]
>  parscale <- c(parscale, 10 * ses)
> 

Yes, such a change *is* small in the source code.
But we have to be sure about  its desirability.

In another post about this you mention "REML", and I think we
really are discussing if variance estimates should use a
denominator of  'n'  or 'n - p' in this case.


> The patch that introduced the bug (
> https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
> ) was designed to change the initialization for the optimization routine.

> The proposed fix leaves the deliberate part of the patch unchanged (it
> preserves the value of "init0").

I can confirm this... a change introduced in R 3.0.2.

I'm about to commit changes ... after also adding a proper
regression test.

Martin Maechler, ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] The R Foundation announces new mailing list 'R-package-devel'

2015-05-21 Thread Martin Maechler
New Mailing list   *** R-package-devel -- User R Packages Development ***  

At last week's monthly meeting, the R foundation has decided to
create a new mailing list in order to help R package authors in
their package development and testing.

The idea is that some experienced R programmers (often those
currently helping on R-devel or also R-help) will help package
authors and thus unload some of the burden of the CRAN team
members.

We expect impact for R-devel: I'm expecting somewhat less traffic there, 
and the focus returning to implementation of R and future features of R 
itself.

Please read the description of the mailing list here
   https://stat.ethz.ch/mailman/listinfo/r-package-devel
or below, subscribe and start using it!


For the R foundation,
Martin Maechler,  Secretary General


--- "About R-package-devel"  (from above URL): -

This list is to get help about package development in R. The goal of the list 
is to provide a forum for learning about the package development process. We 
hope to build a community of R package developers who can help each other solve 
problems, and reduce some of the burden on the CRAN maintainers. If you are 
having problems developing a package or passing R CMD check, this is the place 
to ask!

Please note that while R-package-devel contributors will do their best to 
provide you accurate and authoritative information, the final arbiters of CRAN 
submission is the CRAN team.

Please keep it civil. It's easy to get frustrated when building a package, 
or when answering the same question for what feels like the thousandth time. 
But everyone involved in the process is a volunteer.
Include a reproducible example. We can't help if we don't know what the 
problem is. For packages, if possible, include a link to the package source. If 
you're having a problem with R CMD check, include the relevant message inline.
If you're in violation of this code, one of the moderators will send you a 
gentle admonishment off-list.
For more about such "Netiquette", read the Debian code of conduct.

Note that there may be some overlap of topics with the R-devel mailing list 
notably as before the existence of R-package-devel, many package developers 
have used R-devel for questions that are now meant to be asked on this list. 
Beware that cross-posting, i.e., posting to both, is generally considered as 
impolite — with rare exceptions, e.g., if a thread is being moved from one list 
to the other for good reasons.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Fix for bug in arima function

2015-05-21 Thread peter dalgaard

On 21 May 2015, at 10:35 , Martin Maechler  
wrote:

>> 
>> I noticed that the 3.2.1 release cycle is about to start.  Is there any
>> chance that this fix will make it into the next version of R?
>> 
>> This bug is fairly serious: getting the wrong variance estimate leads to
>> the wrong log-likelihood and the wrong AIC, BIC etc, which can and does
>> lead to suboptimal model selection.  If it's not fixed, this issue will
>> affect every student taking our time series course in Fall 2015 (and
>> probably lots of other students in other time series courses).  When I
>> taught time series in Spring 2015, I had to teach students how to work
>> around the bug, which wasted class time and shook student confidence in R.
>> It'd be great if we didn’t have to deal with this issue next semester.
>> 
>> Again, the fix is trivial:
>> 
>> --- a/src/library/stats/R/arima.R
>> +++ b/src/library/stats/R/arima.R
>> @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
>> if(fit$rank == 0L) {
>> ## Degenerate model. Proceed anyway so as not to break old code
>> fit <- lm(x ~ xreg - 1, na.action = na.omit)
>> +n.used <- sum(!is.na(resid(fit))) - length(Delta)
>> +} else {
>> +n.used <- sum(!is.na(resid(fit)))
>> }
>> -n.used <- sum(!is.na(resid(fit))) - length(Delta)
>> init0 <- c(init0, coef(fit))
>> ses <- summary(fit)$coefficients[, 2L]
>> parscale <- c(parscale, 10 * ses)
>> 
> 
> Yes, such a change *is* small in the source code.
> But we have to be sure about  its desirability.
> 
> In another post about this you mention "REML", and I think we
> really are discussing if variance estimates should use a
> denominator of  'n'  or 'n - p' in this case.
> 
> 
>> The patch that introduced the bug (
>> https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
>> ) was designed to change the initialization for the optimization routine.
> 
>> The proposed fix leaves the deliberate part of the patch unchanged (it
>> preserves the value of "init0").
> 
> I can confirm this... a change introduced in R 3.0.2.
> 
> I'm about to commit changes ... after also adding a proper
> regression test.
> 

Be careful here! I was just about to say that the diagnosis is dubious, and 
that the patch could very well be wrong!!

AFAICT, the issue is that n.used got changed from being based on lm(x~...) to 
lm(dx~...) where dx is the differenced series. Now that surely loses one 
observation in arima(.,1,.), most likely unintentionally, but it is not at all 
clear that the fix is not to subtract length(Delta) -- that code has been there 
long before the changes in 3.0.2.

I'd expect that a safer fix would be to add back the orders of the the two 
differencing operations.

> Martin Maechler, ETH Zurich

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Fix for bug in arima function

2015-05-21 Thread Martin Maechler
> peter dalgaard 
> on Thu, 21 May 2015 11:03:05 +0200 writes:

> On 21 May 2015, at 10:35 , Martin Maechler 
 wrote:

>>> 
>>> I noticed that the 3.2.1 release cycle is about to start.  Is there any
>>> chance that this fix will make it into the next version of R?
>>> 
>>> This bug is fairly serious: getting the wrong variance estimate leads to
>>> the wrong log-likelihood and the wrong AIC, BIC etc, which can and does
>>> lead to suboptimal model selection.  If it's not fixed, this issue will
>>> affect every student taking our time series course in Fall 2015 (and
>>> probably lots of other students in other time series courses).  When I
>>> taught time series in Spring 2015, I had to teach students how to work
>>> around the bug, which wasted class time and shook student confidence in 
R.
>>> It'd be great if we didn’t have to deal with this issue next semester.
>>> 
>>> Again, the fix is trivial:
>>> 
>>> --- a/src/library/stats/R/arima.R
>>> +++ b/src/library/stats/R/arima.R
>>> @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
>>> if(fit$rank == 0L) {
>>> ## Degenerate model. Proceed anyway so as not to break old code
>>> fit <- lm(x ~ xreg - 1, na.action = na.omit)
>>> +n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>> +} else {
>>> +n.used <- sum(!is.na(resid(fit)))
>>> }
>>> -n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>> init0 <- c(init0, coef(fit))
>>> ses <- summary(fit)$coefficients[, 2L]
>>> parscale <- c(parscale, 10 * ses)
>>> 
>> 
>> Yes, such a change *is* small in the source code.
>> But we have to be sure about  its desirability.
>> 
>> In another post about this you mention "REML", and I think we
>> really are discussing if variance estimates should use a
>> denominator of  'n'  or 'n - p' in this case.
>> 
>> 
>>> The patch that introduced the bug (
>>> 
https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
>>> ) was designed to change the initialization for the optimization 
routine.
>> 
>>> The proposed fix leaves the deliberate part of the patch unchanged (it
>>> preserves the value of "init0").
>> 
>> I can confirm this... a change introduced in R 3.0.2.
>> 
>> I'm about to commit changes ... after also adding a proper
>> regression test.
>> 

> Be careful here! I was just about to say that the diagnosis is dubious, 
and that the patch could very well be wrong!!

> AFAICT, the issue is that n.used got changed from being based on 
lm(x~...) to lm(dx~...) where dx is the differenced series. Now that surely 
loses one observation in arima(.,1,.), most likely unintentionally, but it is 
not at all clear that the fix is not to subtract length(Delta) -- that code has 
been there long before the changes in 3.0.2.

well... yes,  but as you say for the case of the original lm()
fit where the resulting residuals and hence is.na(resid(.)) have
been longer

> I'd expect that a safer fix would be to add back the orders of the the 
two differencing operations.

What I did check before replying is that the patch *does* revert to 'R <= 3.0.1'
behavior for simple 'xreg' cases.  

I do see changes in the S.Es of the regression coefficients, as
they are expected. 

The few cases I've looked at where all giving results compatible
with R <= 3.0.1 (or the bug triggered which was fixed in R 3.0.2),
but I am happy for other examples where the
degrees of freedom should be computed differently, e.g., by
taking account the differencing orders as you suggest.

Seeing how relatively easy it still is to get the internal call
to optim() to produce an error, I do wonder if there are such
extensively tested arima(*, xreg = .) examples.

If we do not get more suggestions here, I'd like to commit to
R-devel only.   This would still not mean that this is going to
be in R 3.2.1 ... though it would be nice if others confirmed or
helped with more references.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Fix for bug in arima function

2015-05-21 Thread peter dalgaard

On 21 May 2015, at 12:49 , Martin Maechler  
wrote:

>> peter dalgaard 
>>on Thu, 21 May 2015 11:03:05 +0200 writes:
> 
>> On 21 May 2015, at 10:35 , Martin Maechler 
>>  wrote:
> 
 
 I noticed that the 3.2.1 release cycle is about to start.  Is there any
 chance that this fix will make it into the next version of R?
 
 This bug is fairly serious: getting the wrong variance estimate leads to
 the wrong log-likelihood and the wrong AIC, BIC etc, which can and does
 lead to suboptimal model selection.  If it's not fixed, this issue will
 affect every student taking our time series course in Fall 2015 (and
 probably lots of other students in other time series courses).  When I
 taught time series in Spring 2015, I had to teach students how to work
 around the bug, which wasted class time and shook student confidence in R.
 It'd be great if we didn’t have to deal with this issue next semester.
 
 Again, the fix is trivial:
 
 --- a/src/library/stats/R/arima.R
 +++ b/src/library/stats/R/arima.R
 @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
 if(fit$rank == 0L) {
 ## Degenerate model. Proceed anyway so as not to break old code
 fit <- lm(x ~ xreg - 1, na.action = na.omit)
 +n.used <- sum(!is.na(resid(fit))) - length(Delta)
 +} else {
 +n.used <- sum(!is.na(resid(fit)))
 }
 -n.used <- sum(!is.na(resid(fit))) - length(Delta)
 init0 <- c(init0, coef(fit))
 ses <- summary(fit)$coefficients[, 2L]
 parscale <- c(parscale, 10 * ses)
 
>>> 
>>> Yes, such a change *is* small in the source code.
>>> But we have to be sure about  its desirability.
>>> 
>>> In another post about this you mention "REML", and I think we
>>> really are discussing if variance estimates should use a
>>> denominator of  'n'  or 'n - p' in this case.
>>> 
>>> 
 The patch that introduced the bug (
 https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
 ) was designed to change the initialization for the optimization routine.
>>> 
 The proposed fix leaves the deliberate part of the patch unchanged (it
 preserves the value of "init0").
>>> 
>>> I can confirm this... a change introduced in R 3.0.2.
>>> 
>>> I'm about to commit changes ... after also adding a proper
>>> regression test.
>>> 
> 
>> Be careful here! I was just about to say that the diagnosis is dubious, and 
>> that the patch could very well be wrong!!
> 
>> AFAICT, the issue is that n.used got changed from being based on lm(x~...) 
>> to lm(dx~...) where dx is the differenced series. Now that surely loses one 
>> observation in arima(.,1,.), most likely unintentionally, but it is not at 
>> all clear that the fix is not to subtract length(Delta) -- that code has 
>> been there long before the changes in 3.0.2.
> 
> well... yes,  but as you say for the case of the original lm()
> fit where the resulting residuals and hence is.na(resid(.)) have
> been longer
> 
>> I'd expect that a safer fix would be to add back the orders of the the two 
>> differencing operations.
> 
> What I did check before replying is that the patch *does* revert to 'R <= 
> 3.0.1'
> behavior for simple 'xreg' cases.  
> 
> I do see changes in the S.Es of the regression coefficients, as
> they are expected. 
> 
> The few cases I've looked at where all giving results compatible
> with R <= 3.0.1 (or the bug triggered which was fixed in R 3.0.2),
> but I am happy for other examples where the
> degrees of freedom should be computed differently, e.g., by
> taking account the differencing orders as you suggest.
> 
> Seeing how relatively easy it still is to get the internal call
> to optim() to produce an error, I do wonder if there are such
> extensively tested arima(*, xreg = .) examples.
> 
> If we do not get more suggestions here, I'd like to commit to
> R-devel only.   This would still not mean that this is going to
> be in R 3.2.1 ... though it would be nice if others confirmed or
> helped with more references.


Hmm: Delta comes from the following computation at the start of arima()

"%+%" <- function(a, b) .Call(C_TSconv, a, b)

Delta <- 1.
for(i in seq_len(order[2L])) Delta <- Delta %+% c(1., -1.)
for(i in seq_len(seasonal$order[2L]))
Delta <- Delta %+% c(1, rep.int(0, seasonal$period-1), -1)
Delta <- - Delta[-1L]
nd <- order[2L] + seasonal$order[2L]
n.used <- sum(!is.na(x)) - length(Delta)

and C_TSconv is defined in C code to have a result of length(a)+length(b) -1 

So length(Delta) increases by 1 for each order of ordinary differencing and by 
the number of periods for each seasonal differencing. 

So length(Delta) really is the number of observations that get "lost" by 
differencing. That makes sense, at least in the complete-data case. I still 
worry that it could get things wrong if there are NAs in the middle of the 
series, sin

Re: [Rd] Fix for bug in arima function

2015-05-21 Thread Patrick Perry
Thanks for your help, Martin and Peter.

I tried taking the value of “n.used” from the C functions (ARIMA_CSS and 
ARIMA_Like) instead of computing n.used on the R side.  Here is a patch, in 
case you’re interested:

https://github.com/patperry/r-source/commit/8fed79a6d2d558ef34738624a2a4f9e795bcf8b9

I don't recommend applying this new patch without further follow-up.  The patch 
highlights some strange behavior in the ARIMA_Like function, which sometimes 
gives missing values the same weights as observations, resulting in values of 
n.used that are too high.

See the commit message for more details.


Patrick


On Thursday, May 21, 2015 at 8:36 AM, peter dalgaard wrote:

>  
> On 21 May 2015, at 12:49 , Martin Maechler  (mailto:maech...@lynne.stat.math.ethz.ch)> wrote:
>  
> > > > > > > peter dalgaard mailto:pda...@gmail.com)>
> > > > > > > on Thu, 21 May 2015 11:03:05 +0200 writes:
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> >  
> >  
> > > On 21 May 2015, at 10:35 , Martin Maechler 
> > >  > > (mailto:maech...@lynne.stat.math.ethz.ch)> wrote:
> >  
> > > > >  
> > > > > I noticed that the 3.2.1 release cycle is about to start. Is there any
> > > > > chance that this fix will make it into the next version of R?
> > > > >  
> > > > > This bug is fairly serious: getting the wrong variance estimate leads 
> > > > > to
> > > > > the wrong log-likelihood and the wrong AIC, BIC etc, which can and 
> > > > > does
> > > > > lead to suboptimal model selection. If it's not fixed, this issue will
> > > > > affect every student taking our time series course in Fall 2015 (and
> > > > > probably lots of other students in other time series courses). When I
> > > > > taught time series in Spring 2015, I had to teach students how to work
> > > > > around the bug, which wasted class time and shook student confidence 
> > > > > in R.
> > > > > It'd be great if we didn’t have to deal with this issue next semester.
> > > > >  
> > > > > Again, the fix is trivial:
> > > > >  
> > > > > --- a/src/library/stats/R/arima.R
> > > > > +++ b/src/library/stats/R/arima.R
> > > > > @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
> > > > > if(fit$rank == 0L) {
> > > > > ## Degenerate model. Proceed anyway so as not to break old code
> > > > > fit <- lm(x ~ xreg - 1, na.action = na.omit)
> > > > > + n.used <- sum(!is.na(resid(fit))) - length(Delta)
> > > > > + } else {
> > > > > + n.used <- sum(!is.na(resid(fit)))
> > > > > }
> > > > > - n.used <- sum(!is.na(resid(fit))) - length(Delta)
> > > > > init0 <- c(init0, coef(fit))
> > > > > ses <- summary(fit)$coefficients[, 2L]
> > > > > parscale <- c(parscale, 10 * ses)
> > > > >  
> > > >  
> > > >  
> > > > Yes, such a change *is* small in the source code.
> > > > But we have to be sure about its desirability.
> > > >  
> > > > In another post about this you mention "REML", and I think we
> > > > really are discussing if variance estimates should use a
> > > > denominator of 'n' or 'n - p' in this case.
> > > >  
> > > >  
> > > > > The patch that introduced the bug (
> > > > > https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
> > > > > ) was designed to change the initialization for the optimization 
> > > > > routine.
> > > > >  
> > > >  
> > > >  
> > > > > The proposed fix leaves the deliberate part of the patch unchanged (it
> > > > > preserves the value of "init0").
> > > > >  
> > > >  
> > > >  
> > > > I can confirm this... a change introduced in R 3.0.2.
> > > >  
> > > > I'm about to commit changes ... after also adding a proper
> > > > regression test.
> > > >  
> > >  
> >  
> >  
> > > Be careful here! I was just about to say that the diagnosis is dubious, 
> > > and that the patch could very well be wrong!!
> >  
> > > AFAICT, the issue is that n.used got changed from being based on 
> > > lm(x~...) to lm(dx~...) where dx is the differenced series. Now that 
> > > surely loses one observation in arima(.,1,.), most likely 
> > > unintentionally, but it is not at all clear that the fix is not to 
> > > subtract length(Delta) -- that code has been there long before the 
> > > changes in 3.0.2.
> >  
> > well... yes, but as you say for the case of the original lm()
> > fit where the resulting residuals and hence is.na(resid(.)) have
> > been longer
> >  
> > > I'd expect that a safer fix would be to add back the orders of the the 
> > > two differencing operations.
> >  
> > What I did check before replying is that the patch *does* revert to 'R <= 
> > 3.0.1'
> > behavior for simple 'xreg' cases.  
> >  
> > I do see changes in the S.Es of the regression coefficients, as
> > they are expected.  
> >  
> > The few cases I've looked at where all giving results compatible
> > with R <= 3.0.1 (or the bug triggered which was fixed in R 3.0.2),
> > but I am happy for other examples where the
> > degrees of freedom should be computed differently, e.g., by
> > taking account the differencing orders as you suggest.
>