[Rd] PR#8528

IandJMSmith Sat, 28 Jan 2006 00:44:41 -0800

On 23/02/05 I suggested that given R had included TOMS 708 to correct for t=
he=20
poor performance of pbeta, TOMS 654 should be included to fix all the pgamm=
a=20
problems. I was slightly surprised to find Morten's code had been included=
=20
instead 2 days later. I noticed but did not worry that the reference to me =
had=20
been removed.=20


The derivation of the asymptotic expansion for the gamma distribution used =
by=20
Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm=
=20
It is fairly easy to understand and find error bounds for and hence include=
=20
sensibly in an algorithm to calculate pgamma.

The basis and accuracy of the some of the algorithms I use is discussed in=
=20
http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute =
error=20
in the log of the probability gives a good indication of the accuracy of yo=
ur=20
answer. In the least extreme example you consider=20
(pgamma(0.9*1e25,1e25,log=3DT)the absolute error would be about 5360515 and=
 if you exponentiated the result=20
the relative error would be about 10 to the power 2328042. So the answer yo=
u=20
wish to calculate is K times 10 to the power -2.32804237034994E+22, where K=
 is=20
somewhere between 10 to the power plus or minus 2328042. In other words whe=
n=20
you make the changes to correct this problem, your calculation will still=
=20
return values with no real meaning but at least users might be aware of thi=
s which=20
would be no bad thing! For me this answer is possibly so meaningless that N=
an=20
would be preferable.

I did mention to Morten that I had updated my code but I believed that for=
=20
Gnumeric he was quite satisfied with what he had. If you look at the VBA co=
de at=20
http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I=
=20
made to stop the overflow problems you seem to be worried about. My code fo=
r the=20
pdf of the gamma distribution still fails for shape parameters > 2e307 due =
to=20
multiplication of the shap parameter by 2pi. The code for dgamma will have =
the=20
same problem unless it is hidden by use of an 80 or more bit floating point=
=20
processor. The code for the asymptotic expansion for the gamma distribution=
=20
seems to be fine for any number, excluding silly ones like Nan and Inf. Ind=
eed it=20
takes the difference from the mean as a parameter and if you supply an=20
accurate value you get a sensible answer as mentioned in=20
http://members.aol.com/iandjmsmith/Accuracy.htm

I do not share your apparent sense of panic on this matter. I have no=20
problems with error signals like NaNs because it is obvious to the user tha=
t things=20
have gone wrong. Inaccurate answers when the user has no reason to expect t=
hem=20
are usually far more difficult to spot and in many cases the results are ju=
st=20
believed. That for me is a serious problem. I think you will find that the=
=20
pgamma code of 2.0.0 did not work for small shape parameters (similar to th=
e=20
problems exhibited by pbeta still for small parameters see PR#2894), was=20
inaccurate for large shape parameters (> 1e5) when it resorted to the norma=
l=20
approximation and was pretty slow in between. Indeed, the normal approximat=
ion was the=20
cause of PR#7307.


I don't understand your comments about=20
"pt_ =3D -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) =3D -x * log((lambda-x=
)/x) -=20
(lambda-x)=20
and naively assumes that this is small enough to use a power series expansi=
on=20
in 1/x with coefficients as powers of pt_. To make matters worse, consider =
=E2=80=A6"
In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't thin=
k=20
it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If =
|x|=20
< .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses=20
log1p(1+x)-x and for other values it uses a continued fraction which essent=
ially=20
evaluates more of the same series used when |x| < .01.

Your comments about replacing logspace_add with logspace_sub with simpler=
=20
code which works at first sight to be a very sensible improvement. However,=
 I=20
would be a bit nervous that lnd-lnp could be very large and the exp of it c=
ould=20
return infinity. I'm sure this can be accounted for in the code and lnp +=
=20
log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly.

I am not responsible for the code for calculating the logs of probabilities=
=20
but I seem to remember that the extremely poor performance of the algorithm=
s in=20
R2.0.0 with logged probabilities was one of the reasons Morten became=20
interested in changing the pgamma code (see PR#7307). I have had a quick lo=
ok and=20
once the corections mentioned above are made it should be giving nonsense a=
nswers=20
with no difficulty.


Unfortunately there are still a few examples of sloppy coding and accuracy=
=20
errors remaining in R.

The non-central distribution functions have horrible 1- cancellation errors=
=20
associated with them (see PR#7099) and separate code is required for the tw=
o=20
tails of the distributions to get round the problem.

The fix for PR#8251 is a kludge and just moves the inaccuracies to examples=
=20
with higher non-centrality parameters.

pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit=20
implementations. pcauchy works but I believe the pt function is also suppos=
ed to work for=20
non integral degrees of freedom so making it work one degree of freedom via=
=20
pcauchy is hardly much use.

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the parameters,=
=20
qnbinom can be made very slow indeed. I do not think there is anything wron=
g with=20
the Cornish-Fisher expansion. It just seems that it is not always very good=
=20
for the Negative Binomial distribution. In the example above, the initial=
=20
approximation is out by 2e6.

A slightly different problem which can cause qnbinom and qbinom to go into=
=20
infinite loops is when the q-value is too big. For example=20
qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 app=
rox but the code works=20
with values where one of the statements y :=3D y +1 or y =3D y - 1; is exec=
uted but=20
does not alter the value of y.

df(0,2,2,FALSE) should be 1 not 0
df(0,df,2,FALSE) should be infinity for df < 2 not 0
dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0

Presumably R also has similar problems with the pbeta function. As I recall=
=20
the TOMS 708 code was pretty much included without edits and therefore didn=
't=20
calculate logs of probabilities except by calculating the probability and t=
hen=20
logging it. I assumed this was why it was not used for small shape paramete=
rs=20
where the current code does not work, although it did not seem logical to m=
e.=20
Of course, my memory is not what it was but if that is the case and there a=
re=20
problems with modifying the TOMS code, you could try an asymptotic expansio=
n=20
based on http://members.aol.com/iandjmsmith/BinomialApprox.htm

This response has been very rushed. I do not write well when I have plenty =
of=20
time and I felt I had so many different things to say so I apologise if it =
is=20
all a bit of a jumble.=20

Ian Smith


        [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] PR#8528

Reply via email to