On 23/02/05 I suggested that given R had included TOMS 708 to correct for t= he=20 poor performance of pbeta, TOMS 654 should be included to fix all the pgamm= a=20 problems. I was slightly surprised to find Morten's code had been included= =20 instead 2 days later. I noticed but did not worry that the reference to me = had=20 been removed.=20
The derivation of the asymptotic expansion for the gamma distribution used = by=20 Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm= =20 It is fairly easy to understand and find error bounds for and hence include= =20 sensibly in an algorithm to calculate pgamma. The basis and accuracy of the some of the algorithms I use is discussed in= =20 http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute = error=20 in the log of the probability gives a good indication of the accuracy of yo= ur=20 answer. In the least extreme example you consider=20 (pgamma(0.9*1e25,1e25,log=3DT)the absolute error would be about 5360515 and= if you exponentiated the result=20 the relative error would be about 10 to the power 2328042. So the answer yo= u=20 wish to calculate is K times 10 to the power -2.32804237034994E+22, where K= is=20 somewhere between 10 to the power plus or minus 2328042. In other words whe= n=20 you make the changes to correct this problem, your calculation will still= =20 return values with no real meaning but at least users might be aware of thi= s which=20 would be no bad thing! For me this answer is possibly so meaningless that N= an=20 would be preferable. I did mention to Morten that I had updated my code but I believed that for= =20 Gnumeric he was quite satisfied with what he had. If you look at the VBA co= de at=20 http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I= =20 made to stop the overflow problems you seem to be worried about. My code fo= r the=20 pdf of the gamma distribution still fails for shape parameters > 2e307 due = to=20 multiplication of the shap parameter by 2pi. The code for dgamma will have = the=20 same problem unless it is hidden by use of an 80 or more bit floating point= =20 processor. The code for the asymptotic expansion for the gamma distribution= =20 seems to be fine for any number, excluding silly ones like Nan and Inf. Ind= eed it=20 takes the difference from the mean as a parameter and if you supply an=20 accurate value you get a sensible answer as mentioned in=20 http://members.aol.com/iandjmsmith/Accuracy.htm I do not share your apparent sense of panic on this matter. I have no=20 problems with error signals like NaNs because it is obvious to the user tha= t things=20 have gone wrong. Inaccurate answers when the user has no reason to expect t= hem=20 are usually far more difficult to spot and in many cases the results are ju= st=20 believed. That for me is a serious problem. I think you will find that the= =20 pgamma code of 2.0.0 did not work for small shape parameters (similar to th= e=20 problems exhibited by pbeta still for small parameters see PR#2894), was=20 inaccurate for large shape parameters (> 1e5) when it resorted to the norma= l=20 approximation and was pretty slow in between. Indeed, the normal approximat= ion was the=20 cause of PR#7307. I don't understand your comments about=20 "pt_ =3D -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) =3D -x * log((lambda-x= )/x) -=20 (lambda-x)=20 and naively assumes that this is small enough to use a power series expansi= on=20 in 1/x with coefficients as powers of pt_. To make matters worse, consider = =E2=80=A6" In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't thin= k=20 it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If = |x|=20 < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses=20 log1p(1+x)-x and for other values it uses a continued fraction which essent= ially=20 evaluates more of the same series used when |x| < .01. Your comments about replacing logspace_add with logspace_sub with simpler= =20 code which works at first sight to be a very sensible improvement. However,= I=20 would be a bit nervous that lnd-lnp could be very large and the exp of it c= ould=20 return infinity. I'm sure this can be accounted for in the code and lnp += =20 log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly. I am not responsible for the code for calculating the logs of probabilities= =20 but I seem to remember that the extremely poor performance of the algorithm= s in=20 R2.0.0 with logged probabilities was one of the reasons Morten became=20 interested in changing the pgamma code (see PR#7307). I have had a quick lo= ok and=20 once the corections mentioned above are made it should be giving nonsense a= nswers=20 with no difficulty. Unfortunately there are still a few examples of sloppy coding and accuracy= =20 errors remaining in R. The non-central distribution functions have horrible 1- cancellation errors= =20 associated with them (see PR#7099) and separate code is required for the tw= o=20 tails of the distributions to get round the problem. The fix for PR#8251 is a kludge and just moves the inaccuracies to examples= =20 with higher non-centrality parameters. pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit=20 implementations. pcauchy works but I believe the pt function is also suppos= ed to work for=20 non integral degrees of freedom so making it work one degree of freedom via= =20 pcauchy is hardly much use. qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the parameters,= =20 qnbinom can be made very slow indeed. I do not think there is anything wron= g with=20 the Cornish-Fisher expansion. It just seems that it is not always very good= =20 for the Negative Binomial distribution. In the example above, the initial= =20 approximation is out by 2e6. A slightly different problem which can cause qnbinom and qbinom to go into= =20 infinite loops is when the q-value is too big. For example=20 qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 app= rox but the code works=20 with values where one of the statements y :=3D y +1 or y =3D y - 1; is exec= uted but=20 does not alter the value of y. df(0,2,2,FALSE) should be 1 not 0 df(0,df,2,FALSE) should be infinity for df < 2 not 0 dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0 Presumably R also has similar problems with the pbeta function. As I recall= =20 the TOMS 708 code was pretty much included without edits and therefore didn= 't=20 calculate logs of probabilities except by calculating the probability and t= hen=20 logging it. I assumed this was why it was not used for small shape paramete= rs=20 where the current code does not work, although it did not seem logical to m= e.=20 Of course, my memory is not what it was but if that is the case and there a= re=20 problems with modifying the TOMS code, you could try an asymptotic expansio= n=20 based on http://members.aol.com/iandjmsmith/BinomialApprox.htm This response has been very rushed. I do not write well when I have plenty = of=20 time and I felt I had so many different things to say so I apologise if it = is=20 all a bit of a jumble.=20 Ian Smith [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel