I have finished working on a revised test suite for [statistics]. It is ready to merge into master. I will do this when it has passed the CI checks [1].
The test suite uses properties files containing reference values for a parameterisation of a distribution. Ideally all distributions should be tested using these files as this ensures the distribution is run through all the standard tests. Some old legacy tests for issues discovered in Commons Math have been preserved. I updated most test data to higher accuracy and now most of the test suite passes using a *relative* tolerance of 1e-12. Using a relative tolerance is helpful for assessing the inverse CDF as the output value x may not lie in the interval [0, 1]. For testing probabilities an absolute tolerance is OK, but if the probabilities span a large range of magnitudes then the loose tolerance needed for the large values will not detect errors in the small values. Absolute accuracy is supported by JUnit assertions, but relative accuracy is not. For relative accuracy I created a wrapper interface (DoubleTolerance) that uses Commons Numbers Precision to compute equivalence. This can use absolute, relative or ULP distance tolerances, and combinations of them with And and Or operations (a short sketch of the idea is shown below).

I fixed a lot of issues that were missed due to the sparsity of the previous test suite. I found additional bugs in distribution functions that were previously untested, and domain issues where integer computations overflow to negative values (an example of this class of bug is also sketched below). There are so many fixes that it is better to merge this into master and work on specific outstanding issues.
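For illustration, here is a minimal sketch of how such a tolerance predicate can be composed on top of Commons Numbers Precision. The interface shape and method names (other than the Precision calls) are illustrative only and are not necessarily what is in the PR:

import org.apache.commons.numbers.core.Precision;

/** Illustrative tolerance predicate for two doubles (not the actual test API). */
interface DoubleTolerance {
    boolean test(double a, double b);

    /** Both tolerances must pass. */
    default DoubleTolerance and(DoubleTolerance other) {
        return (a, b) -> test(a, b) && other.test(a, b);
    }

    /** Either tolerance may pass. */
    default DoubleTolerance or(DoubleTolerance other) {
        return (a, b) -> test(a, b) || other.test(a, b);
    }

    static DoubleTolerance absolute(double eps) {
        return (a, b) -> Precision.equals(a, b, eps);
    }

    static DoubleTolerance relative(double eps) {
        return (a, b) -> Precision.equalsWithRelativeTolerance(a, b, eps);
    }

    static DoubleTolerance ulps(int maxUlps) {
        return (a, b) -> Precision.equals(a, b, maxUlps);
    }
}

// Example: accept values within 1e-12 relative error, or within 1 ULP for values near zero.
// DoubleTolerance tol = DoubleTolerance.relative(1e-12).or(DoubleTolerance.ulps(1));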
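And a trivial illustration of the integer overflow class of bug mentioned above (made-up values, not actual Statistics code):

// A support bound computed as an int product silently wraps to a negative value.
int n = 50000;
int bound = n * n;             // -1794967296 (wrapped)
long safeBound = (long) n * n; // 2500000000 (computed in long)
System.out.println(bound + " vs " + safeBound);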
I will create tickets for the following unresolved issues that may need to be addressed before a release:

1. The PoissonDistribution can be parameterised with any mean, yet the PoissonSampler in RNG excludes a mean above 2^30. This restriction was made to avoid distribution truncation because the algorithm uses (int) floor(mean) (see [5]). Commons RNG now has a LongSampler interface and it may be possible to create a PoissonSampler that outputs a long using the same algorithm, avoiding truncation to int values. The final PoissonDistribution is still bounded by int values but the sampler can be less restrictive if it uses long. At large means the Poisson sampler mainly uses a Gaussian sampler so runtime performance should not be impacted (an illustrative sketch is included in the P.S. at the end of this mail).

2. Many distributions only compute the inverse CDF to 1e-9 relative error. Some fail to compute when the input p value maps to a very small x value (e.g. 1e-50). This is related to the defaults for the Brent solver used in the default implementation. I tried to adjust the defaults but found a bug in the Brent solver, which I have fixed [6]. The fix requires a Commons Numbers release before it can be used, so the work can use a local copy of the solver until all issues with it have been resolved and the changes ported back to Numbers.

3. The Normal distribution uses InverseErfc for the inverse CDF. This is a wrapper around InverseErf and as such it is limited in precision as p -> 0 since the method computes using (2p - 1). When p is below 1e-16 the inverse CDF returns -infinity. The smallest z that can be computed is around -8.2, as shown by MATLAB: norminv(1e-16) = -8.222082216130437. SciPy, R, MATLAB, Octave and Mathematica all have methods to invert small p; for example MATLAB gives norminv(5e-324) = -38.472346342768788, which is accurate down to the minimum value for a 64-bit float. There are two free libraries that seem to be used for this: Boost [2] and Cephes (function ndtri) [3]. The Boost licence is compatible with Apache and there are some Boost derived works already in Commons Numbers. Incorporating the Boost Erf functions into Numbers would be useful. The inability to invert the Erfc for all p affects the Normal, Truncated normal and Levy distributions.

4. The Poisson distribution currently allows user configured tolerances for computing the Gamma functions for the CDF. These may not be necessary and this should be investigated. The version of the Gamma function used by the Poisson distribution may not be the same as the one that required these tolerances in Commons Math 2 (use of the tolerances goes as far back as the commit history in the git repo).

5. The Gamma distribution density is not accurate when shape < 1. There are extensive tests related to MATH-753 [7] for the accuracy of the Gamma distribution, but these test shape >= 1. The work requires extension to smaller shapes, where the density -> infinity as x -> 0; when shape >= 1 the density -> 0 as x -> 0 (see [4]). This results in inaccuracy of the Chi-squared distribution, which uses Gamma(df/2, 2), when the degrees of freedom (df) are < 2. I have read the code and the documentation for the Boost functions the code is based on, and I think the switch to the alternative computation to avoid overflow is currently not working for all cases.

Regards,

Alex

[1] https://github.com/apache/commons-statistics/pull/31
[2] Boost error function inverses: https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_erf/error_inv.html
[3] Cephes (ndtri in the probability functions): https://www.netlib.org/cephes/
[4] https://en.wikipedia.org/wiki/Gamma_distribution
[5] https://issues.apache.org/jira/browse/RNG-52
[6] https://issues.apache.org/jira/browse/NUMBERS-168
[7] https://issues.apache.org/jira/browse/MATH-753
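P.S. Regarding item 1: a very rough sketch of the shape a long-valued sampler could take using the Commons RNG LongSampler interface. It uses the normal approximation purely for illustration; the actual proposal is to port the existing large-mean Poisson algorithm to output long. The class name and the use of the 1.4 sampling API are my assumptions, not part of the PR:

import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.sampling.distribution.ContinuousSampler;
import org.apache.commons.rng.sampling.distribution.GaussianSampler;
import org.apache.commons.rng.sampling.distribution.LongSampler;
import org.apache.commons.rng.sampling.distribution.ZigguratSampler;

/** Illustrative long-valued Poisson sampler for very large means (normal approximation only). */
final class LargeMeanLongPoissonSampler implements LongSampler {
    private final ContinuousSampler gaussian;

    LargeMeanLongPoissonSampler(UniformRandomProvider rng, double mean) {
        // For a very large mean, Poisson(mean) is close to N(mean, sqrt(mean)).
        gaussian = GaussianSampler.of(ZigguratSampler.NormalizedGaussian.of(rng), mean, Math.sqrt(mean));
    }

    @Override
    public long sample() {
        // Round to the nearest long and clip at zero; no truncation to int.
        return Math.max(0, Math.round(gaussian.sample()));
    }
}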