I have finished working on a revised test suite for [statistics]. It is ready to merge into master. I will do this when it has passed the CI checks [1].
The test suite uses properties files containing reference values for a parameterisation of a distribution. Ideally all distributions should be tested using these files as this ensures the distribution is run through all the standard tests. Some old legacy tests for issues discovered in Commons Math have been preserved. I updated most test data to higher accuracy and now most of the test suite passes using a *relative* tolerance of 1e-12. Using a relative tolerance is helpful for assessing the inverse CDF as the output value x may not lie in the interval [0, 1]. For testing probabilities an absolute tolerance is OK, but if the probabilities span a large range of magnitudes then the loose tolerance needed for the large values will not detect errors in the small values. Absolute accuracy is supported by JUnit assertions, but relative accuracy is not. For relative accuracy I created a wrapper interface (DoubleTolerance) that uses Commons Numbers Precision to compute equivalence. This can use absolute, relative or ULP distance tolerances, and combinations of them with And and Or operations (a short sketch of the idea is shown below).

I fixed a lot of issues that were missed due to the sparsity of the previous test suite. I found additional bugs in distribution functions that were previously untested, and domain issues where integer computations overflow to negative values (an example of this class of bug is also sketched below). There are so many fixes that it is better to merge this into master and work on specific outstanding issues.
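For illustration, here is a minimal sketch of how such a tolerance predicate can be composed on top of Commons Numbers Precision. The interface shape and method names (other than the Precision calls) are illustrative only and are not necessarily what is in the PR:

import org.apache.commons.numbers.core.Precision;

/** Illustrative tolerance predicate for two doubles (not the actual test API). */
interface DoubleTolerance {
    boolean test(double a, double b);

    /** Both tolerances must pass. */
    default DoubleTolerance and(DoubleTolerance other) {
        return (a, b) -> test(a, b) && other.test(a, b);
    }

    /** Either tolerance may pass. */
    default DoubleTolerance or(DoubleTolerance other) {
        return (a, b) -> test(a, b) || other.test(a, b);
    }

    static DoubleTolerance absolute(double eps) {
        return (a, b) -> Precision.equals(a, b, eps);
    }

    static DoubleTolerance relative(double eps) {
        return (a, b) -> Precision.equalsWithRelativeTolerance(a, b, eps);
    }

    static DoubleTolerance ulps(int maxUlps) {
        return (a, b) -> Precision.equals(a, b, maxUlps);
    }
}

// Example: accept values within 1e-12 relative error, or within 1 ULP for values near zero.
// DoubleTolerance tol = DoubleTolerance.relative(1e-12).or(DoubleTolerance.ulps(1));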
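And a trivial illustration of the integer overflow class of bug mentioned above (made-up values, not actual Statistics code):

// A support bound computed as an int product silently wraps to a negative value.
int n = 50000;
int bound = n * n;             // -1794967296 (wrapped)
long safeBound = (long) n * n; // 2500000000 (computed in long)
System.out.println(bound + " vs " + safeBound);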
I will create tickets for the following unresolved issues that may need to be addressed before a release:

1. The PoissonDistribution can be parameterised with any mean, yet the PoissonSampler in RNG excludes a mean above 2^30. This restriction was made to avoid distribution truncation because the algorithm uses (int) floor(mean) (see [5]). Commons RNG now has a LongSampler interface and it may be possible to create a PoissonSampler that outputs a long using the same algorithm, avoiding truncation to int values. The final PoissonDistribution is still bounded by int values but the sampler can be less restrictive if it uses long. At large means the Poisson sampler mainly uses a Gaussian sampler so runtime performance should not be impacted (an illustrative sketch is included in the P.S. at the end of this mail).

2. Many distributions only compute the inverse CDF to 1e-9 relative error. Some fail to compute when the input p value maps to a very small x value (e.g. 1e-50). This is related to the defaults for the Brent solver used in the default implementation. I tried to adjust the defaults but found a bug in the Brent solver, which I have fixed [6]. The fix requires a Commons Numbers release before it can be used, so the work can use a local copy of the solver until all issues with it have been resolved and the changes ported back to Numbers.

3. The Normal distribution uses InverseErfc for the inverse CDF. This is a wrapper around InverseErf and as such it is limited in precision as p -> 0 since the method computes using (2p - 1). When p is below 1e-16 the inverse CDF returns -infinity. The smallest z that can be computed is around -8.2, as shown by MATLAB: norminv(1e-16) = -8.222082216130437. SciPy, R, MATLAB, Octave and Mathematica all have methods to invert small p; for example MATLAB gives norminv(5e-324) = -38.472346342768788, which is accurate down to the minimum value for a 64-bit float. There are two free libraries that seem to be used for this: Boost [2] and Cephes (function ndtri) [3]. The Boost licence is compatible with Apache and there are some Boost derived works already in Commons Numbers. Incorporating the Boost Erf functions into Numbers would be useful. The inability to invert the Erfc for all p affects the Normal, Truncated normal and Levy distributions.

4. The Poisson distribution currently allows user configured tolerances for computing the Gamma functions for the CDF. These may not be necessary and this should be investigated. The version of the Gamma function used by the Poisson distribution may not be the same as the one that required these tolerances in Commons Math 2 (use of the tolerances goes as far back as the commit history in the git repo).

5. The Gamma distribution density is not accurate when shape < 1. There are extensive tests related to MATH-753 [7] for the accuracy of the Gamma distribution, but these test shape >= 1. The work requires extension to smaller shapes, where the density -> infinity as x -> 0; when shape >= 1 the density -> 0 as x -> 0 (see [4]). This results in inaccuracy of the Chi-squared distribution, which uses Gamma(df/2, 2), when the degrees of freedom (df) are < 2. I have read the code and the documentation for the Boost functions the code is based on, and I think the switch to the alternative computation to avoid overflow is currently not working for all cases.

Regards,

Alex

[1] https://github.com/apache/commons-statistics/pull/31
[2] Boost error function inverses: https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_erf/error_inv.html
[3] Cephes (ndtri in the probability functions): https://www.netlib.org/cephes/
[4] https://en.wikipedia.org/wiki/Gamma_distribution
[5] https://issues.apache.org/jira/browse/RNG-52
[6] https://issues.apache.org/jira/browse/NUMBERS-168
[7] https://issues.apache.org/jira/browse/MATH-753
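P.S. Regarding item 1: a very rough sketch of the shape a long-valued sampler could take using the Commons RNG LongSampler interface. It uses the normal approximation purely for illustration; the actual proposal is to port the existing large-mean Poisson algorithm to output long. The class name and the use of the 1.4 sampling API are my assumptions, not part of the PR:

import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.sampling.distribution.ContinuousSampler;
import org.apache.commons.rng.sampling.distribution.GaussianSampler;
import org.apache.commons.rng.sampling.distribution.LongSampler;
import org.apache.commons.rng.sampling.distribution.ZigguratSampler;

/** Illustrative long-valued Poisson sampler for very large means (normal approximation only). */
final class LargeMeanLongPoissonSampler implements LongSampler {
    private final ContinuousSampler gaussian;

    LargeMeanLongPoissonSampler(UniformRandomProvider rng, double mean) {
        // For a very large mean, Poisson(mean) is close to N(mean, sqrt(mean)).
        gaussian = GaussianSampler.of(ZigguratSampler.NormalizedGaussian.of(rng), mean, Math.sqrt(mean));
    }

    @Override
    public long sample() {
        // Round to the nearest long and clip at zero; no truncation to int.
        return Math.max(0, Math.round(gaussian.sample()));
    }
}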