I have created an initial version of the statistics distributions
application. It can be used with:

cd commons-statistics-examples/examples-distribution
mvn package -Pexamples-distribution
java -jar target/examples-distribution.jar

There are 26 distributions each with a command. Each distribution
command has sub-commands to evaluate functions: pdf, cdf, survival,
icdf, logpdf

I have created this entirely using picocli with sub-classes for each
command and to hold default values. The default parameterisations and
ranges for the distributions were copied from Wikipedia if available.
The evaluated points and the distribution parameters can be changed:

To create a sample of the PDF for the exponential distribution:

    java -jar target/examples-distribution.jar exp pdf
    java -jar target/examples-distribution.jar exp pdf --mean 3.45
    java -jar target/examples-distribution.jar exp pdf \
        --mean 3.45 --min 3 --max 5 --steps 50

To create a sample of the CDF for the Poisson distribution:

    java -jar target/examples-distribution.jar poisson cdf
    java -jar target/examples-distribution.jar poisson cdf --mean 3.45
    java -jar target/examples-distribution.jar poisson cdf \
        --mean 12.34 --min 0 --max 50 --increment 2

To create a sample of the PDF for the Gamma distribution with
different shape and
a fixed scale:

    java -jar target/examples-distribution.jar gamma pdf \
        --shape 0.5,1,2 --scale 1

As a quick verification I added a hidden command named 'check'. It was
added to verify the survival function is complementary to the CDF (so
I did not have to create plots of this function and visually check
it). This currently tests:

cdf + survival = 1.0
icdf(0) = lower bound
icdf(1) = upper bound
pdf = exp(logpdf)
x = icdf(cdf(x))


Any issues are output otherwise the command appears to do nothing.

I checked all the output for PDF and CDF against the figures on
wikipedia. This has revealed some issues.

The following distributions are OK:

Binomial
Gumbel
Hypergeometric
Laplace
Logistic
LogNormal
Normal
Pascal
T
Triangular
TruncatedNormal
UniformContinuous
UniformDiscrete
Zipf


The following have issues:

Beta
---
This is the only distribution parameterised using a reference to
Wolfram Mathworld rather than wikipedia. The parameterisation matches
that on wikipedia so could be updated.

Major:
This distribution can throw an exception:

p(x=0) throws an exception when alpha<1.
p(x=1) throws an exception when beta<1.
This is not the case in matlab which returns the natural limit as +inf, e.g.

betapdf(0, 0.5, 0.5) -> +inf

The CDF is OK.
x=0 when alpha<1: returns 0.
x=1 when beta<1: returns 1.
Thus you can evaluate the CDF but not the PDF which rejects the input point.

Cauchy
---
The median parameter is referred to as the location on Wikipedia and
in R. I suggest updating parameters from median and scale to location
and scale.

ChiSquared
---
The support should be (0, +inf) if k=1 otherwise [0, +inf). This is
not reflected in the lower bound return. This is minor.

The PDF is not correct at x=0.
pdf(x=0) returns 0 when k<2. This should return higher according to
matlab which uses the limit at x -> 0:

chi2pdf(0, 1) -> +inf
chi2pdf(0, 1.99) -> +inf
chi2pdf(0, 2) -> 0.5
chi2pdf(0, 2.01) -> 0
chi2pdf(0, 3) -> 0

Exp
---
The parameter is named the mean. Wikipedia uses either the scale or
1/scale = rate. R uses the rate parameterisation. Matlab uses the mean
(mu). Possible change mean to scale (or label it as a scale
parameter).

F
---
pdf(x=0) is always zero but it should go to infinity when df=1,df=2.
Using a very small x shows the direction:

java -jar target/examples-distribution.jar f pdf -x 0,1e-6
x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0
df1=100.0,df2=100.0
0.00000 0.0 0.0 0.0 0.0 0.0
1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8
1.2304010764181617E-19 2.522031398015255E-264

Matlab:

fpdf(0,1,1) -> +inf
fpdf(0,1.9999,3) ->+inf
fpdf(0,2,3) -> 1
fpdf(0,2.0001,1) -> 0

Gamma
---

pdf(x=0) is zero when shape <= 1. This should be +inf

gampdf(0, [0.5,0.9999,1,1.001], 1)

ans =

   Inf   Inf     1     0

java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k
0.5,0.9999,1,1.001 -t 1 --format %.2f
x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0
shape=1.001,scale=1.0
0.00000 0.00 0.00 0.00 0.00
1.00000e-10 56418.95 1.00 1.00 0.98

Geometric
---

Does not handle the special case p=1:
- pmf and logpmf return NaN for x=1
- icdf(1.0) returns int max value (should be 0)
- upper bound returns int max value

Some occurrences where x != icdf(cdf(x)):

java -jar target/examples-distribution.jar geo check

I do not think this is a problem. It is the only distribution where
the round trip fails. It is due to rounding errors during direct
inversion of the computation for the CDF.

Levy
---

support is [mu, +inf)
The distribution outputs NaN for the PDF at mu. It should output p=0
at x=mu. This is due to division by zero.
This function is not in matlab or R. I checked the formula from
wikipedia and it is implemented correctly.


Nakagami
---

Uses omega scale parameter.
Wikipedia uses omega spread parameter.
Matlab uses omega scale parameter.
Check other implementations for scale vs spread.

Pareto
---

When alpha (shape) is large then pdf(x=1) == large.
When alpha is infinite this becomes nan (all other x are nan as well).

java -jar target/examples-distribution.jar pareto pdf --alpha
1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format
%.3g
x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity
0.0 0.00 0.00 0.00
1.0 1.00e+10 1.00e+100 NaN
1.0000000001 3.68e+09 0.00 NaN
2.0 0.00 0.00 NaN
3.0 0.00 0.00 NaN
4.0 0.00 0.00 NaN

Poisson
---

This is the only distribution using the RegularizedGamma function that
has the option to set the epsilon and iterations (default 1e-12,
10000000). The others use the default in the RegularizedGamma (1e-15,
Max int).

Perhaps this should be simplified by removing the epsilon option.
Switching to the default in the RegularizedGamma should be
investigated.

Weibull
---

The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1
and be 1 when shape = 1.

java -jar target/examples-distribution.jar wbl pdf  -x 0,1e-6 --format %.3g
x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0
0.00000 0.00 0.00 0.00 0.00
1.00000e-06 500 1.00 0.00150 5.00e-24

Matlab will use the natural limit as x -> 0.

wblpdf(0, 1, 0.5) -> inf
wblpdf(0, 1, 1) -> 1


General changes:
---

Should we change from the use of constructors to factory methods? This
would allow instance specialisations:

geometric : probability of success = 1
normal : mean=0, standard deviation = 1
pareto : shape=inf

Conclusion
---

Many of the issues identified are trivial. However some are due to
different behaviour of the distributions at x=0 depending on the
parameterisation. In certain cases the pdf should go to infinity or
return a finite non-zero number. These would have been found by
expanding the test suite to use more parameterisations. Further
investigation of the behaviour of different implementations at x=0 is
required. Matlab will evaluate the function. Other implementations may
return NaN or throw an exception.

I am considering an update of the test suite to use file resources.
Each distribution will have resources containing expected results for
the pdf, cdf, icdf, survival function, etc which are dynamically
discovered. Adding a new test would require putting a new file into
the resources. This is only an idea and will require more
investigation as not all functions may require testing for each
parameterisation. For example some tests currently target the high
precision computations of the survival function. However what is
apparent is that the current test suite is not simple to add tests for
more than one parameterisation of the distribution (see
TruncatedNormalDistributionTest). The Gamma and Beta distributions
also use more extensive tests of different parameterisations in a
different pattern. The Gamma distribution has file resources computed
using a Maxima script for the distribution. A consolidation of the
test suite to make testing variations simpler would be helpful.

I will work on fixing the identified issues. Please try the new
application and determine if any more issues are present.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to