Re: [statistics] Release 1.0

Gilles Sadowski Sun, 19 Sep 2021 17:33:28 -0700

Hi.

Le lun. 20 sept. 2021 à 00:59, Alex Herbert <alex.d.herb...@gmail.com> a écrit :
>
> I have created an initial version of the statistics distributions
> application. It can be used with:
>
> cd commons-statistics-examples/examples-distribution
> mvn package -Pexamples-distribution
> java -jar target/examples-distribution.jar
>
> There are 26 distributions each with a command. Each distribution
> command has sub-commands to evaluate functions: pdf, cdf, survival,
> icdf, logpdf
>
> I have created this entirely using picocli with sub-classes for each
> command and to hold default values. The default parameterisations and
> ranges for the distributions were copied from Wikipedia if available.
> The evaluated points and the distribution parameters can be changed:
>
> To create a sample of the PDF for the exponential distribution:
>
>     java -jar target/examples-distribution.jar exp pdf
>     java -jar target/examples-distribution.jar exp pdf --mean 3.45
>     java -jar target/examples-distribution.jar exp pdf \
>         --mean 3.45 --min 3 --max 5 --steps 50
>
> To create a sample of the CDF for the Poisson distribution:
>
>     java -jar target/examples-distribution.jar poisson cdf
>     java -jar target/examples-distribution.jar poisson cdf --mean 3.45
>     java -jar target/examples-distribution.jar poisson cdf \
>         --mean 12.34 --min 0 --max 50 --increment 2
>
> To create a sample of the PDF for the Gamma distribution with
> different shape and
> a fixed scale:
>
>     java -jar target/examples-distribution.jar gamma pdf \
>         --shape 0.5,1,2 --scale 1
>
> As a quick verification I added a hidden command named 'check'. It was
> added to verify the survival function is complementary to the CDF (so
> I did not have to create plots of this function and visually check
> it). This currently tests:
>
> cdf + survival = 1.0
> icdf(0) = lower bound
> icdf(1) = upper bound
> pdf = exp(logpdf)
> x = icdf(cdf(x))
>
>
> Any issues are output otherwise the command appears to do nothing.
>
> I checked all the output for PDF and CDF against the figures on
> wikipedia. This has revealed some issues.
>
> The following distributions are OK:
>
> Binomial
> Gumbel
> Hypergeometric
> Laplace
> Logistic
> LogNormal
> Normal
> Pascal
> T
> Triangular
> TruncatedNormal
> UniformContinuous
> UniformDiscrete
> Zipf
>
>
> The following have issues:


Great that your work has readily revealed them.

>
> Beta
> ---
> This is the only distribution parameterised using a reference to
> Wolfram Mathworld rather than wikipedia. The parameterisation matches
> that on wikipedia so could be updated.

+1 for using a single "primary" reference (a.o. for naming the
arguments).

>
> Major:
> This distribution can throw an exception:
>
> p(x=0) throws an exception when alpha<1.
> p(x=1) throws an exception when beta<1.
> This is not the case in matlab which returns the natural limit as +inf, e.g.
>
> betapdf(0, 0.5, 0.5) -> +inf
>
> The CDF is OK.
> x=0 when alpha<1: returns 0.
> x=1 when beta<1: returns 1.
> Thus you can evaluate the CDF but not the PDF which rejects the input point.
>
> Cauchy
> ---
> The median parameter is referred to as the location on Wikipedia and
> in R. I suggest updating parameters from median and scale to location
> and scale.
>
> ChiSquared
> ---
> The support should be (0, +inf) if k=1 otherwise [0, +inf). This is
> not reflected in the lower bound return. This is minor.
>
> The PDF is not correct at x=0.
> pdf(x=0) returns 0 when k<2. This should return higher according to
> matlab which uses the limit at x -> 0:
>
> chi2pdf(0, 1) -> +inf
> chi2pdf(0, 1.99) -> +inf
> chi2pdf(0, 2) -> 0.5
> chi2pdf(0, 2.01) -> 0
> chi2pdf(0, 3) -> 0
>
> Exp
> ---
> The parameter is named the mean. Wikipedia uses either the scale or
> 1/scale = rate. R uses the rate parameterisation. Matlab uses the mean
> (mu). Possible change mean to scale (or label it as a scale
> parameter).
>
> F
> ---
> pdf(x=0) is always zero but it should go to infinity when df=1,df=2.
> Using a very small x shows the direction:
>
> java -jar target/examples-distribution.jar f pdf -x 0,1e-6
> x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0
> df1=100.0,df2=100.0
> 0.00000 0.0 0.0 0.0 0.0 0.0
> 1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8
> 1.2304010764181617E-19 2.522031398015255E-264
>
> Matlab:
>
> fpdf(0,1,1) -> +inf
> fpdf(0,1.9999,3) ->+inf
> fpdf(0,2,3) -> 1
> fpdf(0,2.0001,1) -> 0
>
> Gamma
> ---
>
> pdf(x=0) is zero when shape <= 1. This should be +inf
>
> gampdf(0, [0.5,0.9999,1,1.001], 1)
>
> ans =
>
>    Inf   Inf     1     0
>
> java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k
> 0.5,0.9999,1,1.001 -t 1 --format %.2f
> x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0
> shape=1.001,scale=1.0
> 0.00000 0.00 0.00 0.00 0.00
> 1.00000e-10 56418.95 1.00 1.00 0.98
>
> Geometric
> ---
>
> Does not handle the special case p=1:
> - pmf and logpmf return NaN for x=1
> - icdf(1.0) returns int max value (should be 0)
> - upper bound returns int max value
>
> Some occurrences where x != icdf(cdf(x)):
>
> java -jar target/examples-distribution.jar geo check
>
> I do not think this is a problem. It is the only distribution where
> the round trip fails. It is due to rounding errors during direct
> inversion of the computation for the CDF.
>
> Levy
> ---
>
> support is [mu, +inf)
> The distribution outputs NaN for the PDF at mu. It should output p=0
> at x=mu. This is due to division by zero.
> This function is not in matlab or R. I checked the formula from
> wikipedia and it is implemented correctly.
>
>
> Nakagami
> ---
>
> Uses omega scale parameter.
> Wikipedia uses omega spread parameter.
> Matlab uses omega scale parameter.
> Check other implementations for scale vs spread.
>
> Pareto
> ---
>
> When alpha (shape) is large then pdf(x=1) == large.
> When alpha is infinite this becomes nan (all other x are nan as well).
>
> java -jar target/examples-distribution.jar pareto pdf --alpha
> 1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format
> %.3g
> x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity
> 0.0 0.00 0.00 0.00
> 1.0 1.00e+10 1.00e+100 NaN
> 1.0000000001 3.68e+09 0.00 NaN
> 2.0 0.00 0.00 NaN
> 3.0 0.00 0.00 NaN
> 4.0 0.00 0.00 NaN
>
> Poisson
> ---
>
> This is the only distribution using the RegularizedGamma function that
> has the option to set the epsilon and iterations (default 1e-12,
> 10000000). The others use the default in the RegularizedGamma (1e-15,
> Max int).
>
> Perhaps this should be simplified by removing the epsilon option.
> Switching to the default in the RegularizedGamma should be
> investigated.
>
> Weibull
> ---
>
> The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1
> and be 1 when shape = 1.
>
> java -jar target/examples-distribution.jar wbl pdf  -x 0,1e-6 --format %.3g
> x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0
> 0.00000 0.00 0.00 0.00 0.00
> 1.00000e-06 500 1.00 0.00150 5.00e-24
>
> Matlab will use the natural limit as x -> 0.
>
> wblpdf(0, 1, 0.5) -> inf
> wblpdf(0, 1, 1) -> 1
>
>
> General changes:
> ---
>
> Should we change from the use of constructors to factory methods?

+1

> This
> would allow instance specialisations:
>
> geometric : probability of success = 1
> normal : mean=0, standard deviation = 1
> pareto : shape=inf
>
> Conclusion
> ---
>
> Many of the issues identified are trivial. However some are due to
> different behaviour of the distributions at x=0 depending on the
> parameterisation. In certain cases the pdf should go to infinity or
> return a finite non-zero number. These would have been found by
> expanding the test suite to use more parameterisations. Further
> investigation of the behaviour of different implementations at x=0 is
> required. Matlab will evaluate the function. Other implementations may
> return NaN or throw an exception.
>
> I am considering an update of the test suite to use file resources.

+1
I've used "@CsvFileSource" in [Math] "SimplexOptimizerTest".

> Each distribution will have resources containing expected results for
> the pdf, cdf, icdf, survival function, etc which are dynamically
> discovered. Adding a new test would require putting a new file into
> the resources. This is only an idea and will require more
> investigation as not all functions may require testing for each
> parameterisation. For example some tests currently target the high
> precision computations of the survival function. However what is
> apparent is that the current test suite is not simple to add tests for
> more than one parameterisation of the distribution (see
> TruncatedNormalDistributionTest). The Gamma and Beta distributions
> also use more extensive tests of different parameterisations in a
> different pattern. The Gamma distribution has file resources computed
> using a Maxima script for the distribution. A consolidation of the
> test suite to make testing variations simpler would be helpful.

Sure.  Still quite some work before a first release... :-}

>
> I will work on fixing the identified issues. Please try the new
> application and determine if any more issues are present.
>

Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [statistics] Release 1.0

Reply via email to