Hi. Le lun. 20 sept. 2021 à 00:59, Alex Herbert <alex.d.herb...@gmail.com> a écrit : > > I have created an initial version of the statistics distributions > application. It can be used with: > > cd commons-statistics-examples/examples-distribution > mvn package -Pexamples-distribution > java -jar target/examples-distribution.jar > > There are 26 distributions each with a command. Each distribution > command has sub-commands to evaluate functions: pdf, cdf, survival, > icdf, logpdf > > I have created this entirely using picocli with sub-classes for each > command and to hold default values. The default parameterisations and > ranges for the distributions were copied from Wikipedia if available. > The evaluated points and the distribution parameters can be changed: > > To create a sample of the PDF for the exponential distribution: > > java -jar target/examples-distribution.jar exp pdf > java -jar target/examples-distribution.jar exp pdf --mean 3.45 > java -jar target/examples-distribution.jar exp pdf \ > --mean 3.45 --min 3 --max 5 --steps 50 > > To create a sample of the CDF for the Poisson distribution: > > java -jar target/examples-distribution.jar poisson cdf > java -jar target/examples-distribution.jar poisson cdf --mean 3.45 > java -jar target/examples-distribution.jar poisson cdf \ > --mean 12.34 --min 0 --max 50 --increment 2 > > To create a sample of the PDF for the Gamma distribution with > different shape and > a fixed scale: > > java -jar target/examples-distribution.jar gamma pdf \ > --shape 0.5,1,2 --scale 1 > > As a quick verification I added a hidden command named 'check'. It was > added to verify the survival function is complementary to the CDF (so > I did not have to create plots of this function and visually check > it). This currently tests: > > cdf + survival = 1.0 > icdf(0) = lower bound > icdf(1) = upper bound > pdf = exp(logpdf) > x = icdf(cdf(x)) > > > Any issues are output otherwise the command appears to do nothing. > > I checked all the output for PDF and CDF against the figures on > wikipedia. This has revealed some issues. > > The following distributions are OK: > > Binomial > Gumbel > Hypergeometric > Laplace > Logistic > LogNormal > Normal > Pascal > T > Triangular > TruncatedNormal > UniformContinuous > UniformDiscrete > Zipf > > > The following have issues:
Great that your work has readily revealed them. > > Beta > --- > This is the only distribution parameterised using a reference to > Wolfram Mathworld rather than wikipedia. The parameterisation matches > that on wikipedia so could be updated. +1 for using a single "primary" reference (a.o. for naming the arguments). > > Major: > This distribution can throw an exception: > > p(x=0) throws an exception when alpha<1. > p(x=1) throws an exception when beta<1. > This is not the case in matlab which returns the natural limit as +inf, e.g. > > betapdf(0, 0.5, 0.5) -> +inf > > The CDF is OK. > x=0 when alpha<1: returns 0. > x=1 when beta<1: returns 1. > Thus you can evaluate the CDF but not the PDF which rejects the input point. > > Cauchy > --- > The median parameter is referred to as the location on Wikipedia and > in R. I suggest updating parameters from median and scale to location > and scale. > > ChiSquared > --- > The support should be (0, +inf) if k=1 otherwise [0, +inf). This is > not reflected in the lower bound return. This is minor. > > The PDF is not correct at x=0. > pdf(x=0) returns 0 when k<2. This should return higher according to > matlab which uses the limit at x -> 0: > > chi2pdf(0, 1) -> +inf > chi2pdf(0, 1.99) -> +inf > chi2pdf(0, 2) -> 0.5 > chi2pdf(0, 2.01) -> 0 > chi2pdf(0, 3) -> 0 > > Exp > --- > The parameter is named the mean. Wikipedia uses either the scale or > 1/scale = rate. R uses the rate parameterisation. Matlab uses the mean > (mu). Possible change mean to scale (or label it as a scale > parameter). > > F > --- > pdf(x=0) is always zero but it should go to infinity when df=1,df=2. > Using a very small x shows the direction: > > java -jar target/examples-distribution.jar f pdf -x 0,1e-6 > x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0 > df1=100.0,df2=100.0 > 0.00000 0.0 0.0 0.0 0.0 0.0 > 1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8 > 1.2304010764181617E-19 2.522031398015255E-264 > > Matlab: > > fpdf(0,1,1) -> +inf > fpdf(0,1.9999,3) ->+inf > fpdf(0,2,3) -> 1 > fpdf(0,2.0001,1) -> 0 > > Gamma > --- > > pdf(x=0) is zero when shape <= 1. This should be +inf > > gampdf(0, [0.5,0.9999,1,1.001], 1) > > ans = > > Inf Inf 1 0 > > java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k > 0.5,0.9999,1,1.001 -t 1 --format %.2f > x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0 > shape=1.001,scale=1.0 > 0.00000 0.00 0.00 0.00 0.00 > 1.00000e-10 56418.95 1.00 1.00 0.98 > > Geometric > --- > > Does not handle the special case p=1: > - pmf and logpmf return NaN for x=1 > - icdf(1.0) returns int max value (should be 0) > - upper bound returns int max value > > Some occurrences where x != icdf(cdf(x)): > > java -jar target/examples-distribution.jar geo check > > I do not think this is a problem. It is the only distribution where > the round trip fails. It is due to rounding errors during direct > inversion of the computation for the CDF. > > Levy > --- > > support is [mu, +inf) > The distribution outputs NaN for the PDF at mu. It should output p=0 > at x=mu. This is due to division by zero. > This function is not in matlab or R. I checked the formula from > wikipedia and it is implemented correctly. > > > Nakagami > --- > > Uses omega scale parameter. > Wikipedia uses omega spread parameter. > Matlab uses omega scale parameter. > Check other implementations for scale vs spread. > > Pareto > --- > > When alpha (shape) is large then pdf(x=1) == large. > When alpha is infinite this becomes nan (all other x are nan as well). > > java -jar target/examples-distribution.jar pareto pdf --alpha > 1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format > %.3g > x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity > 0.0 0.00 0.00 0.00 > 1.0 1.00e+10 1.00e+100 NaN > 1.0000000001 3.68e+09 0.00 NaN > 2.0 0.00 0.00 NaN > 3.0 0.00 0.00 NaN > 4.0 0.00 0.00 NaN > > Poisson > --- > > This is the only distribution using the RegularizedGamma function that > has the option to set the epsilon and iterations (default 1e-12, > 10000000). The others use the default in the RegularizedGamma (1e-15, > Max int). > > Perhaps this should be simplified by removing the epsilon option. > Switching to the default in the RegularizedGamma should be > investigated. > > Weibull > --- > > The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1 > and be 1 when shape = 1. > > java -jar target/examples-distribution.jar wbl pdf -x 0,1e-6 --format %.3g > x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0 > 0.00000 0.00 0.00 0.00 0.00 > 1.00000e-06 500 1.00 0.00150 5.00e-24 > > Matlab will use the natural limit as x -> 0. > > wblpdf(0, 1, 0.5) -> inf > wblpdf(0, 1, 1) -> 1 > > > General changes: > --- > > Should we change from the use of constructors to factory methods? +1 > This > would allow instance specialisations: > > geometric : probability of success = 1 > normal : mean=0, standard deviation = 1 > pareto : shape=inf > > Conclusion > --- > > Many of the issues identified are trivial. However some are due to > different behaviour of the distributions at x=0 depending on the > parameterisation. In certain cases the pdf should go to infinity or > return a finite non-zero number. These would have been found by > expanding the test suite to use more parameterisations. Further > investigation of the behaviour of different implementations at x=0 is > required. Matlab will evaluate the function. Other implementations may > return NaN or throw an exception. > > I am considering an update of the test suite to use file resources. +1 I've used "@CsvFileSource" in [Math] "SimplexOptimizerTest". > Each distribution will have resources containing expected results for > the pdf, cdf, icdf, survival function, etc which are dynamically > discovered. Adding a new test would require putting a new file into > the resources. This is only an idea and will require more > investigation as not all functions may require testing for each > parameterisation. For example some tests currently target the high > precision computations of the survival function. However what is > apparent is that the current test suite is not simple to add tests for > more than one parameterisation of the distribution (see > TruncatedNormalDistributionTest). The Gamma and Beta distributions > also use more extensive tests of different parameterisations in a > different pattern. The Gamma distribution has file resources computed > using a Maxima script for the distribution. A consolidation of the > test suite to make testing variations simpler would be helpful. Sure. Still quite some work before a first release... :-} > > I will work on fixing the identified issues. Please try the new > application and determine if any more issues are present. > Regards, Gilles --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org