I have created an initial version of the statistics distributions application. It can be used with:
cd commons-statistics-examples/examples-distribution mvn package -Pexamples-distribution java -jar target/examples-distribution.jar There are 26 distributions each with a command. Each distribution command has sub-commands to evaluate functions: pdf, cdf, survival, icdf, logpdf I have created this entirely using picocli with sub-classes for each command and to hold default values. The default parameterisations and ranges for the distributions were copied from Wikipedia if available. The evaluated points and the distribution parameters can be changed: To create a sample of the PDF for the exponential distribution: java -jar target/examples-distribution.jar exp pdf java -jar target/examples-distribution.jar exp pdf --mean 3.45 java -jar target/examples-distribution.jar exp pdf \ --mean 3.45 --min 3 --max 5 --steps 50 To create a sample of the CDF for the Poisson distribution: java -jar target/examples-distribution.jar poisson cdf java -jar target/examples-distribution.jar poisson cdf --mean 3.45 java -jar target/examples-distribution.jar poisson cdf \ --mean 12.34 --min 0 --max 50 --increment 2 To create a sample of the PDF for the Gamma distribution with different shape and a fixed scale: java -jar target/examples-distribution.jar gamma pdf \ --shape 0.5,1,2 --scale 1 As a quick verification I added a hidden command named 'check'. It was added to verify the survival function is complementary to the CDF (so I did not have to create plots of this function and visually check it). This currently tests: cdf + survival = 1.0 icdf(0) = lower bound icdf(1) = upper bound pdf = exp(logpdf) x = icdf(cdf(x)) Any issues are output otherwise the command appears to do nothing. I checked all the output for PDF and CDF against the figures on wikipedia. This has revealed some issues. The following distributions are OK: Binomial Gumbel Hypergeometric Laplace Logistic LogNormal Normal Pascal T Triangular TruncatedNormal UniformContinuous UniformDiscrete Zipf The following have issues: Beta --- This is the only distribution parameterised using a reference to Wolfram Mathworld rather than wikipedia. The parameterisation matches that on wikipedia so could be updated. Major: This distribution can throw an exception: p(x=0) throws an exception when alpha<1. p(x=1) throws an exception when beta<1. This is not the case in matlab which returns the natural limit as +inf, e.g. betapdf(0, 0.5, 0.5) -> +inf The CDF is OK. x=0 when alpha<1: returns 0. x=1 when beta<1: returns 1. Thus you can evaluate the CDF but not the PDF which rejects the input point. Cauchy --- The median parameter is referred to as the location on Wikipedia and in R. I suggest updating parameters from median and scale to location and scale. ChiSquared --- The support should be (0, +inf) if k=1 otherwise [0, +inf). This is not reflected in the lower bound return. This is minor. The PDF is not correct at x=0. pdf(x=0) returns 0 when k<2. This should return higher according to matlab which uses the limit at x -> 0: chi2pdf(0, 1) -> +inf chi2pdf(0, 1.99) -> +inf chi2pdf(0, 2) -> 0.5 chi2pdf(0, 2.01) -> 0 chi2pdf(0, 3) -> 0 Exp --- The parameter is named the mean. Wikipedia uses either the scale or 1/scale = rate. R uses the rate parameterisation. Matlab uses the mean (mu). Possible change mean to scale (or label it as a scale parameter). F --- pdf(x=0) is always zero but it should go to infinity when df=1,df=2. Using a very small x shows the direction: java -jar target/examples-distribution.jar f pdf -x 0,1e-6 x df1=1.0,df2=1.0 df1=2.0,df2=1.0 df1=5.0,df2=2.0 df1=10.0,df2=1.0 df1=100.0,df2=100.0 0.00000 0.0 0.0 0.0 0.0 0.0 1.00000e-06 318.30956787422247 0.9999970000074999 2.470507804995697E-8 1.2304010764181617E-19 2.522031398015255E-264 Matlab: fpdf(0,1,1) -> +inf fpdf(0,1.9999,3) ->+inf fpdf(0,2,3) -> 1 fpdf(0,2.0001,1) -> 0 Gamma --- pdf(x=0) is zero when shape <= 1. This should be +inf gampdf(0, [0.5,0.9999,1,1.001], 1) ans = Inf Inf 1 0 java -jar target/examples-distribution.jar gamma pdf -x 0,1e-10 -k 0.5,0.9999,1,1.001 -t 1 --format %.2f x shape=0.5,scale=1.0 shape=0.9999,scale=1.0 shape=1.0,scale=1.0 shape=1.001,scale=1.0 0.00000 0.00 0.00 0.00 0.00 1.00000e-10 56418.95 1.00 1.00 0.98 Geometric --- Does not handle the special case p=1: - pmf and logpmf return NaN for x=1 - icdf(1.0) returns int max value (should be 0) - upper bound returns int max value Some occurrences where x != icdf(cdf(x)): java -jar target/examples-distribution.jar geo check I do not think this is a problem. It is the only distribution where the round trip fails. It is due to rounding errors during direct inversion of the computation for the CDF. Levy --- support is [mu, +inf) The distribution outputs NaN for the PDF at mu. It should output p=0 at x=mu. This is due to division by zero. This function is not in matlab or R. I checked the formula from wikipedia and it is implemented correctly. Nakagami --- Uses omega scale parameter. Wikipedia uses omega spread parameter. Matlab uses omega scale parameter. Check other implementations for scale vs spread. Pareto --- When alpha (shape) is large then pdf(x=1) == large. When alpha is infinite this becomes nan (all other x are nan as well). java -jar target/examples-distribution.jar pareto pdf --alpha 1e10,1e100,Infinity -x 0,1,1.0000000001,2,3,4 --xformat %s --format %.3g x xm=1.0,alpha=1.0E10 xm=1.0,alpha=1.0E100 xm=1.0,alpha=Infinity 0.0 0.00 0.00 0.00 1.0 1.00e+10 1.00e+100 NaN 1.0000000001 3.68e+09 0.00 NaN 2.0 0.00 0.00 NaN 3.0 0.00 0.00 NaN 4.0 0.00 0.00 NaN Poisson --- This is the only distribution using the RegularizedGamma function that has the option to set the epsilon and iterations (default 1e-12, 10000000). The others use the default in the RegularizedGamma (1e-15, Max int). Perhaps this should be simplified by removing the epsilon option. Switching to the default in the RegularizedGamma should be investigated. Weibull --- The p(x=0) == 0 when shape <= 1. It should go to infinity if shape < 1 and be 1 when shape = 1. java -jar target/examples-distribution.jar wbl pdf -x 0,1e-6 --format %.3g x alpha=0.5,beta=1.0 alpha=1.0,beta=1.0 alpha=1.5,beta=1.0 alpha=5.0,beta=1.0 0.00000 0.00 0.00 0.00 0.00 1.00000e-06 500 1.00 0.00150 5.00e-24 Matlab will use the natural limit as x -> 0. wblpdf(0, 1, 0.5) -> inf wblpdf(0, 1, 1) -> 1 General changes: --- Should we change from the use of constructors to factory methods? This would allow instance specialisations: geometric : probability of success = 1 normal : mean=0, standard deviation = 1 pareto : shape=inf Conclusion --- Many of the issues identified are trivial. However some are due to different behaviour of the distributions at x=0 depending on the parameterisation. In certain cases the pdf should go to infinity or return a finite non-zero number. These would have been found by expanding the test suite to use more parameterisations. Further investigation of the behaviour of different implementations at x=0 is required. Matlab will evaluate the function. Other implementations may return NaN or throw an exception. I am considering an update of the test suite to use file resources. Each distribution will have resources containing expected results for the pdf, cdf, icdf, survival function, etc which are dynamically discovered. Adding a new test would require putting a new file into the resources. This is only an idea and will require more investigation as not all functions may require testing for each parameterisation. For example some tests currently target the high precision computations of the survival function. However what is apparent is that the current test suite is not simple to add tests for more than one parameterisation of the distribution (see TruncatedNormalDistributionTest). The Gamma and Beta distributions also use more extensive tests of different parameterisations in a different pattern. The Gamma distribution has file resources computed using a Maxima script for the distribution. A consolidation of the test suite to make testing variations simpler would be helpful. I will work on fixing the identified issues. Please try the new application and determine if any more issues are present. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org