On 8/10/13 8:59 AM, Ajo Fod wrote: > This depends on data size. If it fits in memory, a single pass through the > sorted array to find the biggest differences would suffice. > > If the data doesn't fit, you probably need a StorelessQuantile estimator > like QuantileBin1D from the colt libraries. Then pick a resolution and do > the single pass search.
Thanks, Ajo. I have no problem computing the D_n,m statistics. My problem is in computing the exact p-values for the test. For that, I need to compute the exact distribution of D_n,m. Brute-forcing requires that you examine every element of n + m choose n. R seems to use a clever approach, but there is no documentation in the R sources on how the algorithm works. Moreover, my first attempts at Monte Carlo simulation don't give the same results. Most likely, I have not set the simulation up correctly. Any better ideas or references on how to compute the exact distribution would be appreciated. Phil > > Cheers, > -Ajo > > > On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.ste...@gmail.com> wrote: > >> I am working on MATH-437 (turning K-S distribution into a proper K-S >> test impl) and have to decide how to implement 2-sample tests. >> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a >> K-S distribution, so for large samples just using the cdf we already >> have is appropriate. For small samples (actually for any size >> sample), the test statistic distribution is discrete and can be >> computed exactly. A brute force way to do that is to enumerate all >> of the n-m partitions of {0, ..., n+m-1} and compute all the >> possible D_n,m values. R seems to use a more clever way to do >> this. Does anyone have a reference for an efficient way to compute >> the exact distribution, or background on where R got their >> implementation? >> >> Absent a "clever" approach, I see three alternatives and would >> appreciate some feedback: >> >> 0) just use the asymptotic distribution even for small samples >> 1) fully enumerate all n-m partitions and compute the D_n,m as above >> 1) use a monte carlo approach - instead of full enumeration of the >> D_n,m, randomly generate a large number of splits and compute the >> p-value for observed D_n,m by computing the number of random n-m >> splits generate a D value less than what is observed. >> >> Thanks in advance for any feedback / pointers. >> >> Phil >> >> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >> For additional commands, e-mail: dev-h...@commons.apache.org >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org