Hello,

Yesterday wasn't one of my days.
The main problem I see is that the KS test is meant for continuous data, while you have count data assumed to follow a Poisson distribution. That might explain the nonsensical results you are getting from ks.test.

Have you considered a chi-squared GOF test?
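
A minimal sketch of what I mean, with an illustrative Poisson(500) reference and arbitrary bin boundaries, both my own assumptions, not taken from your code (and note that if lambda were estimated from the data, the degrees of freedom would need adjusting):

#simulate count data and bin it
set.seed(1234)
x <- rpois(1000, 500)
breaks <- c(-Inf, seq(450, 550, by = 25), Inf)
obs <- table(cut(x, breaks))

#expected bin probabilities under the hypothesised Poisson(500);
#cut() uses (a, b] intervals, matching ppois(b) - ppois(a)
p <- diff(ppois(breaks, lambda = 500))

#chi-squared goodness-of-fit test of observed vs expected counts
chisq.test(as.vector(obs), p = p)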

Hope this helps,

Rui Barradas

At 20:51 on 05/09/19, Boo G. wrote:
Hello and thanks for your patience.

As far as I understand, the paper by Marsaglia and colleagues refers to CDF 
samples (i.e. from a hypothetical distribution, e.g. a Poisson), while I have 
an ECDF sample (i.e. (pseudo-)observed data, e.g. rpois(1000, 500)). In my 
study, I am actually comparing the statistical distribution of people per 
hectare in a region (~50,000 observations) with samples from that distribution.

I agree that we cannot rely on p-values at low sample sizes. Still, somewhere 
above a sample size of 100, I would expect to see p-values between 0 and 0.05 
(rejecting the hypothesis that the sample comes from the reference 
distribution). Ideally, I would try the procedure suggested by Marsaglia to 
compute p-values, but this is far beyond my coding skills.




On 5 Sep 2019, at 21:21, Rui Barradas <ruipbarra...@sapo.pt> wrote:

Hello,

Inline.

At 20:09 on 05/09/19, Boo G. wrote:
Hello again.
I have tried this before, but I see two problems:
1) According to the documentation I could read (including the ks.test code), 
the KS statistic should be max(abs(x - y)), and if you plot this for very low 
sample sizes you can actually see that this makes sense. However, 
ks.test(x, y) yields very different values.

The problem is that the distribution of Dn is very difficult to compute. From 
the reference [1] in the help page ?ks.test:

    Kolmogorov's goodness-of-fit measure, Dn, for a sample CDF has 
    consistently been set aside for methods such as the Dn+ or Dn- of Smirnov, 
    primarily, it seems, because of the difficulty of computing the 
    distribution of Dn. As far as we know, no easy way to compute that 
    distribution has ever been provided in the 70+ years since Kolmogorov's 
    fundamental paper. We provide one here, a C procedure that provides 
    Pr(Dn < d) with 13-15 digit accuracy for n ranging from 2 to at least 
    16000.


That is why I used ks.test and its Dn and p-values. Note that the procedure 
requires n >= 2, so size = 1 is not covered (its p-value is simply 1).
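
In fact, the exact two-sided p-value of the one-sample ks.test is computed 
with that very procedure. A toy continuous example, my own and not your data 
(exact p-values are unavailable in the presence of ties):

set.seed(1234)
x <- rnorm(100)
#exact two-sided p-value, computed with the algorithm of [1]
ks.test(x, "pnorm", exact = TRUE)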

Also, the distribution of the p-values seems to get closer to uniform as the 
sample size increases. Try

hist(d_stat[801:1000, 3])
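
And when exact p-values are not used, ks.test falls back on the asymptotic 
Kolmogorov limiting distribution, which you could also code yourself. A sketch 
(kolmogorov_p is my own helper, truncating the alternating series at k_max 
terms):

#asymptotic two-sided p-value, 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 t^2),
#with t = sqrt(n) * D; for two samples use n = n1*n2/(n1 + n2)
kolmogorov_p <- function(d, n, k_max = 100) {
  t <- sqrt(n) * d
  k <- seq_len(k_max)
  #clamp to [0, 1]: the truncated series can overshoot for small t
  min(1, max(0, 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * t^2))))
}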


[1] https://www.jstatsoft.org/article/view/v008i18


Hope this helps,

Rui Barradas

2) Also in this case, the p-values don't make much sense according to my 
previous interpretation.
Again, I could be wrong in my interpretation.
On 5 Sep 2019, at 20:46, Rui Barradas <ruipbarra...@sapo.pt> wrote:

Hello,

I'm sorry, but apparently I missed the point of your problem.
Please do not take my previous answer seriously.

But you can use ks.test, just in a different way from what I wrote previously.

Corrected code (it replaces the simulation loop below and reuses 
library(spatstat) and the objects d_1, p_1, m_1 and d_stat created there):


#simulation
for (i in 1:1000) {
  #sample from the reference distribution
  m_2 <- m_1[sample(nrow(m_1), size = i, prob = p_1, replace = FALSE), ]
  m_2 <- m_2[order(m_2$d_1), ]
  d_2 <- m_2$d_1
  p_2 <- m_2$p_1

  #weighted ecdf for the reference distribution and the sample
  f_d_1 <- ewcdf(d_1, normalise = FALSE)
  f_d_2 <- ewcdf(d_2, 1/p_2, normalise = FALSE, adjust = 1/length(d_2))

  #kolmogorov-smirnov distance and p-value from ks.test
  x <- f_d_1(d_2)
  y <- f_d_2(d_2)
  ht <- ks.test(x, y)
  d_stat[i, 2] <- ht$statistic
  d_stat[i, 3] <- ht$p.value
}


Hope this helps,

Rui Barradas

At 19:29 on 05/09/19, Rui Barradas wrote:
Hello,
I don't have the algorithms at hand, but the KS statistic calculation is more 
complicated than your max/abs difference.
Anyway, why not use ks.test? It's not that difficult:
set.seed(1234)

#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)

#data frame to store the values of the simulation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")

#simulation
for (i in 1:1000) {
  #sample from the reference distribution
  m_2 <- m_1[sample(nrow(m_1), size = i, prob = p_1, replace = FALSE), ]
  d_2 <- m_2$d_1
  #kolmogorov-smirnov distance and p-value
  ht <- ks.test(d_1, d_2)
  d_stat[i, 2] <- ht$statistic
  d_stat[i, 3] <- ht$p.value
}

hist(d_stat[, 2])
hist(d_stat[, 3])
Note that d_2 is not sorted, but the results are equal in the sense of the 
function identical(), meaning they are *exactly* the same. Why shouldn't they 
be?
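A quick check with toy numbers (my own, purely illustrative); ks.test sorts 
the data internally, so the input order cannot matter:

x <- c(5, 1, 3, 2, 4)
y <- c(2.5, 7.5, 0.5)
#same statistic whether or not the samples are sorted first
identical(ks.test(x, y)$statistic, ks.test(sort(x), sort(y))$statistic)
## TRUE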
Hope this helps,
Rui Barradas
At 17:06 on 05/09/19, Boo G. wrote:
Hello,

I am trying to perform a Kolmogorov–Smirnov test to assess the difference 
between a distribution and samples of different sizes drawn with probability 
proportional to size. I managed to compute the Kolmogorov–Smirnov distance, 
but I am lost with the p-value. I have looked into the ks.test function 
unsuccessfully. Can anyone help me with computing p-values for a two-tailed 
test?

Below is a simplified version of my code.

Thanks in advance.
Gianluca


library(spatstat)

#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)

#data frame to store the values of the simulation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")

#simulation
for (i in 1:1000) {
  #sample from the reference distribution
  m_2 <- m_1[sample(nrow(m_1), size = i, prob = p_1, replace = FALSE), ]
  m_2 <- m_2[order(m_2$d_1), ]
  d_2 <- m_2$d_1
  p_2 <- m_2$p_1

  #weighted ecdf for the reference distribution and the sample
  f_d_1 <- ewcdf(d_1, normalise = FALSE)
  f_d_2 <- ewcdf(d_2, 1/p_2, normalise = FALSE, adjust = 1/length(d_2))

  #kolmogorov-smirnov distance
  d_stat[i, 2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}

