Dear Gonçalo, thanks for the additional information – I think I get now what you're trying to do.
> On 27 Jun 2016, at 06:35, Gonçalo Ferraz <gferra...@gmail.com> wrote: > > probabilities in lpvec should be <=1, but it is not. The sum is something on > the order of 1.48e-13.” > It is absolutely right that lpvec doesn’t contain probabilities, it contains > log-probabilities that are negative numbers. Also, clearly, the value of > 1.48e-13 is <=1! > What I meant is that I wanted to add probabilities in the log-probability > space Does that make any sense? Adding log probabilities corresponds to taking the product of the probabilities for k = 0 .. J-1, and I can't think of a reasonable interpretation of this product. Or do you perform a log-space addition, so you actually compute the log of the sum of probabilities? > and that I expected a negative value as a result. Instead, I am getting a > positive value. If you're performing log-space addition, then that's no surprise at all. The cumulative probability for k <= J that you're computing is equal to 1 up to 43 decimal places – a difference much smaller than the rounding errors you're accumulating with your formula (which unnecessarily uses 5 terms instead of the normal 3 in lveech, making things even worse). If you're directly adding the log probabilities, then I can't reproduce your result at all: > sum(lpvec) [1] -2835.86 In this case, I suspect that there's a mistake in the parts of the code you didn't show us. > You are also right that if I wanted to test for a positive association I > should add the probabilities of k>=J to obtain a p-value. This is clear to > me, but I want a metric of the strength of association between animals that > takes higher values when the association is stronger. A final result in the > log-probability space is useful because I want to use the strength of > association to build networks of individuals. What we usually do in computational linguistics is to take -log(p-value) as a measure of association because it has the desired properties: large values indicate stronger association, and it's measured in log-probability space (again, http://www.collocations.de/AM/section3.html explains this approach near the top of the page). This measure – and Dunning's related log-likelihood ratio – are used quite successfully for co-occurrences of words in my field. > Finally, your idea of using Fisher’s exact test is very appealing for the > simplicity and availability of a working function in R, but I am having > trouble interpreting the result, or the result is different from that of my > formula. Here’s a simple example: > > Say N=2, N1=1, N2=1, and J=1. In this case, if the individuals are completely > independent of each other, I expect to see them together with a probability > of 0.5. That is the output from my function, in probability space. If I run > Fisher’s exact test on the corresponding contingency table, however, I get a > p-value of 1. There are two possible outcomes of this experiment, J=0 and J=1, with equal probability 0.5. Fisher's test computes the tail probability of the observed outcome + all equally or less probable outcomes, i.e. P(J=1) + P(J=0) = 1. If you carry out a one-sided test, you'll get p=0.5. > Why should this be, when the Fisher’s exact test is giving me the probability > of obtaining the observed arrangement under the null hypothesis of no > association between the animals. A p-value isn't the probability of the observed arrangement, but the cumulative probability of the observed table + all more "extreme" tables that could have been observed (i.e. those which deviate further from the null hypothesis). If you need your p-values in log-space, you can't use fisher.test() but need to compute the hypergeometric distribution directly with dhyper() / phyper(). This is easy for a one-sided test (for positive association, i.e. atlernative="greater"): fisher.test(ct, alternative="greater")$p.value is the same as phyper(J-1, N1, N-N1, N2, lower.tail=FALSE) You can then instruct phyper() to return log probabilities and use -phyper(J-1, N1, N-N1, N2, lower.tail=FALSE, log.p=TRUE) as a measure of association strength. Best, Stefan ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.