Re: [R] strange behavior of lchoose in combinatorics problem

Stefan Evert Tue, 28 Jun 2016 23:45:12 -0700

Dear Gonçalo,

thanks for the additional information – I think I get now what you're trying to 
do.

> On 27 Jun 2016, at 06:35, Gonçalo Ferraz <gferra...@gmail.com> wrote:
> 
> probabilities in lpvec should be <=1, but it is not. The sum is something on 
> the order of 1.48e-13.”
> It is absolutely right that lpvec doesn’t contain probabilities, it contains 
> log-probabilities that are negative numbers. Also, clearly, the value of 
> 1.48e-13 is <=1!
> What I meant is that I wanted to add probabilities in the log-probability 
> space

Does that make any sense?  Adding log probabilities corresponds to taking the 
product of the probabilities for k = 0 .. J-1, and I can't think of a 
reasonable interpretation of this product. 

Or do you perform a log-space addition, so you actually compute the log of the 
sum of probabilities?

> and that I expected a negative value as a result. Instead, I am getting a 
> positive value.

If you're performing log-space addition, then that's no surprise at all.  The 
cumulative probability for k <= J that you're computing is equal to 1 up to 43 
decimal places – a difference much smaller than the rounding errors you're 
accumulating with your formula (which unnecessarily uses 5 terms instead of the 
normal 3 in lveech, making things even worse).

If you're directly adding the log probabilities, then I can't reproduce your 
result at all:

> sum(lpvec)
[1] -2835.86

In this case, I suspect that there's a mistake in the parts of the code you 
didn't show us.

> You are also right that if I wanted to test for a positive association I 
> should add the probabilities of k>=J to obtain a p-value. This is clear to 
> me, but I want a metric of the strength of association between animals that 
> takes higher values when the association is stronger. A final result in the 
> log-probability space is useful because I want to use the strength of 
> association to build networks of individuals.

What we usually do in computational linguistics is to take -log(p-value) as a 
measure of association because it has the desired properties: large values 
indicate stronger association, and it's measured in log-probability space 
(again, http://www.collocations.de/AM/section3.html explains this approach near 
the top of the page).

This measure – and Dunning's related log-likelihood ratio – are used quite 
successfully for co-occurrences of words in my field.

> Finally, your idea of using Fisher’s exact test is very appealing for the 
> simplicity and availability of a working function in R, but I am having 
> trouble interpreting the result, or the result is different from that of my 
> formula. Here’s a simple example:
> 
> Say N=2, N1=1, N2=1, and J=1. In this case, if the individuals are completely 
> independent of each other, I expect to see them together with a probability 
> of 0.5. That is the output from my function, in probability space. If I run 
> Fisher’s exact test on the corresponding contingency table, however, I get a 
> p-value of 1.

There are two possible outcomes of this experiment, J=0 and J=1, with equal 
probability 0.5.  Fisher's test computes the tail probability of the observed 
outcome + all equally or less probable outcomes, i.e. P(J=1) + P(J=0) = 1.

If you carry out a one-sided test, you'll get p=0.5.

> Why should this be, when the Fisher’s exact test is giving me the probability 
> of obtaining the observed arrangement under the null hypothesis of no 
> association between the animals.

A p-value isn't the probability of the observed arrangement, but the cumulative 
probability of the observed table + all more "extreme" tables that could have 
been observed (i.e. those which deviate further from the null hypothesis).

If you need your p-values in log-space, you can't use fisher.test() but need to 
compute the hypergeometric distribution directly with dhyper() / phyper().  
This is easy for a one-sided test (for positive association, i.e. 
atlernative="greater"):

        fisher.test(ct, alternative="greater")$p.value

is the same as

        phyper(J-1, N1, N-N1, N2, lower.tail=FALSE)

You can then instruct phyper() to return log probabilities and use

        -phyper(J-1, N1, N-N1, N2, lower.tail=FALSE, log.p=TRUE)

as a measure of association strength.

Best,
Stefan
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] strange behavior of lchoose in combinatorics problem

Reply via email to