I have had a look at the paper (http://www.plantphysiol.org/cgi/content/full/148/3/1189) to try to work out what they're actually doing, in the hope that I could figure out their procedure and we could give a more complete answer to the question. Unfortunately, I'm more confused now than before.
The numbers the OP refers to are in Table I: http://www.plantphysiol.org/cgi/content/full/148/3/1189/TBL1

The first four rows are (abbreviating the column titles):

Chromosome  Length (Mb)  Obs. no. genes  Exp. no. genes  Distribution test
LG I        32.16        36              30              0.137
LG II       23.44        25              22              0.235
LG III      17.45        13              17              0.235
LG IV       15.08        10              14              0.159

They are (somehow) working out the probability above or below the observed count, depending on whether the observed count happened to be above or below the expected count! Because the tail is chosen after seeing the data, their p-values are on average about half what they should be.

However, they don't seem to be using Poisson probabilities (as they claim): I can't reproduce the results with a Poisson distribution. Nor can I reproduce them with a normal approximation, with or without a continuity correction. It's not at all clear how they got their numbers, but they definitely don't seem to have done what they claim to have done (see below).

The note under Table I says (using some sort of "pseudo" LaTeX to indicate the Greek letters) "for distribution test $P(m(i,j) < \lambda(i,j)) <= \alpha$ or $P(m(i,j) > \lambda(i,j)) < \alpha$, a Poisson distribution was used to determine the significance of the F-box gene distribution in the Populus genome". That is, if they did what this says, they're working out probabilities of MORE extreme, not "at least as extreme", which is a second error.

[Note that (it's stated later) the expected numbers are estimated. They don't seem to take this into account (though the effect may be small). They're also doing a whole bunch of tests in this paper: 19 "p-values" in Table 1, another 273 in Table 3, 18 more in Table 4, 28 more in Table 5, another 24 in Table 7 ... and so on.]
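For anyone who wants to repeat the check, here is a minimal sketch of the one-sided Poisson tail calculation described above, written in Python with only the standard library. The function names (poisson_pmf, poisson_cdf, tail_probs) are my own, and it assumes the rounded expected counts printed in the table are the Poisson means, which may not be the unrounded values the authors used. It reports the "at least as extreme" tail, the strict "more extreme" tail that the table note describes, and a conventional doubled two-sided value for comparison with the published column.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    return min(1.0, sum(poisson_pmf(i, lam) for i in range(k + 1)))

def tail_probs(observed, expected):
    """One-sided tail probabilities in the direction of the observed deviation."""
    lower_incl = poisson_cdf(observed, expected)            # P(X <= obs)
    lower_strict = poisson_cdf(observed - 1, expected)      # P(X <  obs)
    upper_incl = 1.0 - poisson_cdf(observed - 1, expected)  # P(X >= obs)
    upper_strict = 1.0 - poisson_cdf(observed, expected)    # P(X >  obs)
    if observed >= expected:
        direction, incl, strict = "upper", upper_incl, upper_strict
    else:
        direction, incl, strict = "lower", lower_incl, lower_strict
    # Doubling the smaller inclusive tail is one common two-sided convention.
    two_sided = min(1.0, 2.0 * min(lower_incl, upper_incl))
    return {"direction": direction, "inclusive": incl,
            "strict": strict, "two_sided": two_sided}

# (label, observed, expected, published "distribution test" value) from Table I.
# The expected counts are the rounded figures shown in the table (assumption).
rows = [
    ("LG I", 36, 30, 0.137),
    ("LG II", 25, 22, 0.235),
    ("LG III", 13, 17, 0.235),
    ("LG IV", 10, 14, 0.159),
]

print(f"{'chrom':<8}{'tail':<7}{'incl':>8}{'strict':>8}{'2-sided':>9}{'paper':>8}")
for label, obs, exp_count, published in rows:
    r = tail_probs(obs, exp_count)
    print(f"{label:<8}{r['direction']:<7}{r['inclusive']:>8.3f}"
          f"{r['strict']:>8.3f}{r['two_sided']:>9.3f}{published:>8.3f}")
```

The inclusive and strict columns differ by exactly the probability of the observed count itself, so the gap between them shows how much the "more extreme" wording in the table note matters for each row.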
Under the "Materials and Methods" section, subsection "Localization of F-Box Genes in the Genome", they claim that the "probabilities $P(m(i) < \lambda(i))$ and $P(m(i) > \lambda(i))$ were evaluated under the cumulative Poisson distribution at $\alpha <= 0.05$ and $\alpha <= 0.01$ significance levels." Which is weird, because the actual p-values they seem to regard as worth mentioning in the paper are the ones at or below 0.0001.

[Is this sort of thing pretty typical for papers in biology? Don't referees even do a basic check of one or two numbers? Apparently I put too much effort into refereeing.]

Can anyone see what they're actually doing?

--
View this message in context: http://www.nabble.com/How-to-do-poisson-distribution-test-like-this--tp24696413p24711764.html
Sent from the R help mailing list archive at Nabble.com.