Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
> Gustavo Fernandez Bayon > on Thu, 24 Aug 2017 16:42:36 +0200 writes: > Hello, > While doing some enrichment tests using chisq.test() with simulated > p-values, I noticed some strange behaviour. The computed p-value was > extremely small, so I decided to dig a little deeper and debug > chisq.test(). I noticed then that the simulated statistics returned by the > following call > tmp <- .Call(C_chisq_sim, sr, sc, B, E) > were all the same, very small numbers. This, at first, seemed strange to > me. So I decided to do some simulations myself, and started playing around > with the r2dtable() function. Problem is, using my row and column > marginals, r2dtable() always returns the same matrix. Let's provide a > minimal example: > rr <- c(209410, 276167) > cc <- c(25000, 460577) > ms <- r2dtable(3, rr, cc) > I have tested this code in two machines and it always returned the same > list of length three containing the same matrix three times. The repeated > matrix is the following: > [[1]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 > [[2]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 > [[3]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 Yes. You can also do unique(r2dtable(100, rr, cc)) and see that the result is constant. I'm pretty sure this is still due to some integer overflow, in spite of the fact that I had spent quite some time to fix such problem in Dec 2003, see the 14 years old bug PR#5701 https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2 It has to be said that this is based on an algorithm published in 1981, specifically - from help(r2dtable) - Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. _Applied Statistics_ *30*, 91-97. For those with JSTOR access (typically via your University), available at http://www.jstor.org/stable/2346669 When I start reading it, indeed the algorithm seems start from the expected value of a cell entry and then "explore from there"... and I do wonder if there is not a flaw somewhere in the algorithm: I've now found that a bit more than a year ago, 'paljenczy' found on SO https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated that indeed the generated tables seem to be too much around the mean. Basically his example: https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); > A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) user system elapsed 0.218 0.025 0.244 > table(A11) 34 35 36 37 38 39 40 41 42 43 2 17 40129334883 2026 4522 8766 15786 44 45 46 47 48 49 50 51 52 53 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 54 55 56 57 58 59 60 61 62 63 59732 41474 26939 16006 8827 4633 2050865340116 64 65 66 67 38 13 7 1 > For a 2x2 table, there's really only one degree of freedom, hence the above characterizes the full distribution for that case. I would have expected to see all possible values in 0:100 instead of such a "normal like" distribution with carrier only in [34, 67]. There are newer publications and maybe algorithms. So maybe the algorithm is "flawed by design" for really large total number of observations, rather than wrong Seems interesting ... Martin Maechler __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
It is not about "really arge total number of observations", but: set.seed(4711);tabs <- r2dtable(1e6, c(2, 2), c(2, 2)); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1));table(A11) A11 0 1 2 166483 666853 14 There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being most common. Cheers, Jari O. From: R-devel on behalf of Martin Maechler Sent: 25 August 2017 11:30 To: Gustavo Fernandez Bayon Cc: r-devel@r-project.org Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly? > Gustavo Fernandez Bayon > on Thu, 24 Aug 2017 16:42:36 +0200 writes: > Hello, > While doing some enrichment tests using chisq.test() with simulated > p-values, I noticed some strange behaviour. The computed p-value was > extremely small, so I decided to dig a little deeper and debug > chisq.test(). I noticed then that the simulated statistics returned by the > following call > tmp <- .Call(C_chisq_sim, sr, sc, B, E) > were all the same, very small numbers. This, at first, seemed strange to > me. So I decided to do some simulations myself, and started playing around > with the r2dtable() function. Problem is, using my row and column > marginals, r2dtable() always returns the same matrix. Let's provide a > minimal example: > rr <- c(209410, 276167) > cc <- c(25000, 460577) > ms <- r2dtable(3, rr, cc) > I have tested this code in two machines and it always returned the same > list of length three containing the same matrix three times. The repeated > matrix is the following: > [[1]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 > [[2]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 > [[3]] > [,1] [,2] > [1,] 10782 198628 > [2,] 14218 261949 Yes. You can also do unique(r2dtable(100, rr, cc)) and see that the result is constant. I'm pretty sure this is still due to some integer overflow, in spite of the fact that I had spent quite some time to fix such problem in Dec 2003, see the 14 years old bug PR#5701 https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2 It has to be said that this is based on an algorithm published in 1981, specifically - from help(r2dtable) - Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. _Applied Statistics_ *30*, 91-97. For those with JSTOR access (typically via your University), available at http://www.jstor.org/stable/2346669 When I start reading it, indeed the algorithm seems start from the expected value of a cell entry and then "explore from there"... and I do wonder if there is not a flaw somewhere in the algorithm: I've now found that a bit more than a year ago, 'paljenczy' found on SO https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated that indeed the generated tables seem to be too much around the mean. Basically his example: https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); > A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) user system elapsed 0.218 0.025 0.244 > table(A11) 34 35 36 37 38 39 40 41 42 43 2 17 40129334883 2026 4522 8766 15786 44 45 46 47 48 49 50 51 52 53 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 54 55 56 57 58 59 60 61 62 63 59732 41474 26939 16006 8827 4633 2050865340116 64 65 66 67 38 13 7 1 > For a 2x2 table, there's really only one degree of freedom, hence the above characterizes the full distribution for that case. I would have expected to see all possible values in 0:100 instead of such a "normal like" distribution with carrier only in [34, 67]. There are newer publications and maybe algorithms. So maybe the algorithm is "flawed by design" for really large total number of observations, rather than wrong Seems interesting ... Martin Maechler __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
> On 25 Aug 2017, at 10:30 , Martin Maechler wrote: > [...] > https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > > >> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); >> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) > user system elapsed > 0.218 0.025 0.244 >> table(A11) > >34 35 36 37 38 39 40 41 42 43 > 2 17 40129334883 2026 4522 8766 15786 >44 45 46 47 48 49 50 51 52 53 > 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 >54 55 56 57 58 59 60 61 62 63 > 59732 41474 26939 16006 8827 4633 2050865340116 >64 65 66 67 >38 13 7 1 >> > > For a 2x2 table, there's really only one degree of freedom, > hence the above characterizes the full distribution for that > case. > > I would have expected to see all possible values in 0:100 > instead of such a "normal like" distribution with carrier only > in [34, 67]. Hmm, am I missing a point here? > round(dhyper(0:100,100,100,100)*1e6) [1] 0 0 0 0 0 0 0 0 0 0 [11] 0 0 0 0 0 0 0 0 0 0 [21] 0 0 0 0 0 0 0 0 0 0 [31] 0 0 0 1 4 13 43129355897 [41] 2087 4469 8819 16045 26927 41700 59614 78694 95943 108050 [51] 112416 108050 95943 78694 59614 41700 26927 16045 8819 4469 [61] 2087897355129 43 13 4 1 0 0 [71] 0 0 0 0 0 0 0 0 0 0 [81] 0 0 0 0 0 0 0 0 0 0 [91] 0 0 0 0 0 0 0 0 0 0 [101] 0 -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
> On 25 Aug 2017, at 11:23 , Jari Oksanen wrote: > > It is not about "really arge total number of observations", but: > > set.seed(4711);tabs <- r2dtable(1e6, c(2, 2), c(2, 2)); A11 <- vapply(tabs, > function(x) x[1, 1], numeric(1));table(A11) > > A11 > 0 1 2 > 166483 666853 14 > > There are three possible matrices, and these come out in proportions 1:4:1, > the one with all cells filled with ones being > most common. ... and > dhyper(0:2,2,2,2) [1] 0.167 0.667 0.167 > dhyper(0:2,2,2,2) *6 [1] 1 4 1 so that is exactly what you would expect. However, > dhyper(10782,209410, 276167, 25000) [1] 0.005230889 so you wouldn't expect 10782 to recur. It is close to the mean of the hypergeometric distribution, but > sim <- rhyper(1e6,209410, 276167, 25000) > mean(sim) [1] 10781.53 > sd(sim) [1] 76.14209 (and incidentally, rhyper is plenty fast enough that you don't really need r2dtable for the 2x2 case) -pd > > Cheers, Jari O. > > From: R-devel on behalf of Martin Maechler > > Sent: 25 August 2017 11:30 > To: Gustavo Fernandez Bayon > Cc: r-devel@r-project.org > Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly? > >> Gustavo Fernandez Bayon >>on Thu, 24 Aug 2017 16:42:36 +0200 writes: > >> Hello, >> While doing some enrichment tests using chisq.test() with simulated >> p-values, I noticed some strange behaviour. The computed p-value was >> extremely small, so I decided to dig a little deeper and debug >> chisq.test(). I noticed then that the simulated statistics returned by the >> following call > >> tmp <- .Call(C_chisq_sim, sr, sc, B, E) > >> were all the same, very small numbers. This, at first, seemed strange to >> me. So I decided to do some simulations myself, and started playing around >> with the r2dtable() function. Problem is, using my row and column >> marginals, r2dtable() always returns the same matrix. Let's provide a >> minimal example: > >> rr <- c(209410, 276167) >> cc <- c(25000, 460577) >> ms <- r2dtable(3, rr, cc) > >> I have tested this code in two machines and it always returned the same >> list of length three containing the same matrix three times. The repeated >> matrix is the following: > >> [[1]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > >> [[2]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > >> [[3]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > > Yes. You can also do > > unique(r2dtable(100, rr, cc)) > > and see that the result is constant. > > I'm pretty sure this is still due to some integer overflow, > > in spite of the fact that I had spent quite some time to fix > such problem in Dec 2003, see the 14 years old bug PR#5701 > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2 > > It has to be said that this is based on an algorithm published > in 1981, specifically - from help(r2dtable) - > > Patefield, W. M. (1981) Algorithm AS159. An efficient method of > generating r x c tables with given row and column totals. > _Applied Statistics_ *30*, 91-97. > > For those with JSTOR access (typically via your University), > available at http://www.jstor.org/stable/2346669 > > When I start reading it, indeed the algorithm seems start from the > expected value of a cell entry and then "explore from there"... > and I do wonder if there is not a flaw somewhere in the > algorithm: > > I've now found that a bit more than a year ago, 'paljenczy' found on SO > > https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > that indeed the generated tables seem to be too much around the mean. > Basically his example: > > https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > > >> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); >> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) > user system elapsed > 0.218 0.025 0.244 >> table(A11) > >34 35 36 37 38 39 40 41 42 43 > 2 17 40129334883 2026 4522 8766 15786 >44 45 46 47 48 49 50 51 52 53 > 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 >54 55 56 57 58 59 60 61 62 63 > 59732 41474 26939 16006 8827 4633 2050865340116 >64 65 66 67 >38 13 7 1 >> > > For a 2x2 table, there's really only one degree of freedom, > hence the above characterizes the full distribution for that > case. > > I would have expected to see all possible values in 0:100 > instead of such a "normal like" distribution with carrier only > in [34, 67]. > > There are newer publications and maybe algorithms. > So maybe the algorithm is "flawed by design" for really large > total number of observations, rather than wrong > Seems int
Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
> Peter Dalgaard > on Fri, 25 Aug 2017 11:43:40 +0200 writes: >> On 25 Aug 2017, at 10:30 , Martin Maechler wrote: >> > [...] >> https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated >> >> >>> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) >> user system elapsed >> 0.218 0.025 0.244 >>> table(A11) >> >> 34 35 36 37 38 39 40 41 42 43 >> 2 17 40129334883 2026 4522 8766 15786 >> 44 45 46 47 48 49 50 51 52 53 >> 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 >> 54 55 56 57 58 59 60 61 62 63 >> 59732 41474 26939 16006 8827 4633 2050865340116 >> 64 65 66 67 >> 38 13 7 1 >>> >> >> For a 2x2 table, there's really only one degree of freedom, >> hence the above characterizes the full distribution for that >> case. >> >> I would have expected to see all possible values in 0:100 >> instead of such a "normal like" distribution with carrier only >> in [34, 67]. > Hmm, am I missing a point here? >> round(dhyper(0:100,100,100,100)*1e6) > [1] 0 0 0 0 0 0 0 0 0 0 > [11] 0 0 0 0 0 0 0 0 0 0 > [21] 0 0 0 0 0 0 0 0 0 0 > [31] 0 0 0 1 4 13 43129355897 > [41] 2087 4469 8819 16045 26927 41700 59614 78694 95943 108050 > [51] 112416 108050 95943 78694 59614 41700 26927 16045 8819 4469 > [61] 2087897355129 43 13 4 1 0 0 > [71] 0 0 0 0 0 0 0 0 0 0 > [81] 0 0 0 0 0 0 0 0 0 0 > [91] 0 0 0 0 0 0 0 0 0 0 > [101] 0 No, you ain't, I was. :-( Martin > -- > Peter Dalgaard, Professor, > Center for Statistics, Copenhagen Business School > Solbjerg Plads 3, 2000 Frederiksberg, Denmark > Phone: (+45)38153501 > Office: A 4.23 > Email: pd@cbs.dk Priv: pda...@gmail.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
> On 25 Aug 2017, at 12:04 , Peter Dalgaard wrote: > >> There are three possible matrices, and these come out in proportions 1:4:1, >> the one with all cells filled with ones being >> most common. > > ... and > >> dhyper(0:2,2,2,2) > [1] 0.167 0.667 0.167 >> dhyper(0:2,2,2,2) *6 > [1] 1 4 1 > > so that is exactly what you would expect. And, incidentally, this is the "statistician's socks" puzzle from introductory probability: A statistician has 2 white socks and 2 black socks. Late for work a dark November morning, he puts on two socks at random. What is the probability that he goes to work with different colored socks? -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] A problem with headless R on WIndows
I need a version of R that has no dependency on GDI32.dll. Ideally, this would be a version of R that had no dependency on rgraphapp.dll. Is there a way to configure the build system to build a version of R like this? I've looked through the ./configure --help for seemingly appropriate flags, and haven't found any. OTOH, rgraphapp is in the extras folder, so it seems like R should be able to work without it. Again, is this possible? Thank you. [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel