Thank you so much, Chuck! This is brilliant. I just tried it on some dichotomous variables, and it was really fast.
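A minimal sketch of the dichotomous case from the quoted reply below, run on simulated data. The matrix sizes, the seed, and the helper name `table_2x2` are assumptions made for illustration only; they are not from the original thread.

```r
# Simulated stand-in data: 200 rows by 5 dichotomous variables coded 0/1.
set.seed(1)
dat <- matrix(rbinom(200 * 5, size = 1, prob = 0.4), nrow = 200,
              dimnames = list(NULL, paste0("v", 1:5)))

res  <- crossprod(dat)   # res[i, j] = number of rows where v_i == 1 and v_j == 1
ones <- diag(res)        # the '1' margin of each variable
n    <- nrow(dat)        # the table total

# Hypothetical helper: rebuild the full 2 x 2 table for variables i and j
# from the (1,1) count and the two '1' margins.
table_2x2 <- function(i, j) {
  n11 <- res[i, j]
  n10 <- ones[i] - n11             # v_i == 1, v_j == 0
  n01 <- ones[j] - n11             # v_i == 0, v_j == 1
  n00 <- n - n11 - n10 - n01       # v_i == 0, v_j == 0
  matrix(c(n00, n10, n01, n11), nrow = 2,
         dimnames = list(c("0", "1"), c("0", "1")))
}

table_2x2(1, 2)                    # rows: levels of v1, columns: levels of v2
```

The result should agree cell by cell with `table(dat[, 1], dat[, 2])`, which is a convenient spot check on a few pairs before running the full data.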
Most of the categorical variables I am interested in have 3 levels; they are actually SNPs, and I want to look at their interactions. My question is: after generating the 0-1 codings (like 00, 01, 10), how should I use crossprod()? Should I just apply this function to the resulting 2*n columns (originally I have n variables), and then operate on the generated cell counts? I am confused about this.

Your input will be greatly appreciated.

Andy


Charles C. Berry wrote:
>
> On Wed, 9 Jan 2008, AndyZon wrote:
>
>>
>> Hi,
>>
>> I have a huge number of categorical variables, say at least 10000, and I put
>> them into a matrix, each column is one variable. The question is: how can I
>> make all of the pairwise cross tabulation tables efficiently? The
>> straightforward solution is to use for-loops by looping two indexes on the
>> table() function, but it was just too slow. Is there a more efficient way to
>> do that? Any guidance will be greatly appreciated.
>
> The totals are merely the crossproducts of a suitably constructed binary
> (zero-one) matrix used to encode the categories. See '?contr.treatment'
> if you cannot grok 'suitably constructed'.
>
> If the categories are all dichotomies coded as 0:1, you can use
>
>    res <- crossprod( dat )
>
> to find the totals for the (1,1) cells.
>
> If you need the full tables, you can get them from the marginal totals
> using
>
>    diag( res )
>
> to get the number in each '1' margin and
>
>    nrow(dat)
>
> to get the table total, from which the numbers in each '0' margin follow by
> subtracting the corresponding '1' margin.
>
> With dichotomous variables, dat has 10000 columns and you will only need
> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it takes
> about 20 seconds to run 1000 rows on my MacBook.
> Of course, 'res' has a redundant triangle.
>
> This approach generalizes to any number of categories:
>
> To extend this to more than two categories, you will need to do for each
> such column what model.matrix(~factor( dat[,i] ) ) does by default
> ( using 'contr.treatment' ) - construct zero-one codes for all but one
> (reference) category.
>
> Note that with 10000 trichotomies, you will have a result with
>
>    10000^2 * ( 3-1 )^2
>
> integers needing about 3 Gigabytes, and so on.
>
> HTH,
>
> Chuck
>
> p.s. Why on Earth are you doing this????
>
>>
>> Andy
>> --
>> View this message in context:
>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:[EMAIL PROTECTED]                  UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>

--
View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
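One way the three-level question above might be handled, sketched on simulated genotypes. The names `geno`, `X`, and `table_3x3`, the 0/1/2 genotype coding, and the data sizes are all assumptions for illustration. The indicator columns are built directly instead of through model.matrix(); they should match the non-intercept columns that model.matrix(~factor(x)) produces under 'contr.treatment' when level 0 is the reference. The idea is to apply crossprod() once to all 2*n columns, then recover the reference-level cells of each pair by subtraction from the margins.

```r
# Simulated stand-in data: 300 subjects by 4 SNPs, genotypes coded 0, 1, 2.
set.seed(42)
n_snp <- 4
geno  <- matrix(sample(0:2, 300 * n_snp, replace = TRUE), nrow = 300)

# Treatment-style 0-1 coding with level 0 as the reference:
# SNP k occupies column 2k-1 (indicator of level 1) and column 2k (level 2).
X <- do.call(cbind, lapply(seq_len(n_snp), function(k)
  cbind(geno[, k] == 1, geno[, k] == 2) + 0L))

res  <- crossprod(X)   # counts for every pair of non-reference levels
marg <- diag(res)      # per-column totals: how often each SNP is at level 1 or 2
N    <- nrow(X)        # table total

# Hypothetical helper: rebuild the full 3 x 3 table for SNPs i and j
# from the 2 x 2 block of 'res' and the margins.
table_3x3 <- function(i, j) {
  ci <- c(2 * i - 1, 2 * i)
  cj <- c(2 * j - 1, 2 * j)
  inner <- res[ci, cj]                  # SNP i at levels 1:2 by SNP j at levels 1:2
  row0  <- marg[cj] - colSums(inner)    # SNP i at level 0, SNP j at levels 1:2
  col0  <- marg[ci] - rowSums(inner)    # SNP i at levels 1:2, SNP j at level 0
  n00   <- N - sum(inner) - sum(row0) - sum(col0)
  out   <- rbind(c(n00, row0), cbind(col0, inner))
  dimnames(out) <- list(c("0", "1", "2"), c("0", "1", "2"))
  out
}

table_3x3(1, 3)        # rows: genotype of SNP 1, columns: genotype of SNP 3
```

Using three indicator columns per SNP instead of two would give all nine cells straight from crossprod() with no subtraction step, but the result would then hold (3n)^2 entries instead of (2n)^2, so the two-column coding looks like the better fit when n is around 10000.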