On Thu, 10 Jan 2008, AndyZon wrote:

> Thank you so much, Chuck!
>
> This is brilliant, I just tried some dichotomous variables, it was really
> fast.

Yes, and if you are on a multicore system with multithreaded linear
algebra, crossprod() will distribute the job across the cores, making the
elapsed time shorter (by almost half on my Core 2 Duo MacBook, as long as I
have nothing else gobbling up CPU cycles)!

> Most categorical variables I am interested in are 3 levels, they are
> actually SNPs, I want to look at their interactions. My question is: after
> generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"?
> Should I just apply this function on these 2*n columns (originally I have n
> variables), and then operate on the generated cell counts?

If I followed you here, and you have ONLY those three categories, then yes.

Try a test case with perhaps 3 SNPs and a few subjects. Table the results
the old-fashioned way via table() or xtabs() or even by hand. Then look at
what

    crossprod( test.case )

gives you.

---

If '11' shows up, you'll have to use a 'contr.treatment' style approach.
( run 'example( contrasts )' and look at what is going on ).

Guess what these give:

    contrasts( factor( c( "00","01","10" ) ) )
    contrasts( factor( c( "00","01","10","11" ) ) )

then run them if you have trouble seeing why '11' changes the picture.

---

BTW, what I said (below) suggests that crossprod() returns integer values,
but its storage.mode is actually "double".

HTH,

Chuck

> I am confused about this.
>
> Your input will be greatly appreciated.
>
> Andy
>
>
> Charles C. Berry wrote:
>>
>> On Wed, 9 Jan 2008, AndyZon wrote:
>>
>>> Hi,
>>>
>>> I have a huge number of categorical variables, say at least 10000, and I
>>> put them into a matrix, each column is one variable. The question is: how
>>> can I make all of the pairwise cross tabulation tables efficiently? The
>>> straightforward solution is to use for-loops by looping two indexes on
>>> the table() function, but it was just too slow. Is there a more efficient
>>> way to do that? Any guidance will be greatly appreciated.
>>
>> The totals are merely the crossproducts of a suitably constructed binary
>> (zero-one) matrix that is used to encode the categories. See
>> '?contr.treatment' if you cannot grok 'suitably constructed'.
>>
>> If the categories are all dichotomies coded as 0:1, you can use
>>
>>     res <- crossprod( dat )
>>
>> to find the totals for the (1,1) cells.
>>
>> If you need the full tables, you can get them from the marginal totals
>> using
>>
>>     diag( res )
>>
>> to get the number in each '1' margin and
>>
>>     nrow(dat)
>>
>> to get the table total, from which the numbers in each '0' margin follow
>> by subtracting the corresponding '1' margin.
>>
>> With dichotomous variables, dat has 10000 columns and you will only need
>> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it takes
>> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a
>> redundant triangle.
>>
>> This approach generalizes to any number of categories:
>>
>> To extend this to more than two categories, you will need to do for each
>> such column what model.matrix(~factor( dat[,i] ) ) does by default
>> ( using 'contr.treatment' ) - construct zero-one codes for all but one
>> (reference) category.
>>
>> Note that with 10000 trichotomies, you will have a result with
>>
>>     10000^2 * ( 3-1 )^2
>>
>> integers needing about 3 Gigabytes, and so on.
>>
>> HTH,
>>
>> Chuck
>>
>> p.s. Why on Earth are you doing this????
>>
>>> Andy
>>> --
>>> View this message in context:
>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:[EMAIL PROTECTED]                  UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
> --
> View this message in context:
> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
> Sent from the R help mailing list archive at Nabble.com.
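
For anyone who wants to try the crossprod() idea from this thread, here is a
minimal sketch on made-up data. It is an illustration under stated
assumptions, not code from the posts above: the toy values, the helper
pair_table(), and the names snp, codes and res3 are hypothetical; only dat
and res follow the thread. The first part rebuilds a full 2x2 table from
crossprod(), diag() and nrow(); the second part builds treatment-style 0/1
codes for three-level variables and checks one cell against table().

```r
# --- Dichotomous case: 6 subjects (rows), 3 variables coded 0/1 (columns).
# The values are made up for illustration.
dat <- matrix(c(1, 0, 1,
                1, 1, 0,
                0, 1, 1,
                1, 1, 1,
                0, 0, 1,
                1, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("v1", "v2", "v3")))

res  <- crossprod(dat)  # res[i, j] = number of rows with both columns equal to 1
ones <- diag(res)       # count in the '1' margin of each variable
n    <- nrow(dat)       # table total

# Hypothetical helper: rebuild the 2x2 table for columns i and j
# (rows = value of column i, columns = value of column j).
pair_table <- function(i, j) {
  n11 <- res[i, j]
  n10 <- ones[i] - n11          # i is 1, j is 0
  n01 <- ones[j] - n11          # i is 0, j is 1
  n00 <- n - n11 - n10 - n01    # both 0
  matrix(c(n00, n10, n01, n11), nrow = 2,
         dimnames = list(c("0", "1"), c("0", "1")))
}

# Compare with the old-fashioned way.
direct <- table(dat[, 1], dat[, 2])
agree2 <- all(pair_table(1, 2) == direct)

# --- Three-level case: made-up genotypes for 2 SNPs, 6 subjects.
snp <- data.frame(s1 = factor(c("AA", "AB", "BB", "AB", "AA", "BB")),
                  s2 = factor(c("AB", "AB", "AA", "BB", "AA", "BB")))

# Treatment-style 0/1 codes: one column per non-reference level of each SNP.
codes <- do.call(cbind, lapply(names(snp), function(nm) {
  f <- snp[[nm]]
  m <- model.matrix(~ f)[, -1, drop = FALSE]   # drop the intercept column
  colnames(m) <- paste(nm, levels(f)[-1], sep = ".")
  m
}))

res3 <- crossprod(codes)  # cell counts for all pairs of non-reference levels

# One cell checked against table(): s1 == "AB" and s2 == "BB".
agree3 <- res3["s1.AB", "s2.BB"] == table(snp$s1, snp$s2)["AB", "BB"]
```

Cells involving a reference level are not in res3 directly; as in the
dichotomous case, they would have to be recovered from the margins and the
table total. Since the result is symmetric, only one triangle is needed.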