Dear folks – I have a question, though it is more of a logic or good-practices question than a programming question per se. I am working with data from the American Community Survey summary file. It is mainly categorical count data. Currently I am working with about 40 tables covering about 35 variables, mostly two-way tables, with some three-way and a handful of four-way tables. I am going to be doing a lot of analysis on these tables, and I hope to make them available in zipped form to other R users. Right now I keep this data in single-state data frames, but I will probably have to shift over to a database if I add many more variables.
Here is my problem: of my 35 variables, five are different versions of age. Different tables cover different age ranges and have different levels of disaggregation within the ranges they cover. Currently I just have a factor for each, with the cut-points in the labels. But I feel uncomfortable with this. It seems to throw away a lot of information. There is a “natural” mapping from the different age ranges to one another, at least within universes (e.g. individuals vs. heads of household), and my current approach does not encode that mapping in any way that R can notice (unless I write special functions that parse the labels). A toy sketch of the kind of mapping I mean is in the first postscript below.

One of the first things I am doing with this data is using all the cross-tabs to produce some basic estimates of higher-dimensional tabulations – some 10-way tables covering age, race, sex, rent/own, income, etc. that are consistent with all the lower-dimensional margins – using a multi-dimensional analogue of the RAS balancing (biproportional matrix balancing) algorithm often used to update Leontief input-output tables (a stripped-down sketch of that fitting step is in the second postscript). Right now my approach is to sum the age variables into four common categories, which lets me use four of my five age variables, and to throw away the fifth (which has inconsistent breakpoints and is used in only one table). But this seems wasteful to me – not only of one table, but of a lot of information on finer age sub-structure which is shared by two or more tables.

I am guessing that this is a fairly common problem in dealing with large data sets of count objects. Is there a “standard” approach to it, or a set of commonly used approaches, that anyone could suggest or point me to? I’d be happy with either coding suggestions or pointers to the methodology literature, if there is one. Any help or suggestions would be greatly appreciated. Thanks!

andrewH
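P.S. To make the question concrete, here is a minimal sketch of the “natural” mapping I mean, with made-up cut-points (the real ones differ by table) and a hypothetical crosswalk data frame, fine2coarse, that records which fine age category rolls up into which coarse one:

## Two age factors with different cut-points; the fine breaks nest
## inside the coarse ones. All numbers here are invented.
fine.breaks   <- c(0, 18, 25, 35, 45, 55, 65, Inf)
coarse.breaks <- c(0, 18, 45, 65, Inf)

ages <- c(3, 17, 22, 30, 41, 50, 62, 80)            # toy data
age.fine   <- cut(ages, fine.breaks,   right = FALSE)
age.coarse <- cut(ages, coarse.breaks, right = FALSE)

## Crosswalk from fine levels to coarse levels, built once and stored,
## so the mapping is data R can use rather than text buried in labels.
fine2coarse <- data.frame(
  fine   = levels(age.fine),
  coarse = levels(age.coarse)[
    findInterval(fine.breaks[-length(fine.breaks)], coarse.breaks)]
)

## Collapsing a count table from fine to coarse categories:
fine.counts <- table(age.fine)
coarse.id <- factor(
  fine2coarse$coarse[match(names(fine.counts), fine2coarse$fine)],
  levels = levels(age.coarse))
coarse.counts <- tapply(fine.counts, coarse.id, sum)

What makes me uneasy is that in my current setup only the labels carry this information, whereas a stored crosswalk like this would let any function aggregate consistently.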
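P.P.S. In case it clarifies the fitting step: here is a stripped-down sketch of the multi-dimensional RAS / iterative proportional fitting I described, on a made-up 2x2x2 seed array with three mutually consistent two-way target margins. The ipf() function and all the numbers are invented for illustration; my real code works on much larger arrays.

## Minimal iterative proportional fitting (IPF): rescale a seed array
## until its margins match a set of target margins. `margins` lists
## which dimensions each target constrains.
ipf <- function(seed, margins, targets, tol = 1e-8, maxit = 1000) {
  fit <- seed
  for (it in seq_len(maxit)) {
    old <- fit
    for (k in seq_along(margins)) {
      cur <- apply(fit, margins[[k]], sum)        # current margin
      adj <- targets[[k]] / cur                   # scaling factors
      adj[!is.finite(adj)] <- 0                   # guard zero cells
      fit <- sweep(fit, margins[[k]], adj, "*")   # rescale the cells
    }
    if (max(abs(fit - old)) < tol) break
  }
  fit
}

seed <- array(1, dim = c(2, 2, 2))                # uninformative seed
margins <- list(c(1, 2), c(1, 3), c(2, 3))        # the known 2-way tables
targets <- list(matrix(c(30, 20, 10, 40), 2),     # made-up counts,
                matrix(c(25, 25, 15, 35), 2),     # consistent in their
                matrix(c(30, 20, 20, 30), 2))     # shared 1-way margins
fit <- ipf(seed, margins, targets)
apply(fit, c(1, 2), sum)                          # ~ targets[[1]] at convergence

(I know loglin() in the stats package does this kind of fitting; the sketch is just to show the structure of the margin constraints my age problem feeds into.)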