Note: this post was motivated more by the "hierarchical data" subject than by Douglas Bates' aside, but it might be of interest to its respondents.
On Friday, 05 February 2010 at 21:56 +0100, Peter Dalgaard wrote:

> Peter Ehlers wrote:
> > I vote to 'fortunize' Doug Bates on
> >
> > Hierarchical data sets: which software to use?
> >
> > "The widespread use of spreadsheets or SPSS data sets or SAS data sets
> > which encourage the "single table with a gargantuan number of columns,
> > most of which are missing data in most cases" approach to organization
> > of longitudinal data is regrettable."
> >
> > http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430
>
> Hmm, well, it's not like "long format" data frames (which I actually
> think are more common in connection with SAS's PROC MIXED) are much
> better. Those tend to replicate base data unnecessarily - "as if rats
> change sex with millisecond resolution".

[ Note to Achim Zeileis: the "rats changing sex with millisecond resolution" quote is well worth a nomination to "fortune" fame; it seems it is not one already... ]

> The correct data structure
> would be a relational database with multiple levels of tables, but, to
> my knowledge, no statistical software, including R, is prepared to deal
> with data in that form.

Well, I can think of two exceptions:

- BUGS, in its various incarnations (WinBUGS, OpenBUGS, JAGS), does not require its data to come from a single source. For example, when programming a hierarchical model (a. k. a. mixed-effect model), individual-level variables may come from one source and the various group-level variables from others. Quite handy: no prior merge() required. Now, writing (and debugging!) such models in BUGS is another story...

- SAS has had the concept of a "data view" for a long time, its most useful incarnation being a "data view" of an SQL view. Again, this avoids the need to actually merge the datasets (which, AFAICR, is a serious pain in the @$$ in SAS (maybe that's the *real* etymology of the name?)).

This problem has bugged me for a while.
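To make the "relational database with multiple levels of tables" idea concrete, here is a minimal sketch using Python's standard sqlite3 module (table and column names are mine, invented for illustration): subject-level data is stored once, observation-level data lives in its own table, and the "long format" exists only as an SQL view computed at query time, so no base data is replicated and no merge() is needed.

```python
import sqlite3

# Hypothetical two-level design: rats (level 2) and their repeated
# measurements (level 1). The long-format table is a *view*, so the
# sex of each rat is stored exactly once.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rats (                -- one row per rat
        rat_id INTEGER PRIMARY KEY,
        sex    TEXT
    );
    CREATE TABLE measurements (        -- one row per observation
        rat_id INTEGER REFERENCES rats(rat_id),
        time   REAL,
        weight REAL
    );
    CREATE VIEW long_format AS         -- joined "just at the right time"
        SELECT m.rat_id, r.sex, m.time, m.weight
        FROM measurements m JOIN rats r USING (rat_id);
""")
con.execute("INSERT INTO rats VALUES (1, 'F'), (2, 'M')")
con.executemany("INSERT INTO measurements VALUES (?, ?, ?)",
                [(1, 0.0, 210.5), (1, 1.0, 215.0), (2, 0.0, 198.2)])
for row in con.execute("SELECT * FROM long_format ORDER BY rat_id, time"):
    print(row)
```

Querying the view yields the familiar long-format rows, but the rats' sexes never get replicated "with millisecond resolution" on disk.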
I think that the concept of a "data view" is right (after all, that's one of the core concepts of SQL for a reason...), but that implementing it *cleanly* in R is probably hard work. Using a DBMS to maintain tables and views and querying them "just at the right time" does help, but the ability to use these DBMS data without importing them into R is, AFAIK, currently lacking.

Once upon a time, a very old version of RPgSQL (a Bioconductor package) aimed at such a representation: it created objects inheriting from data.frame to represent Postgres-based data, allowing these data to be used "transparently". This package dropped into oblivion when its creator and sole maintainer became unable to maintain it further. As far as I understand it, the DBI specification *might* allow the creation of such objects, but I am not aware of any driver actually implementing that.

In fact, there are two elements of a solution to this problem:

a) creation of (abstract) objects representing data collections as data frames, with the same properties, but not requiring the creation of an actual data frame. As far as my (very poor) object-oriented knowledge goes, these objects should, in C++/Python parlance, inherit from data.frame.

b) creation of objects implementing various realizations of the objects described in a): DBMS querying, actual data.frame querying (here I'm thinking of sqldf, which works in the opposite direction, allowing R data frames to be queried in SQL. Quite handy...), etc.

I tried my hand once at building such a representation (for DBMS-deposited data), with partial success (read-only was OK, read-write was seriously buggy). But my S3 object-oriented code stinks, my Python is pytiful and, as a public health measure, I won't even try to qualify my C++... So I leave implementation to better programmers as an exercise (a term project, or even a master's thesis subject, is probably closer to the truth...).
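A read-only sketch of what I mean by a) and b), in Python rather than R/S3 (all names here, `DBView` included, are hypothetical, and this is not an existing API): an object that *behaves like* a data frame (columns, length, head()) but holds no data itself, translating every access into an SQL query against the backing table.

```python
import sqlite3

class DBView:
    """A data-frame-like proxy over a DBMS table: no data is held in
    memory; every access triggers a query (read-only, for brevity)."""

    def __init__(self, con, table):
        self.con, self.table = con, table
        cur = con.execute(f"SELECT * FROM {table} LIMIT 0")
        self.columns = [d[0] for d in cur.description]

    def __len__(self):                     # row count, queried lazily
        return self.con.execute(
            f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

    def __getitem__(self, col):            # one column, fetched on demand
        if col not in self.columns:
            raise KeyError(col)
        return [r[0] for r in self.con.execute(
            f"SELECT {col} FROM {self.table}")]

    def head(self, n=6):                   # first n rows, as in R's head()
        return self.con.execute(
            f"SELECT * FROM {self.table} LIMIT ?", (n,)).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE iris (species TEXT, petal_len REAL)")
con.executemany("INSERT INTO iris VALUES (?, ?)",
                [("setosa", 1.4), ("setosa", 1.3), ("virginica", 6.0)])

v = DBView(con, "iris")
print(len(v), v.columns, v["petal_len"])
# -> 3 ['species', 'petal_len'] [1.4, 1.3, 6.0]
```

The hard part, of course, is what this toy omits: read-write semantics, row subsetting, and making the rest of the system accept such an object wherever a real data frame is expected.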
A third, much larger (implementation) element is lacking in this picture: the algorithms used on these data. SAS is notoriously good (in some simple cases, such as ordinary regression) at handling datasets larger than available memory, because the algorithms were written with punched cards (maybe even paper tape) in mind: *one* *sequential* read of the data was the only *practical* way to go, back in those days. So all the matrices and vectors necessary for the computation (notionally, X'X and X'Y) were built in memory in *one* pass. Such an organization is probably impossible with most "modern" algorithms: see Douglas Bates' description of the lmer() algorithms for a nice, big counter-example, or consider MCMC... But coming closer to such an organization *seems* possible: see for example biglm.

So I think that data views are a worthy but not-so-easy possible goal aimed at various data structure problems (including hierarchical data), but not *the* solution to the data-representation problem in R.

Any thoughts?

Emmanuel Charpentier

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
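P.S. (an illustration added in editing, not part of the original message): the one-pass X'X / X'Y organization described above can be sketched in a few lines of Python. This is emphatically *not* how SAS or biglm is implemented; it merely shows the principle, for a model with an intercept and a single predictor, that the sufficient statistics for ordinary regression can be accumulated over sequential chunks so the full dataset never needs to fit in memory.

```python
def chunks(stream, size):
    """Yield successive lists of at most `size` (x, y) pairs."""
    buf = []
    for row in stream:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def one_pass_ols(stream, chunk_size=2):
    # Accumulators for X'X (2x2, symmetric) and X'Y (2x1),
    # with design row [1, x] for each observation.
    n = sx = sxx = sy = sxy = 0.0
    for chunk in chunks(stream, chunk_size):
        for x, y in chunk:            # each observation read exactly once
            n += 1; sx += x; sxx += x * x
            sy += y; sxy += x * y
    # Solve the 2x2 normal equations (X'X) b = X'Y by Cramer's rule.
    det = n * sxx - sx * sx
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Data generated by y = 2 + 3x exactly, so OLS recovers it:
data = [(0.0, 2.0), (1.0, 5.0), (2.0, 8.0), (3.0, 11.0)]
print(one_pass_ols(iter(data)))       # -> (2.0, 3.0)
```

Replace the in-memory list with a cursor over a DBMS view and the circle closes: the data view supplies the rows, the one-pass algorithm consumes them.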