I am after R package recommendations. I have a data frame with ~5 million rows and ~50 columns. (I could do what I want with a sample of the rows, but ideally I would use all the rows.)

(1) I want to recursively partition the rows of the data frame in a way that I manually specify. That is, I want to generate a tree structure such that each node of the tree represents a subset of the rows of the data frame, and the child nodes of any parent node represent a partition of the rows represented by the parent node. This is the sort of thing that tree induction algorithms like CART and ID3 do, but I want to specify the tree structure manually rather than have an algorithm decide it for me.

(2) I want the means of specifying the tree structure to be as simple as possible, because the users will be trying out different tree structures.

(3) Each node (internal or terminal) of the tree represents a row subset of the root data frame. I want to be able to specify a function, applied to each node, that takes the node's data frame as input and calculates a set of summary statistics. I will probably write this node summary function as a dplyr pipeline. I want to associate the summaries with the nodes so that I can keep track of the summaries in terms of the tree structure.

(4) I want to be able to print and plot the tree of summaries in a way that shows the summaries in the context of the tree structure. Inevitably, there will be fiddling with the formatting of the prints and plots, so I expect I will need user-definable print/plot formatting functions that are applied to each node of the tree.

What I am looking for is an R package that provides the best starting point for me to implement this. I am not a particularly good programmer, so a package that minimises what I have to write is important to me.
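
To make requirements (1) to (4) concrete, here is a rough base R sketch of the kind of thing I mean. It is only an illustration under assumptions: the example columns (region, age, spend), the nested-list format for the tree spec, and the function names build_tree, summarise_node, print_tree and format_node are all made up for this example and do not come from partykit or data.tree.

```r
# Hypothetical example data (the real data frame has ~5 million rows).
set.seed(1)
df <- data.frame(
  region = sample(c("north", "south"), 1000, replace = TRUE),
  age    = sample(18:90, 1000, replace = TRUE),
  spend  = runif(1000, 0, 100)
)

# (2) Tree spec as a nested list (an assumed format): each node has a
# split function returning one label per row, plus optional child specs
# keyed by label. Labels without a child spec become terminal nodes.
spec <- list(
  split = function(d) d$region,
  children = list(
    north = list(split = function(d) ifelse(d$age < 40, "young", "old"))
  )
)

# (3) Node summary function; a dplyr pipeline could be used here instead.
summarise_node <- function(d) {
  data.frame(n = nrow(d), mean_spend = mean(d$spend))
}

# (1) Recursively partition the rows and attach a summary to every node.
build_tree <- function(d, spec = NULL, name = "root", summary_fun) {
  node <- list(name = name, summary = summary_fun(d), children = list())
  if (!is.null(spec$split)) {
    parts <- split(d, spec$split(d))  # one data frame per label
    node$children <- lapply(names(parts), function(nm) {
      build_tree(parts[[nm]], spec$children[[nm]], nm, summary_fun)
    })
  }
  node
}

# (4) Print the tree, indenting by depth and using a user-supplied formatter.
print_tree <- function(node, format_fun, depth = 0) {
  cat(strrep("  ", depth), node$name, ": ",
      format_fun(node$summary), "\n", sep = "")
  for (child in node$children) print_tree(child, format_fun, depth + 1)
  invisible(node)
}

format_node <- function(s) {
  sprintf("n = %d, mean spend = %.1f", s$n, s$mean_spend)
}

tree <- build_tree(df, spec, summary_fun = summarise_node)
print_tree(tree, format_node)
```

Note that split() copies each subset, so on the full data it might be better to carry row indices down the tree rather than data frames. What I am hoping for is a package that already provides the tree building, printing and plotting parts, so that I only need to write the spec, the summary function and the formatter.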

So far, the most likely packages appear to be:

- partykit <http://partykit.r-forge.r-project.org/partykit/>
- data.tree <https://github.com/gluc/data.tree>

I would appreciate any recommendations for R packages that would serve as a good base; any comments on the relative merits of the packages for my purposes; and any pointers to example code of people doing similar things.

Thanks

Ross

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.