Charles, I suspect you are correct regarding copying of the attributes. First off, selectSubAct.df is my "real" data, which turns out to be of the same dim() as myDataFrame below, but each column is made up of strings, not simple letters, and there are many levels in each column, which I did not properly duplicate in my first example. I have amended the example below, and with the split the new object is now not 10X the size of the original, but 100X. My "real" data is even more complex than this, so I suspect that is where the problem lies. I need to find a better solution to my problem than split, for which I will start a separate thread if I can't figure something out.
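One possible direction, offered only as a minimal sketch and not tested on the real data: split the row indices instead of the data frame itself, and rebuild a piece only when it is needed. The idea rests on the point above that every piece returned by split() carries the full levels of each factor column, whereas a list of integer index vectors carries none. The names df, pieces, idx and n_levels are illustrative stand-ins, and the data is a small artificial example.

```r
# Small artificial data frame whose factor columns have many long levels
n_levels <- 200
df <- data.frame(
  x = factor(paste("The rain in Spain", rep(1:n_levels, each = 10), sep = ".")),
  g = factor(paste("Rainy days and Mondays", rep(1:n_levels, each = 10), sep = "."))
)

# Splitting the data frame: every piece keeps all levels of every factor column
pieces <- split(df, df$g)

# Splitting the row indices: only integer vectors are stored
idx <- split(seq_len(nrow(df)), df$g)

# A single piece can be rebuilt on demand from the parent data frame
piece1 <- df[idx[[1]], , drop = FALSE]

# Compare the footprint of the two approaches
object.size(pieces)
object.size(idx)
```

Whether this helps depends on how the pieces are used afterwards; if all pieces must exist in memory at once, the copied levels come back as soon as each piece is materialised.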
Thanks for pointing me in the right direction,
Mark

myDataFrame <- data.frame(matrix(paste("The rain in Spain", as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400), sep = "."))
myDataFrame <- cbind(myDataFrame, mySplitVar)
object.size(myDataFrame)
## 12860880 bytes # ~ 13MB
myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
object.size(myDataFrame.split)
## 1,274,929,792 bytes ~ 1.2GB
object.size(selectSubAct.df)
## 52,348,272 bytes # ~ 52MB

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please

On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:

> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>
>> I'm having trouble using split on a very large data-set with ~1400 levels
>> of the factor to be split. Unfortunately, I can't reproduce it with the
>> simple self-contained example below. As you can see, splitting the
>> artificial dataframe of size ~13MB results in a split dataframe of
>> ~144MB, with an increased memory allocation of ~10 fold for the split
>> object. If split scales linearly, then my actual 52MB dataframe should be
>> easily handled by my 12GB of RAM, but it is not. Instead, when I try to
>> split selectSubAct.df on one of its factors with 1473 levels, my memory
>> is slowly gobbled up (plus 3 GB of swap) until I cancel the operation.
>>
>> Any ideas on what might be happening? Thanks, Mark
>
> Each element of myDataFrame.split contains a copy of the attributes of the
> parent data.frame.
>
> And probably it does scale linearly. But the scaling factor depends on the
> size of the attributes that get copied, I guess.
>
>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>> mySplitVar <- factor(as.character(1:1400))
>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>> object.size(myDataFrame)
>> ## 12860880 bytes # ~ 13MB
>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> object.size(myDataFrame.split)
>> ## 144524992 bytes # ~ 144MB
>
> Note:
>
> > only.attr <- lapply(myDataFrame.split,function(x) sapply(x,attributes))
> > (object.size(myDataFrame.split)-object.size(myDataFrame))/object.size(only.attr)
> 1.03726179240978 bytes
>
>> object.size(selectSubAct.df)
>> ## 52,348,272 bytes # ~ 52MB
>
> What was this??
>
> Chuck
>
>> > sessionInfo()
>> R version 2.10.0 Patched (2009-10-27 r50222)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices datasets utils methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.10.0
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901