Hi Mark,

Why are you using factors? I think for this case you might find characters are faster and more space-efficient.
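To make that concrete, here is a small sketch (not a benchmark) on a scaled-down stand-in for the data in this thread. The sizes and the names n_groups, as_factors and as_chars are made up for illustration. It compares the reported size of the split when the columns are factors versus characters. Note that object.size() does not account for sharing between list elements, so treat the ratio as indicative only.

```r
# Scaled-down stand-in for the thread's data: 7 columns of long strings
# plus a grouping vector. All sizes here are illustrative assumptions.
n_groups <- 200
n_rows   <- n_groups * 10
strings  <- paste("The rain in Spain", seq_len(n_groups), sep = ".")
mat      <- matrix(strings, ncol = 7, nrow = n_rows)
group    <- rep(paste("Rainy days and Mondays", seq_len(n_groups), sep = "."),
                length.out = n_rows)

# Same data twice: once as factor columns, once as character columns.
# stringsAsFactors is set explicitly because its default differs by R version.
as_factors <- data.frame(mat, stringsAsFactors = TRUE)
as_chars   <- data.frame(mat, stringsAsFactors = FALSE)

split_factors <- split(as_factors, group)
split_chars   <- split(as_chars, group)

# Each piece of the factor version keeps the full set of levels,
# even though it only holds a handful of rows.
levels_kept <- nlevels(split_factors[[1]]$X1)

size_factors <- as.numeric(object.size(split_factors))
size_chars   <- as.numeric(object.size(split_chars))

# How many times larger the factor-based split is reported to be.
size_factors / size_chars
```

With factor columns every piece drags along all of the levels, which is why the reported size grows with the number of pieces times the number of levels. Character columns have no such per-piece attribute.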
Alternatively, you can have a look at the plyr package, which uses some tricks to keep memory usage down.

Hadley

On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
> Charles, I suspect you are correct regarding copying of the attributes.
> First off, selectSubAct.df is my "real" data, which turns out to be of the
> same dim() as myDataFrame below, but each column is made up of strings, not
> simple letters, and there are many levels in each column, which I did not
> properly duplicate in my first example. I have amended that below and with
> the split the new object size is now not 10X the size of the original, but
> 100X. My "real" data is even more complex than this, so I suspect that is
> where the problem lies. I need to search for a better solution to my problem
> than split, for which I will start a separate thread if I can't figure
> something out.
>
> Thanks for pointing me in the right direction,
>
> Mark
>
> myDataFrame <- data.frame(matrix(paste("The rain in Spain",
> as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
> mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400),
> sep = "."))
> myDataFrame <- cbind(myDataFrame, mySplitVar)
> object.size(myDataFrame)
> ## 12860880 bytes # ~ 13MB
> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> object.size(myDataFrame.split)
> ## 1,274,929,792 bytes ~ 1.2GB
> object.size(selectSubAct.df)
> ## 52,348,272 bytes # ~ 52MB
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please
>
> On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
>
>> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>>
>>> I'm having trouble using split on a very large data-set with ~1400 levels of
>>> the factor to be split.
>>> Unfortunately, I can't reproduce it with the simple self-contained example
>>> below. As you can see, splitting the artificial dataframe of size ~13MB
>>> results in a split dataframe of ~144MB, an increase in memory allocation
>>> of ~10 fold for the split object. If split scales linearly, then my actual
>>> 52MB dataframe should be easily handled by my 12GB of RAM, but it is not.
>>> Instead, when I try to split selectSubAct.df on one of its factors with
>>> 1473 levels, my memory is slowly gobbled up (plus 3 GB of swap) until I
>>> cancel the operation.
>>>
>>> Any ideas on what might be happening? Thanks, Mark
>>
>> Each element of myDataFrame.split contains a copy of the attributes of the
>> parent data.frame.
>>
>> And probably it does scale linearly. But the scaling factor depends on the
>> size of the attributes that get copied, I guess.
>>
>>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>>> mySplitVar <- factor(as.character(1:1400))
>>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> object.size(myDataFrame)
>>> ## 12860880 bytes # ~ 13MB
>>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> object.size(myDataFrame.split)
>>> ## 144524992 bytes # ~ 144MB
>>
>> Note:
>>
>> only.attr <- lapply(myDataFrame.split,function(x) sapply(x,attributes))
>> (object.size(myDataFrame.split)-object.size(myDataFrame))/object.size(only.attr)
>> 1.03726179240978 bytes
>>
>>> object.size(selectSubAct.df)
>>> ## 52,348,272 bytes # ~ 52MB
>>
>> What was this??
>>
>> Chuck
>>
>>> sessionInfo()
>>> R version 2.10.0 Patched (2009-10-27 r50222)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices datasets utils methods base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.10.0
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
--
http://had.co.nz/
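Charles's explanation earlier in the thread, that each element of the split carries a copy of the parent's attributes, can be checked at small scale. The sketch below is an illustration under assumptions, not anyone's posted code: the data are a scaled-down stand-in, the names n_groups, pieces, only_attr and trimmed are hypothetical, and lapply() is used where Charles used sapply(). The last step uses droplevels(), which was not mentioned in the thread and was only added in a later R release than the 2.10.0 reported above.

```r
# Scaled-down stand-in: 7 factor columns of long strings plus a split factor.
n_groups <- 100
n_rows   <- n_groups * 10
strings  <- paste("The rain in Spain", seq_len(n_groups), sep = ".")
df <- data.frame(matrix(strings, ncol = 7, nrow = n_rows),
                 stringsAsFactors = TRUE)
df$mySplitVar <- factor(rep(paste("Rainy days and Mondays",
                                  seq_len(n_groups), sep = "."),
                            length.out = n_rows))

pieces <- split(df, df$mySplitVar)

# Collect only the attributes (levels and class) of every column in every piece.
only_attr <- lapply(pieces, function(x) lapply(x, attributes))

# Share of the growth in reported size that the copied attributes account for.
extra <- as.numeric(object.size(pieces)) - as.numeric(object.size(df))
ratio <- extra / as.numeric(object.size(only_attr))
ratio

# Dropping unused levels in each piece removes most of the copied attributes.
trimmed <- lapply(pieces, droplevels)
as.numeric(object.size(trimmed)) / as.numeric(object.size(pieces))
```

If the ratio comes out close to 1, nearly all of the extra size is the repeated levels, which matches the figure Charles reported. Whether dropping levels is acceptable depends on whether later code expects every piece to share the same level set.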