Jim, could you provide a code snippet to illustrate what you mean? Hadley, good point, I did not know that.

Mark

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


On Tue, Dec 8, 2009 at 11:00 PM, jim holtman <jholt...@gmail.com> wrote:

> Also instead of 'splitting' the data frame, I split the indices and then
> use those to access the information in the original dataframe.
>
>
> On Tue, Dec 8, 2009 at 9:54 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>
>> Hadley, Just as you were apparently writing I had the same thought and did
>> exactly what you suggested, converting all columns except the one that I
>> want split to character. Executed almost instantaneously without problem.
>> Thanks! Mark
>>
>> On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com>
>> wrote:
>>
>> > Hi Mark,
>> >
>> > Why are you using factors? I think for this case you might find
>> > characters are faster and more space efficient.
>> >
>> > Alternatively, you can have a look at the plyr package which uses some
>> > tricks to keep memory usage down.
>> >
>> > Hadley
>> >
>> > On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>> > > Charles, I suspect you are correct regarding copying of the attributes.
>> > > First off, selectSubAct.df is my "real" data, which turns out to be of the
>> > > same dim() as myDataFrame below, but each column is made up of strings, not
>> > > simple letters, and there are many levels in each column, which I did not
>> > > properly duplicate in my first example. I have amended that below and with
>> > > the split the new object size is now not 10X the size of the original, but
>> > > 100X.
>> > > My "real" data is even more complex than this, so I suspect that is
>> > > where the problem lies. I need to search for a better solution to my problem
>> > > than split, for which I will start a separate thread if I can't figure
>> > > something out.
>> > >
>> > > Thanks for pointing me in the right direction,
>> > >
>> > > Mark
>> > >
>> > > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
>> > >     as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
>> > > mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400),
>> > >     sep = "."))
>> > > myDataFrame <- cbind(myDataFrame, mySplitVar)
>> > > object.size(myDataFrame)
>> > > ## 12860880 bytes # ~ 13MB
>> > > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> > > object.size(myDataFrame.split)
>> > > ## 1,274,929,792 bytes ~ 1.2GB
>> > > object.size(selectSubAct.df)
>> > > ## 52,348,272 bytes # ~ 52MB
>> > >
>> > > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
>> > >
>> > >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>> > >>
>> > >>> I'm having trouble using split on a very large data-set with ~1400 levels of
>> > >>> the factor to be split. Unfortunately, I can't reproduce it with the simple
>> > >>> self-contained example below. As you can see, splitting the artificial
>> > >>> dataframe of size ~13MB results in a split dataframe of ~ 144MB, with an
>> > >>> increased memory allocation of ~10 fold for the split object. If split scales
>> > >>> linearly, then my actual 52MB dataframe should be easily handled by my 12GB
>> > >>> of RAM, but it is not.
>> > >>> Instead, when I try to split selectSubAct.df on one
>> > >>> of its factors with 1473 levels, my memory is slowly gobbled up (plus 3 GB
>> > >>> of swap) until I cancel the operation.
>> > >>>
>> > >>> Any ideas on what might be happening? Thanks, Mark
>> > >>>
>> > >>
>> > >> Each element of myDataFrame.split contains a copy of the attributes of the
>> > >> parent data.frame.
>> > >>
>> > >> And probably it does scale linearly. But the scaling factor depends on the
>> > >> size of the attributes that get copied, I guess.
>> > >>
>> > >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>> > >>> mySplitVar <- factor(as.character(1:1400))
>> > >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>> > >>> object.size(myDataFrame)
>> > >>> ## 12860880 bytes # ~ 13MB
>> > >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> > >>> object.size(myDataFrame.split)
>> > >>> ## 144524992 bytes # ~ 144MB
>> > >>>
>> > >>
>> > >> Note:
>> > >>
>> > >> only.attr <- lapply(myDataFrame.split,function(x) sapply(x,attributes))
>> > >> (object.size(myDataFrame.split)-object.size(myDataFrame))/object.size(only.attr)
>> > >> 1.03726179240978 bytes
>> > >>
>> > >>> object.size(selectSubAct.df)
>> > >>> ## 52,348,272 bytes # ~ 52MB
>> > >>>
>> > >>
>> > >> What was this??
>> > >>
>> > >> Chuck
>> > >>
>> > >>> sessionInfo()
>> > >>> R version 2.10.0 Patched (2009-10-27 r50222)
>> > >>> x86_64-unknown-linux-gnu
>> > >>>
>> > >>> locale:
>> > >>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> > >>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> > >>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> > >>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> > >>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> > >>>
>> > >>> attached base packages:
>> > >>> [1] stats     graphics  grDevices datasets  utils     methods   base
>> > >>>
>> > >>> loaded via a namespace (and not attached):
>> > >>> [1] tools_2.10.0
>> > >>>
>> > >>> [[alternative HTML version deleted]]
>> > >>>
>> > >>> ______________________________________________
>> > >>> R-help@r-project.org mailing list
>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > >>> PLEASE do read the posting guide
>> > >>> http://www.R-project.org/posting-guide.html
>> > >>> and provide commented, minimal, self-contained, reproducible code.
>> > >>>
>> > >> Charles C. Berry                            (858) 534-2098
>> > >>                                             Dept of Family/Preventive Medicine
>> > >> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
>> > >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>> > >>
>> >
>> > --
>> > http://had.co.nz/
>> >
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
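
For anyone reading the thread later: the idea Jim describes (split the row indices, then use them to subset the original data frame on demand) might look roughly like the sketch below. This is not code posted by Jim. The data frame `df` and its columns `x`, `y` and `grp` are made up for illustration, the data are much smaller than Mark's so the example runs quickly, and `stringsAsFactors = TRUE` is set explicitly on the assumption that the original data held factor columns.

```r
# Hypothetical stand-in for the data in the thread (names and sizes are
# assumptions, chosen only so the example runs in well under a second).
set.seed(1)
n_rows   <- 5000
n_groups <- 50
df <- data.frame(
  x   = paste("The rain in Spain", seq_len(n_rows), sep = "."),
  y   = rnorm(n_rows),
  grp = paste("Rainy days and Mondays",
              rep(seq_len(n_groups), length.out = n_rows), sep = "."),
  stringsAsFactors = TRUE
)

# Approach from the original post: split the data frame itself.
# Each piece keeps the full set of levels of every factor column.
df_split <- split(df, df$grp)

# Index-based alternative: split the row numbers, not the rows.
idx_split <- split(seq_len(nrow(df)), df$grp)

# Materialise one group's rows only when they are actually needed.
one_group <- df[idx_split[[1]], ]

# Per-group summary computed directly against the original data frame.
group_means <- sapply(idx_split, function(i) mean(df$y[i]))

# Compare what object.size() reports for the two lists.
print(object.size(df_split), units = "Kb")
print(object.size(idx_split), units = "Kb")
```

The list of indices holds only integer vectors, so its reported size does not grow with the number of factor levels in the data, whereas every element of `df_split` carries its own copy of the levels, which is the attribute copying Charles pointed to.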