Here is an example:
> # create test data
> N <- 1000000
> x <- data.frame(a=sample(LETTERS, N, TRUE), b=sample(letters, N, TRUE),
+                 c=as.numeric(1:N), d=runif(N))
> system.time({
+     x.df <- split(x, x$a)  # split
+     print(sapply(x.df, function(a) sum(a$c)))
+ })
          A           B           C           D           E           F           G           H
19132375146 19261600080 19290064552 19355472666 19143448231 18973627622 19278423676 19362576931
          I           J           K           L           M           N           O           P
19405443596 19295695044 19052377988 19236047192 19143226220 19197703946 19297192525 19129252399
          Q           R           S           T           U           V           W           X
19272964991 19315856972 19355660155 19303178409 19242322477 19081573240 19309444512 19077003863
          Y           Z
19259313705 19228653862
   user  system elapsed
   1.27    0.02    1.28
> # now use indices
> system.time({
+     x.indx <- split(seq(nrow(x)), x$a)  # create list of indices
+     print(sapply(x.indx, function(a) sum(x$c[a])))
+ })
          A           B           C           D           E           F           G           H
19132375146 19261600080 19290064552 19355472666 19143448231 18973627622 19278423676 19362576931
          I           J           K           L           M           N           O           P
19405443596 19295695044 19052377988 19236047192 19143226220 19197703946 19297192525 19129252399
          Q           R           S           T           U           V           W           X
19272964991 19315856972 19355660155 19303178409 19242322477 19081573240 19309444512 19077003863
          Y           Z
19259313705 19228653862
   user  system elapsed
   0.23    0.00    0.23
>

On Tue, Dec 8, 2009 at 10:26 PM, Mark Kimpel <mwkim...@gmail.com> wrote:

> Jim, could you provide a code snippet to illustrate what you mean?
>
> Hadley, good point, I did not know that.
>
> Mark
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please
>
>
> On Tue, Dec 8, 2009 at 11:00 PM, jim holtman <jholt...@gmail.com> wrote:
>
>> Also instead of 'splitting' the data frame, I split the indices and then
>> use those to access the information in the original dataframe.
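
The index approach quoted just above can also be looked at from the memory side. What follows is only a minimal sketch on made-up data: the names df, g, s and v are hypothetical and do not come from Mark's data set. It assumes, as suggested elsewhere in this thread, that the size of a split data frame is driven by every piece carrying the full set of factor levels, whereas a list of indices holds only integer vectors. Sizes are whatever object.size() reports on your machine, so no numbers are shown here.

```r
# Hypothetical data: a split factor 'g', a factor 's' with many levels,
# and a numeric column 'v'. None of these names come from the thread.
set.seed(1)
n <- 20000
df <- data.frame(
  g = factor(sample(sprintf("group%03d", 1:200), n, TRUE)),
  s = factor(sample(sprintf("label%05d", 1:5000), n, TRUE)),
  v = runif(n)
)

pieces  <- split(df, df$g)                  # list of data frames
indices <- split(seq_len(nrow(df)), df$g)   # list of integer vectors

# each piece keeps every level of 's', whether it uses them or not
nlevels(df$s)
nlevels(pieces[[1]]$s)

# compare the reported sizes of the two lists
object.size(pieces)
object.size(indices)

# both routes give the same per-group sums
sums.df  <- sapply(pieces,  function(p) sum(p$v))
sums.idx <- sapply(indices, function(i) sum(df$v[i]))
all.equal(sums.df, sums.idx)
```

The index list never duplicates any column or attribute of df; each group is materialised only for the moment it is needed, via df$v[i].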
>>
>>
>> On Tue, Dec 8, 2009 at 9:54 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>>
>>> Hadley, just as you were apparently writing, I had the same thought and
>>> did exactly what you suggested, converting all columns except the one
>>> that I want split to character. It executed almost instantaneously
>>> without problem. Thanks! Mark
>>>
>>>
>>> On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com> wrote:
>>>
>>> > Hi Mark,
>>> >
>>> > Why are you using factors? I think for this case you might find
>>> > characters are faster and more space efficient.
>>> >
>>> > Alternatively, you can have a look at the plyr package which uses some
>>> > tricks to keep memory usage down.
>>> >
>>> > Hadley
>>> >
>>> > On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>>> > > Charles, I suspect you are correct regarding copying of the attributes.
>>> > > First off, selectSubAct.df is my "real" data, which turns out to be of
>>> > > the same dim() as myDataFrame below, but each column is made up of
>>> > > strings, not simple letters, and there are many levels in each column,
>>> > > which I did not properly duplicate in my first example. I have amended
>>> > > that below, and with the split the new object size is now not 10X the
>>> > > size of the original, but 100X. My "real" data is even more complex
>>> > > than this, so I suspect that is where the problem lies. I need to
>>> > > search for a better solution to my problem than split, for which I
>>> > > will start a separate thread if I can't figure something out.
>>> > >
>>> > > Thanks for pointing me in the right direction,
>>> > >
>>> > > Mark
>>> > >
>>> > > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
>>> > >     as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
>>> > > mySplitVar <- factor(paste("Rainy days and Mondays",
>>> > >     as.character(1:1400), sep = "."))
>>> > > myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> > > object.size(myDataFrame)
>>> > > ## 12860880 bytes # ~ 13MB
>>> > > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> > > object.size(myDataFrame.split)
>>> > > ## 1,274,929,792 bytes ~ 1.2GB
>>> > > object.size(selectSubAct.df)
>>> > > ## 52,348,272 bytes # ~ 52MB
>>> > >
>>> > >
>>> > > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
>>> > >
>>> > >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>>> > >>
>>> > >>> I'm having trouble using split on a very large data-set with ~1400
>>> > >>> levels of the factor to be split. Unfortunately, I can't reproduce it
>>> > >>> with the simple self-contained example below. As you can see,
>>> > >>> splitting the artificial dataframe of size ~13MB results in a split
>>> > >>> dataframe of ~144MB, an increase in memory allocation of ~10 fold for
>>> > >>> the split object. If split scales linearly, then my actual 52MB
>>> > >>> dataframe should be easily handled by my 12GB of RAM, but it is not.
>>> > >>> Instead, when I try to split selectSubAct.df on one of its factors
>>> > >>> with 1473 levels, my memory is slowly gobbled up (plus 3 GB of swap)
>>> > >>> until I cancel the operation.
>>> > >>>
>>> > >>> Any ideas on what might be happening? Thanks, Mark
>>> > >>>
>>> > >>
>>> > >> Each element of myDataFrame.split contains a copy of the attributes of
>>> > >> the parent data.frame.
>>> > >>
>>> > >> And probably it does scale linearly. But the scaling factor depends on
>>> > >> the size of the attributes that get copied, I guess.
>>> > >>
>>> > >>
>>> > >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>>> > >>> mySplitVar <- factor(as.character(1:1400))
>>> > >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> > >>> object.size(myDataFrame)
>>> > >>> ## 12860880 bytes # ~ 13MB
>>> > >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> > >>> object.size(myDataFrame.split)
>>> > >>> ## 144524992 bytes # ~ 144MB
>>> > >>>
>>> > >>
>>> > >> Note:
>>> > >>
>>> > >> only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
>>> > >> (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
>>> > >> 1.03726179240978 bytes
>>> > >>
>>> > >>> object.size(selectSubAct.df)
>>> > >>> ## 52,348,272 bytes # ~ 52MB
>>> > >>>
>>> > >>
>>> > >> What was this??
>>> > >>
>>> > >> Chuck
>>> > >>
>>> > >>> sessionInfo()
>>> > >>> R version 2.10.0 Patched (2009-10-27 r50222)
>>> > >>> x86_64-unknown-linux-gnu
>>> > >>>
>>> > >>> locale:
>>> > >>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> > >>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> > >>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>> > >>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> > >>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> > >>>
>>> > >>> attached base packages:
>>> > >>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>> > >>>
>>> > >>> loaded via a namespace (and not attached):
>>> > >>> [1] tools_2.10.0
>>> > >>>
>>> > >>> ______________________________________________
>>> > >>> R-help@r-project.org mailing list
>>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> > >>> PLEASE do read the posting guide
>>> > >>> http://www.R-project.org/posting-guide.html
>>> > >>> and provide commented, minimal, self-contained, reproducible code.
>>> > >>>
>>> > >>
>>> > >> Charles C. Berry                            (858) 534-2098
>>> > >>                                             Dept of Family/Preventive Medicine
>>> > >> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
>>> > >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>> >
>>> > --
>>> > http://had.co.nz/

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?