Re: [R] problem with split eating giga-bytes of memory

Mark Kimpel Tue, 08 Dec 2009 20:27:47 -0800

Jim, could you provide a code snippet to illustrate what you mean?
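
For readers following the thread, here is a minimal sketch of the idea Jim describes in the quoted message: split the row indices and use them to look up rows in the original data frame, rather than splitting the data frame itself. The toy data and the names df, key, idx, g7 and group_means are assumptions made for illustration; they do not come from the original posts.

```r
# Hypothetical toy data; all names here are illustrative only.
set.seed(1)
df <- data.frame(
  key   = factor(rep(paste0("g", 1:50), each = 20)),
  value = rnorm(1000)
)

# Split the row indices rather than the data frame itself.
# The result is a named list of integer vectors, one per factor level.
idx <- split(seq_len(nrow(df)), df$key)

# Pull one group's rows out of the original data frame on demand.
g7 <- df[idx[["g7"]], ]

# Or compute per-group summaries without materialising every piece.
group_means <- vapply(idx, function(i) mean(df$value[i]), numeric(1))

# The index list is typically far smaller than the fully split data frame.
object.size(idx)
object.size(split(df, df$key))
```

The list of indices stores only integers, so none of the column attributes are copied for each group; the subsets are created only when they are needed.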

Hadley, good point, I did not know that.
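
A small sketch of the factor-versus-character comparison discussed in the quoted messages, scaled down from the example further down the thread. It assumes the columns are factors, as they were by default in R 2.10; stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer convert strings by default. The names df_factor, df_char, split_factor and split_char, and the reduced sizes, are illustrative assumptions.

```r
n_levels <- 200
labels <- paste("The rain in Spain", seq_len(n_levels), sep = ".")

# Factor columns, mimicking the old default behaviour of data.frame().
df_factor <- data.frame(matrix(labels, ncol = 3, nrow = n_levels * 20),
                        stringsAsFactors = TRUE)
df_factor$splitVar <- factor(rep(
  paste("Rainy days and Mondays", seq_len(n_levels), sep = "."),
  length.out = nrow(df_factor)
))

# Convert every factor column except the split variable to character.
df_char <- df_factor
is_fac <- vapply(df_char, is.factor, logical(1)) & names(df_char) != "splitVar"
df_char[is_fac] <- lapply(df_char[is_fac], as.character)

split_factor <- split(df_factor, df_factor$splitVar)
split_char   <- split(df_char, df_char$splitVar)

# Each piece of split_factor carries the full set of levels for every
# factor column; the character version does not.
object.size(split_factor)
object.size(split_char)
```

Note that object.size() does not account for memory shared between objects, so these figures are an upper bound on what is actually allocated.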


Mark

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


On Tue, Dec 8, 2009 at 11:00 PM, jim holtman <jholt...@gmail.com> wrote:

> Also instead of 'splitting' the data frame, I split the indices and then
> use those to access the information in the original dataframe.
>
>
> On Tue, Dec 8, 2009 at 9:54 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>
>> Hadley, Just as you were apparently writing I had the same thought and did
>> exactly what you suggested, converting all columns except the one that I
>> want split to character. Executed almost instantaneously without problem.
>> Thanks! Mark
>>
>>
>> On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com> wrote:
>>
>> > Hi Mark,
>> >
>> > Why are you using factors?  I think for this case you might find
>> > characters are faster and more space efficient.
>> >
>> > Alternatively, you can have a look at the plyr package which uses some
>> > tricks to keep memory usage down.
>> >
>> > Hadley
>> >
>> > On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>> > > Charles, I suspect you are correct regarding copying of the attributes.
>> > > First off, selectSubAct.df is my "real" data, which turns out to be of
>> > > the same dim() as myDataFrame below, but each column is made up of
>> > > strings, not simple letters, and there are many levels in each column,
>> > > which I did not properly duplicate in my first example. I have amended
>> > > that below, and with the split the new object size is now not 10X the
>> > > size of the original, but 100X. My "real" data is even more complex than
>> > > this, so I suspect that is where the problem lies. I need to search for
>> > > a better solution to my problem than split, for which I will start a
>> > > separate thread if I can't figure something out.
>> > >
>> > > Thanks for pointing me in the right direction,
>> > >
>> > > Mark
>> > >
>> > > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
>> > > as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
>> > > mySplitVar <- factor(paste("Rainy days and Mondays",
>> > >     as.character(1:1400), sep = "."))
>> > > myDataFrame <- cbind(myDataFrame, mySplitVar)
>> > > object.size(myDataFrame)
>> > > ## 12860880 bytes # ~ 13MB
>> > > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> > > object.size(myDataFrame.split)
>> > > ## 1,274,929,792 bytes ~ 1.2GB
>> > > object.size(selectSubAct.df)
>> > > ## 52,348,272 bytes # ~ 52MB
>> > >
>> > > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry
>> > > <cbe...@tajo.ucsd.edu> wrote:
>> > >
>> > >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>> > >>
>> > >>> I'm having trouble using split on a very large data-set with ~1400
>> > >>> levels of the factor to be split. Unfortunately, I can't reproduce it
>> > >>> with the simple self-contained example below. As you can see,
>> > >>> splitting the artificial dataframe of size ~13MB results in a split
>> > >>> dataframe of ~144MB, an increase in memory allocation of ~10 fold
>> > >>> for the split object. If split scales linearly, then my actual 52MB
>> > >>> dataframe should be easily handled by my 12GB of RAM, but it is not.
>> > >>> Instead, when I try to split selectSubAct.df on one of its factors
>> > >>> with 1473 levels, my memory is slowly gobbled up (plus 3 GB of swap)
>> > >>> until I cancel the operation.
>> > >>>
>> > >>> Any ideas on what might be happening? Thanks, Mark
>> > >>>
>> > >>
>> > >> Each element of myDataFrame.split contains a copy of the attributes of
>> > >> the parent data.frame.
>> > >>
>> > >> And probably it does scale linearly. But the scaling factor depends on
>> > >> the size of the attributes that get copied, I guess.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>> > >>> mySplitVar <- factor(as.character(1:1400))
>> > >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>> > >>> object.size(myDataFrame)
>> > >>> ## 12860880 bytes # ~ 13MB
>> > >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> > >>> object.size(myDataFrame.split)
>> > >>> ## 144524992 bytes # ~ 144MB
>> > >>>
>> > >>
>> > >> Note:
>> > >>
>> > >>   only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
>> > >>   (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
>> > >>   1.03726179240978 bytes
>> > >>
>> > >>
>> > >>> object.size(selectSubAct.df)
>> > >>> ## 52,348,272 bytes # ~ 52MB
>> > >>
>> > >> What was this??
>> > >>
>> > >>
>> > >> Chuck
>> > >>
>> > >>
>> > >>> sessionInfo()
>> > >>> R version 2.10.0 Patched (2009-10-27 r50222)
>> > >>> x86_64-unknown-linux-gnu
>> > >>>
>> > >>> locale:
>> > >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> > >>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> > >>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> > >>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> > >>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> > >>>
>> > >>> attached base packages:
>> > >>> [1] stats     graphics  grDevices datasets  utils     methods   base
>> > >>>
>> > >>> loaded via a namespace (and not attached):
>> > >>> [1] tools_2.10.0
>> > >>>
>> > >>>        [[alternative HTML version deleted]]
>> > >>>
>> > >>>
>> > >>> ______________________________________________
>> > >>> R-help@r-project.org mailing list
>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > >>> PLEASE do read the posting guide
>> > >>> http://www.R-project.org/posting-guide.html
>> > >>> and provide commented, minimal, self-contained, reproducible code.
>> > >>>
>> > >>>
>> > >> Charles C. Berry                            (858) 534-2098
>> > >>                                             Dept of Family/Preventive Medicine
>> > >> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
>> > >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>> > >>
>> > >>
>> > >>
>> > >
>> >
>> >
>> >
>> > --
>> > http://had.co.nz/
>> >
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
