Re: [R] problem with split eating giga-bytes of memory

jim holtman Wed, 09 Dec 2009 04:21:44 -0800

Here is an example that compares splitting the data frame itself with
splitting only its row indices:


> # create test data
> N <- 1000000
> x <- data.frame(a=sample(LETTERS, N, TRUE), b=sample(letters, N, TRUE),
+     c=as.numeric(1:N), d=runif(N))
> system.time({
+     x.df <- split(x, x$a)  # split
+     print(sapply(x.df, function(a) sum(a$c)))
+ })
          A           B           C           D           E           F
19132375146 19261600080 19290064552 19355472666 19143448231 18973627622
          G           H           I           J           K           L
19278423676 19362576931 19405443596 19295695044 19052377988 19236047192
          M           N           O           P           Q           R
19143226220 19197703946 19297192525 19129252399 19272964991 19315856972
          S           T           U           V           W           X
19355660155 19303178409 19242322477 19081573240 19309444512 19077003863
          Y           Z
19259313705 19228653862
   user  system elapsed
   1.27    0.02    1.28
> # now use indices
> system.time({
+     x.indx <- split(seq_len(nrow(x)), x$a)  # create list of indices
+     print(sapply(x.indx, function(a) sum(x$c[a])))
+ })
          A           B           C           D           E           F
19132375146 19261600080 19290064552 19355472666 19143448231 18973627622
          G           H           I           J           K           L
19278423676 19362576931 19405443596 19295695044 19052377988 19236047192
          M           N           O           P           Q           R
19143226220 19197703946 19297192525 19129252399 19272964991 19315856972
          S           T           U           V           W           X
19355660155 19303178409 19242322477 19081573240 19309444512 19077003863
          Y           Z
19259313705 19228653862
   user  system elapsed
   0.23    0.00    0.23
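
The timings above cover speed only. A minimal sketch along the same lines, using a smaller N and a fixed seed (both arbitrary choices made here for illustration, as are the names by_df, by_idx, sums_df and sums_idx), checks that the two approaches give the same sums and compares the reported size of the two split objects:

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
N <- 1e5                          # smaller than the 1e6 used above
x <- data.frame(a = sample(LETTERS, N, TRUE),
                c = as.numeric(seq_len(N)),
                d = runif(N))

# Approach 1: split the whole data frame; every column is copied per group
by_df <- split(x, x$a)
sums_df <- sapply(by_df, function(g) sum(g$c))

# Approach 2: split only the row indices, then index into the original
by_idx <- split(seq_len(nrow(x)), x$a)
sums_idx <- sapply(by_idx, function(i) sum(x$c[i]))

# Both approaches should produce the same named vector of sums
stopifnot(isTRUE(all.equal(sums_df, sums_idx)))

# The list of indices holds one integer per row; the list of data frames
# holds every column plus per-piece attributes
print(object.size(by_df), units = "Kb")
print(object.size(by_idx), units = "Kb")
```

The index list only stores row numbers, so the original data frame stays the single copy of the data and each group is materialized only for as long as the function applied to it runs.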


On Tue, Dec 8, 2009 at 10:26 PM, Mark Kimpel <mwkim...@gmail.com> wrote:

> Jim, could you provide a code snippet to illustrate what you mean?
>
> Hadley, good point, I did not know that.
>
> Mark
>
> Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN  46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please
>
>
>   On Tue, Dec 8, 2009 at 11:00 PM, jim holtman <jholt...@gmail.com> wrote:
>
>> Also instead of 'splitting' the data frame, I split the indices and then
>> use those to access the information in the original dataframe.
>>
>>
>> On Tue, Dec 8, 2009 at 9:54 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>>
>>> Hadley, Just as you were apparently writing I had the same thought and
>>> did exactly what you suggested, converting all columns except the one
>>> that I want split to character. Executed almost instantaneously without
>>> problem. Thanks! Mark
>>>
>>>
>>> On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com> wrote:
>>>
>>> > Hi Mark,
>>> >
>>> > Why are you using factors?  I think for this case you might find
>>> > characters are faster and more space efficient.
>>> >
>>> > Alternatively, you can have a look at the plyr package which uses some
>>> > tricks to keep memory usage down.
>>> >
>>> > Hadley
>>> >
>>> > On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
>>> > > Charles, I suspect you are correct regarding copying of the attributes.
>>> > > First off, selectSubAct.df is my "real" data, which turns out to be of
>>> > > the same dim() as myDataFrame below, but each column is made up of
>>> > > strings, not simple letters, and there are many levels in each column,
>>> > > which I did not properly duplicate in my first example. I have amended
>>> > > that below, and with the split the new object size is now not 10X the
>>> > > size of the original, but 100X. My "real" data is even more complex than
>>> > > this, so I suspect that is where the problem lies. I need to search for
>>> > > a better solution to my problem than split, for which I will start a
>>> > > separate thread if I can't figure something out.
>>> > >
>>> > > Thanks for pointing me in the right direction,
>>> > >
>>> > > Mark
>>> > >
>>> > > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
>>> > >     as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
>>> > > mySplitVar <- factor(paste("Rainy days and Mondays",
>>> > >     as.character(1:1400), sep = "."))
>>> > > myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> > > object.size(myDataFrame)
>>> > > ## 12860880 bytes # ~ 13MB
>>> > > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> > > object.size(myDataFrame.split)
>>> > > ## 1,274,929,792 bytes ~ 1.2GB
>>> > > object.size(selectSubAct.df)
>>> > > ## 52,348,272 bytes # ~ 52MB
>>> > >
>>> > > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
>>> > >
>>> > >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>>> > >>
>>> > >>> I'm having trouble using split on a very large data-set with ~1400
>>> > >>> levels of the factor to be split. Unfortunately, I can't reproduce it
>>> > >>> with the simple self-contained example below. As you can see,
>>> > >>> splitting the artificial dataframe of size ~13MB results in a split
>>> > >>> dataframe of ~144MB, an increase in memory allocation of ~10 fold for
>>> > >>> the split object. If split scales linearly, then my actual 52MB
>>> > >>> dataframe should be easily handled by my 12GB of RAM, but it is not.
>>> > >>> Instead, when I try to split selectSubAct.df on one of its factors
>>> > >>> with 1473 levels, my memory is slowly gobbled up (plus 3 GB of swap)
>>> > >>> until I cancel the operation.
>>> > >>>
>>> > >>> Any ideas on what might be happening? Thanks, Mark
>>> > >>>
>>> > >>
>>> > >> Each element of myDataFrame.split contains a copy of the attributes of
>>> > >> the parent data.frame.
>>> > >>
>>> > >> And probably it does scale linearly. But the scaling factor depends on
>>> > >> the size of the attributes that get copied, I guess.
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>>> > >>> mySplitVar <- factor(as.character(1:1400))
>>> > >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> > >>> object.size(myDataFrame)
>>> > >>> ## 12860880 bytes # ~ 13MB
>>> > >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> > >>> object.size(myDataFrame.split)
>>> > >>> ## 144524992 bytes # ~ 144MB
>>> > >>>
>>> > >>
>>> > >> Note:
>>> > >>
>>> > >> > only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
>>> > >> > (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
>>> > >> 1.03726179240978 bytes
>>> > >>
>>> > >>> object.size(selectSubAct.df)
>>> > >>> ## 52,348,272 bytes # ~ 52MB
>>> > >>
>>> > >> What was this??
>>> > >>
>>> > >>
>>> > >> Chuck
>>> > >>
>>> > >>
>>> > >>> > sessionInfo()
>>> > >>> R version 2.10.0 Patched (2009-10-27 r50222)
>>> > >>> x86_64-unknown-linux-gnu
>>> > >>>
>>> > >>> locale:
>>> > >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> > >>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> > >>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>> > >>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> > >>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> > >>>
>>> > >>> attached base packages:
>>> > >>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>> > >>>
>>> > >>> loaded via a namespace (and not attached):
>>> > >>> [1] tools_2.10.0
>>> > >>>
>>> > >>>        [[alternative HTML version deleted]]
>>> > >>>
>>> > >>>
>>> > >>> ______________________________________________
>>> > >>> R-help@r-project.org mailing list
>>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> > >>> PLEASE do read the posting guide
>>> > >>> http://www.R-project.org/posting-guide.html
>>> > >>> and provide commented, minimal, self-contained, reproducible code.
>>> > >>>
>>> > >>>
>>> > >> Charles C. Berry                            (858) 534-2098
>>> > >>                                             Dept of Family/Preventive Medicine
>>> > >> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
>>> > >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>> > >>
>>> > >>
>>> > >>
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > http://had.co.nz/
>>> >
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
>
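
Two points from the quoted thread above can be checked together: Charles's observation that every piece of the split carries a copy of the parent's attributes, and Hadley's suggestion to use character columns instead of factors. The following is a small sketch under stated assumptions: the group count k, the column names g and v, and the object names are made up for illustration and are not Mark's data. Note also that, as far as I know, object.size() does not account for memory shared between objects, so the numbers indicate reported size, not necessarily the memory actually allocated.

```r
k <- 200                                     # hypothetical number of groups
g <- rep(paste("group", seq_len(k), sep = "."), each = 10)

# Same data twice: grouping column stored as factor vs. as character
df_fac <- data.frame(g = factor(g), v = seq_along(g))
df_chr <- data.frame(g = g, v = seq_along(g), stringsAsFactors = FALSE)

pieces_fac <- split(df_fac, df_fac$g)
pieces_chr <- split(df_chr, df_chr$g)

# Each piece of the factor version keeps all k levels, although it
# contains rows from only one group
stopifnot(all(sapply(pieces_fac, function(p) nlevels(p$g)) == k))

# Reported sizes: k pieces times k levels each for the factor version
print(object.size(pieces_fac), units = "Kb")
print(object.size(pieces_chr), units = "Kb")
```

With k pieces each carrying k levels, the reported size of the factor version grows roughly with the square of the number of levels, which is consistent with the blow-up Mark saw once his columns had many distinct strings.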

