Charles, I suspect you are correct regarding copying of the attributes.
First off, selectSubAct.df is my "real" data, which turns out to have the
same dim() as myDataFrame below, but each column is made up of strings, not
single letters, and there are many levels in each column, which I did not
properly duplicate in my first example. I have amended that below, and with
the split the new object is now not 10X the size of the original but
100X. My "real" data is even more complex than this, so I suspect that is
where the problem lies. I need to search for a better solution to my problem
than split, for which I will start a separate thread if I can't figure
something out.

Thanks for pointing me in the right direction,

Mark

myDataFrame <- data.frame(matrix(paste("The rain in Spain",
                                       as.character(1:1400), sep = "."),
                                 ncol = 7, nrow = 399000))
mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400),
                           sep = "."))
myDataFrame <- cbind(myDataFrame, mySplitVar)
object.size(myDataFrame)
## 12860880 bytes # ~ 13MB
myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
object.size(myDataFrame.split)
## 1,274,929,792 bytes ~ 1.2GB
object.size(selectSubAct.df)
## 52,348,272 bytes # ~ 52MB
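
A small-scale sketch of the attribute copying discussed here, and of one possible way around it. It assumes the columns are stored as factors (stringsAsFactors = TRUE, which was the default in R 2.10.0 but is not in R >= 4.0.0), and the names small.df, pieces, idx and rows.per.group are hypothetical, chosen only for illustration. Sizes are reduced so it runs quickly; it is not a measurement of the real data.

```r
# Reduced version of the example above (50 levels, 5000 rows).
n_levels <- 50
n_rows   <- 5000

# stringsAsFactors = TRUE is set explicitly so the columns are factors,
# as they would have been by default in R 2.10.0.
small.df <- data.frame(
  matrix(paste("The rain in Spain", 1:n_levels, sep = "."),
         ncol = 7, nrow = n_rows),
  stringsAsFactors = TRUE
)
small.df$mySplitVar <- factor(
  rep(paste("Rainy days and Mondays", 1:n_levels, sep = "."),
      length.out = n_rows)
)

pieces <- split(small.df, small.df$mySplitVar)

# Each piece holds a single distinct string per column, yet every factor
# column still carries the full set of levels from the parent data frame.
nlevels(pieces[[1]]$X1)          # number of levels stored with this piece
length(unique(pieces[[1]]$X1))   # number of values actually present

# Possible alternative: split the row indices instead of the data frame,
# then subset one group at a time so only one piece exists at any moment.
idx <- split(seq_len(nrow(small.df)), small.df$mySplitVar)
rows.per.group <- sapply(idx, function(i) nrow(small.df[i, , drop = FALSE]))

object.size(pieces)
object.size(idx)
```

Splitting indices keeps only integer vectors in the list, so the per-piece copies of the factor levels are never all held in memory at once. Whether that is enough for the real data depends on what is done with each subset.
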
Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:

> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>
>> I'm having trouble using split on a very large data set with ~1400 levels
>> of the factor to be split. Unfortunately, I can't reproduce it with the
>> simple self-contained example below. As you can see, splitting the
>> artificial data frame of size ~13MB results in a split data frame of
>> ~144MB, a ~10-fold increase in memory allocation for the split object. If
>> split scales linearly, then my actual 52MB data frame should be easily
>> handled by my 12GB of RAM, but it is not. Instead, when I try to split
>> selectSubAct.df on one of its factors with 1473 levels, my memory is slowly
>> gobbled up (plus 3 GB of swap) until I cancel the operation.
>>
>> Any ideas on what might be happening? Thanks, Mark
>>
>
> Each element of myDataFrame.split contains a copy of the attributes of the
> parent data.frame.
>
> And probably it does scale linearly. But the scaling factor depends on the
> size of the attributes that get copied, I guess.
>
>
>
>
>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>> mySplitVar <- factor(as.character(1:1400))
>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>> object.size(myDataFrame)
>> ## 12860880 bytes # ~ 13MB
>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>> object.size(myDataFrame.split)
>> ## 144524992 bytes # ~ 144MB
>>
>
> Note:
>
> only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
> (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
>
> 1.03726179240978 bytes
>
>
>> object.size(selectSubAct.df)
>> ## 52,348,272 bytes # ~ 52MB
>>
>
> What was this??
>
>
> Chuck
>
>
>> sessionInfo()
>> R version 2.10.0 Patched (2009-10-27 r50222)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.10.0
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>
