On Jan 30, 2008 7:20 AM, Jay Emerson <[EMAIL PROTECTED]> wrote:
> I was surprised to observe the following difference between 2.4.1 and
> 2.6.0 after a long overdue upgrade a few months ago of our
> departmental server.  It wasn't a bug fix, but a subtle improvement.
> Here's the simplest example I could create.  The size is excessive, on
> the order of the Netflix Competition data.
>
> The integer matrix is about 1.12 GB, and if coerced to numeric it is
> 2.24 GB.  The peak memory consumption of the first (old) operation was
> 1.12 + 2.24 + 2.24 = 5.6 GB.  The peak memory consumption of the second
> (new) operation is 1.12 + 2.24 = 3.36 GB.  (See below)
>
> In contrast, if a numeric matrix is used, there are no differences
> between the versions (so the improvement seems related to the integer
> type or the decision when/how to do the coercion).  And of course I
> realize that x <- x + as.integer(1) is an option, but that isn't the
> point of this exercise.
>
> I'm curious, but also spending time on memory-related work.  Someone
> deserves a 'thank you' and a pat on the back for making this sort of
> improvement.  Surely someone can step forward and take a bow, and
> perhaps explain the nature of the improvement?
>
> On a related note, a new package bigmemoRy will be available soon,
> handling massive matrices of double, integer, short, or char in RAM.
> In Unix (sorry, Windows), these matrices can also be used with shared
> memory (with mutexes implemented) for parallel processing.  It's a
> niche market, obviously, ideal for data larger than 1 GB (roughly) but
> still within the boundaries of the RAM.  It may be a useful developer
> tool for big-data problems.
>
> ------------------------
> R version 2.4.1 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells    233754   12.5     467875     25    350000   18.7
> Vcells 300119431 2289.8  787870506   6011 750119944 5723.0
> ------------------------
> R version 2.6.0 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells    137931    7.4     350000   18.7    350000   18.7
> Vcells 300126402 2289.8  472877829 3607.8 450126789 3434.2
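
For reference, the sizes quoted above follow directly from the element count. This is only a minimal sketch: the helper name `gib` is made up for illustration, and it assumes 4-byte integers and 8-byte doubles while ignoring the small per-object header. The breakdown of the peaks into blocks is the original post's description, not something the code verifies.

```r
# Hypothetical helper (not from the thread): convert bytes to GiB
gib <- function(bytes) bytes / 2^30

n <- 1e8 * 3                # elements in matrix(as.integer(0), 1e+08, 3)
int_gib <- gib(n * 4)       # integer storage, assuming 4 bytes per element
dbl_gib <- gib(n * 8)       # double storage, assuming 8 bytes per element

# Peak as described for 2.4.1: integer matrix plus two double-sized blocks
old_peak <- int_gib + 2 * dbl_gib
# Peak as described for 2.6.0: integer matrix plus one double-sized result
new_peak <- int_gib + dbl_gib

round(c(integer = int_gib, double = dbl_gib,
        old_peak = old_peak, new_peak = new_peak), 2)
```

The two peaks come out close to the "max used" Vcells figures in the gc() output above (5723.0 Mb and 3434.2 Mb, i.e. roughly 5.59 and 3.35 when divided by 1024).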

That's interesting - I never noticed that change.  On the same topic,
in R 2.7.0 devel, the (re-)assignment in the following example no
longer creates an extra copy:

> x <- matrix(1, nrow=5000, ncol=5000)
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132056   7.1     350000  18.7   350000  18.7
Vcells 25136968 191.8   28050871 214.1 25137357 191.8
> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132060   7.1     350000  18.7   350000  18.7
Vcells 25136969 191.8   29533414 225.4 25137357 191.8

In R 2.6.1 that 2nd assignment would result in:

> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   138119   7.4     350000  18.7   350000  18.7
Vcells 25126464 191.7   52877950 403.5 50126482 382.5

See https://stat.ethz.ch/pipermail/r-devel/2007-September/047008.html
for background.

Thanks a lot to whoever (Luke?) took the time to update matrix().

/Henrik

>
>
> --
> John W. Emerson (Jay)
> Assistant Professor of Statistics
> Director of Graduate Studies (on leave 07-08)
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
> Statistical Consultant, REvolution Computing
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel