On Monday 30 January 2012 at 09:54 +0100, Petr Kurtin wrote:
> Hi,
>
> I have got a lot of SPSS data for years 1993-2010. I load all data into
> lists so I can easily index the values over the years. Unfortunately loaded
> data occupy quite a lot of memory (10Gb) - so my question is, what's the
> best approach to work with big data files? Can R get a value from the file
> data without full loading into memory? How can a slower computer with not
> enough memory work with such data?
>
> I use the following commands:
>
> data1993 = vector("list", 4);
> data1993[[1]] = read.spss(...) # first trimester
> data1993[[2]] = read.spss(...) # second trimester
> ...
> data_all = vector("list", 17);
> data_all[[1993]] = data1993;
> ...
>
> and indexing, e.g.: data_all[[1993]][[1]]$DISTRICT, etc.

Have a look at the "Large memory and out-of-memory data" section of the High Performance Computing task view[1]. In particular, you may want to use the "ff" package and its ffdf object, which backs a data frame with a file on disk so that RAM can be freed when needed.
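
To give an idea of what working with an ffdf looks like, here is a minimal sketch. It assumes the "ff" package is installed and that each trimester has been read as a data frame (e.g. with to.data.frame = TRUE in read.spss). The toy data and every name except DISTRICT are made up for illustration, so take it as a starting point rather than a recipe:

```r
# Sketch only: requires the ff package (install.packages("ff")).
library(ff)

# Toy stand-in for one trimester. DISTRICT is the only column name taken
# from the original post; INCOME is hypothetical.
# ff stores factors rather than character vectors, so DISTRICT is a factor.
trimester <- data.frame(
  DISTRICT = factor(c("A", "B", "A", "C")),
  INCOME   = c(10.5, 12.0, 9.8, 11.1)
)

# Convert to an ffdf: the columns are now backed by files on disk.
trimester_ff <- as.ffdf(trimester)

# Columns are ff vectors; [] pulls the values into RAM when you need them.
district <- trimester_ff$DISTRICT[]

# Row indexing returns an ordinary in-memory data frame for that slice only.
first_rows <- trimester_ff[1:2, ]
```

With real data you would only pull into RAM the columns or row ranges needed for a given computation.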

Another piece of advice: convert the data from SPSS format to .RData once, and always use the latter afterwards. In my experience, importing often causes memory fragmentation, in addition to being very slow (don't hesitate to save, quit and restart R to reduce this problem).

What use do you make of the different years? If you need, e.g., to run a model on all of them at the same time, then you'll need to concatenate all the data frames from the "data_all" list, and I guess that's where RAM will be the problem: you'll have two copies of the data at the same time. Once you've succeeded in doing this, loading the full data set will use less RAM, and so may work on lower-end computers.

A general solution is also to load only the variables you really need. The "saves" package allows you to save the whole data set into an archive of several .RData files, and to load only what you want from it.

It all depends on your needs, constraints, and failed attempts. ;-)

Regards

1: http://cran.r-project.org/web/views/HighPerformanceComputing.html

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
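
P.S. For the "convert once, then load only what you need" advice, here is a rough base-R sketch. The toy data frames stand in for the read.spss() results, and the file names, the year_data object and the helper functions are hypothetical. Note that load() still reads a whole year at a time; only the selected variables are kept afterwards. It does not use the "saves" package.

```r
# Sketch only: toy data frames stand in for the imported SPSS files.
# All names except DISTRICT are hypothetical.
out_dir <- tempfile("years_")
dir.create(out_dir)

make_year <- function(year) {
  data.frame(YEAR = year,
             DISTRICT = c("A", "B"),
             INCOME = c(1, 2) * year,
             stringsAsFactors = FALSE)
}

# One-off conversion: write one .RData file per year.
for (year in 1993:1995) {
  year_data <- make_year(year)
  save(year_data, file = file.path(out_dir, paste0("data", year, ".RData")))
}

# Later sessions: load one year into a scratch environment and return
# only the variables that are actually needed.
load_year <- function(year, vars) {
  env <- new.env()
  load(file.path(out_dir, paste0("data", year, ".RData")), envir = env)
  env$year_data[, vars, drop = FALSE]
}

# Concatenate the reduced data frames for a model over all years.
needed <- c("YEAR", "DISTRICT")
all_years <- do.call(rbind, lapply(1993:1995, load_year, vars = needed))
```

Dropping unneeded variables before the rbind() step keeps the temporary second copy of the data as small as possible.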