Re: [Rd] Pb with agrep()
> "Herve" == Herve Pages <[EMAIL PROTECTED]> > on Wed, 04 Jan 2006 17:29:35 -0800 writes: Herve> Happy new year everybody, Herve> I'm getting the following while trying to use the agrep() function: >> pattern <- "XXX" >> subject <- c("oo", "oooXooo", "oooXXooo", "oooXXXooo") >> max <- list(ins=0, del=0, sub=0) # I want exact matches only >> agrep(pattern, subject, max=max) Herve> [1] 4 Herve> OK >> max$sub <- 1 # One allowed substitution >> agrep(pattern, subject, max=max) Herve> [1] 3 4 Herve> OK >> max$sub <- 2 # Two allowed substitutions >> agrep(pattern, subject, max=max) Herve> [1] 3 4 Herve> Wrong! No. You have overlooked the fact that 'max.distance = 0.1' (10%) *remains* the default, even when 'max.distance' is specified as a list as in your example [from "?agrep" ] : >> max.distance: Maximum distance allowed for a match. Expressed either >> as integer, or as a fraction of the pattern length (will be >> replaced by the smallest integer not less than the >> corresponding fraction), or a list with possible components >> >> 'all': maximal (overall) distance >> >> 'insertions': maximum number/fraction of insertions >> >> 'deletions': maximum number/fraction of deletions >> >> 'substitutions': maximum number/fraction of substitutions >> >> If 'all' is missing, it is set to 10%, the other components >> default to 'all'. The component names can be abbreviated. If you specify max$all as "100%", i.e, as 0. ('< 1' !) everything works as you expect it: agrep(pattern, subject, max = list(ins=0, del=0, sub= 2, all = 0.)) ## --> 2 3 4 Martin Maechler, ETH Zurich __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Using gcc4 visibility features
R-devel now makes use of gcc4's visibility features: for an in-depth
account see

  http://people.redhat.com/drepper/dsohowto.pdf

(and note there are older versions of that document around).

Consider for example stats.so.  On a gcc4 Linux system this has just
three entry points:

  gannet% nm -g stats.so | grep " T "
  2720     T R_init_stats
  0004a544 T _fini
  1f28     T _init

since the only entry point we need is the symbol registration.  This
results in a smaller DSO and a faster load.  It is only worth doing
for shared objects with many entry points, but this one had 262.

It also gives another reason for the registration of symbols, as this
is the only way I know to hide Fortran entry points (except to hide
them all, which will hide them from .Fortran).  Until recently
registration was used in the standard packages and a handful of
others (not including any recommended packages).  You can copy the
way it is done in package stats (see PKG_* in Makefile.in and
init.c).

The next step will be to prune libR.so down to something close to the
documented API (it currently has 1816 exported symbols).

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
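As a rough sketch of the technique described above (not the actual
stats init.c), symbol registration combined with gcc4's visibility
attribute for a hypothetical package 'mypkg' might look like this; it
assumes the package is compiled with -fvisibility=hidden (e.g. via
PKG_CFLAGS or PKG_CXXFLAGS):

  // init.cpp -- sketch only: registration for a hypothetical package
  // 'mypkg', with everything hidden except the registration entry
  // point.  Build the whole package with -fvisibility=hidden.
  #include <R.h>
  #include <R_ext/Rdynload.h>

  extern "C" void my_c_routine(double *x, int *n);  // defined elsewhere

  static const R_CMethodDef CEntries[] = {
      {"my_c_routine", (DL_FUNC) &my_c_routine, 2},
      {NULL, NULL, 0}
  };

  // The registration routine is the one symbol that must stay
  // visible; R looks it up by name when the DSO is loaded.
  extern "C" __attribute__((visibility("default")))
  void R_init_mypkg(DllInfo *dll)
  {
      R_registerRoutines(dll, CEntries, NULL, NULL, NULL);
  }

With the routine registered, .C("my_c_routine", ...) is resolved
through the registration table, so the symbol itself no longer needs
to be an exported entry point of the DSO.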
Re: [Rd] Using STL containers in R/C++
Andrew Finley <... umn.edu> writes:
> I am in the process of writing an R extension in C++ and am using
> several STL containers (e.g., vector, map, multimap).  I make sure
> to clear all these containers at the end of the .Call.  Everything
> compiles and runs just fine, but I'm a bit worried since I haven't
> found any other packages that use STL.

RQuantLib, a wrapper to the QuantLib libraries, has indirect exposure
to Boost.  [QuantLib uses Boost smart pointers, and unit testing.]
However, I have kept the actual R-to-QuantLib interface very 'plain
C' to keep it simple and sane.

Dominick Samperi wrote an Rcpp.{hpp,cpp} class for the C++-to-R
interface that is used in RQuantLib.  Dominick was musing about
releasing this stand-alone to CRAN as well, but I don't think it has
happened.

> So, my question: is it OK to use the STL containers in my code?

In my book, yes.  It would be nice to have a few clean, documented
examples.

> Will the use of this library somehow limit portability of this
> package?

I don't see why.  Effectively, R goes where gcc/g++ go, so you should
be fine as long as you stay within the bounds of g++.  It will create
an external dependency, though, and those do confuse users from time
to time (cf. the r-help archives).

> Or cause memory management problems?

If you have a bug, yes.  If you don't, you don't.  The R Extensions
manual has a few tips on the matter.

Cheers, Dirk

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
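For what it's worth, a minimal sketch of what such code can look
like: a hypothetical .Call entry point using std::vector as scratch
space, not taken from RQuantLib or any package mentioned in this
thread (it assumes 'x' is already a double vector; a real routine
would coerce or check):

  // cumsum_stl.cpp -- sketch only: a .Call routine using an STL
  // container.  Hypothetical example.
  #include <vector>
  #include <algorithm>
  #include <R.h>
  #include <Rinternals.h>

  extern "C" SEXP cumsum_stl(SEXP x)
  {
      const int n = LENGTH(x);
      std::vector<double> acc(n);   // freed by its destructor on return
      double s = 0.0;
      for (int i = 0; i < n; ++i) {
          s += REAL(x)[i];
          acc[i] = s;
      }
      SEXP ans = PROTECT(allocVector(REALSXP, n));
      std::copy(acc.begin(), acc.end(), REAL(ans));
      UNPROTECT(1);
      return ans;
  }

The one memory-management wrinkle worth knowing: R's error()
longjmps out of the C++ frame and destructors do not run, so anything
an STL container owns at that moment leaks.  If allocVector() above
failed, 'acc' would leak; for small scratch buffers that is usually
tolerable, but it pays to keep R calls that can fail away from
regions where C++-owned memory is live.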
Re: [Rd] Using gcc4 visibility features
> Subject: [Rd] Using gcc4 visibility features
>
> R-devel now makes use of gcc4's visibility features: for an in-depth
> account see
>
>   http://people.redhat.com/drepper/dsohowto.pdf

does this mean that we now have a dependency on gcc4, or just that it
"can" use the feature of gcc4?  clarification, please.

thanks,

--elijah

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Using STL containers in R/C++
You might want to take a look at the Bioconductor package RBGL, which
provides an R interface to the BGL (Boost Graph Library) and is quite
C++/STL heavy, I believe.

+ seth

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Problems with calloc function.
Hi Marcelo,

You need to read the R Extensions manual more carefully...
Basically, you haven't dereferenced the pointers.  You think you are
allocating, say, col = 2 ints, but you are actually allocating &col
ints - the address of col, which is a large number since user-land
memory addresses start at a large offset (0x4000... = 1/2 GB, or
0x8000... = 1 GB?) - so after 4 large allocations you run out of
memory.

Everything through the .C interface is passed by pointer, in the
Fortran convention.  BTW, you should use Rprintf() instead of
printf().  Details below.

Hin-Tak Leung

Marcelo Damasceno wrote:
> Hello all and Prof. Brian Ripley,
>
> Sorry about my incautiousness.  I used the tips, but the same
> problems happen.  Below is the R code.
>
> rspcplot <- function(file1="dg_01.lab.txt"){
>   if(is.null(n))
>     stop("You need first run the function cluster")
>   file <- "dg_01.lab.txt"
>   aux <- file1
>   file1 <- pmatch(file1, file)
>   if(is.na(file1)){
>     matrix = loadMatrix(file, n)
>   } else {
>     matrix = loadMatrix(aux, n)
>   }
>   matrixc <- correct(matrix)
>   # merge2(matrixc)
>   nrow = nrow(matrixc)
>   ncol = ncol(matrixc)
>   ntemp = getTemp()
>   out <- .C("merge2", matrixc, nrow, ncol, ntemp,
>             outMerge=as.integer(0), outHeight=as.integer(0),
>             PACKAGE="rspc")
>
> Below is the C code.
>
> void merge2(int *nmat, int nrow, int ncol, int *ntemp, int ntam,
>             int *out, int *height){

Here you should have "int *ncol" instead of "int ncol" (I am only
correcting this one - you can change the others), like this:

  void merge2(int *nmat, int nrow, int *ncol, int *ntemp, int ntam,
              int *out, int *height){

>   int row, col, *temp, i, j, k, n3, tam, x, aux2, n1;
>   row = nrow;
>   col = ncol;

You should use here:

  int col = *ncol;

>   int *temp1, *temp2, *temp3, *temp4;
>
>   temp1 = (int*)Calloc(col, int);

Inserting here:

  Rprintf("I am trying to allocate col = %d\n", col);

would have told you what's wrong with your code...

>   printf("OK1 \n");
>   temp2 = (int*)Calloc(col, int);
>   printf("OK2 \n");
>   temp3 = (int *)Calloc(col, int);
>   printf("OK3 \n");
>   temp4 = (int *)Calloc(col, int);
>   if(temp4 == NULL){
>     printf("\n\n No Memory4!");
>     exit(1);
>   }
>   printf("OK4\n");
>   int *cvector;
>   cvector = (int *)Calloc(col, int);
>   if(cvector == NULL){
>     printf("\n\n No Memory5!");
>     exit(1);
>   }
>   printf("OK5\n");
>   tam = ntam;
>
> Output of the workspace:
>
> > rspcplot()
> Read 525 items
> Read 101 items
> OK1
> OK2
> OK3
> OK4
> Error in rspcplot() : Calloc could not allocate (145869080 of 4) memory
>
> Using the Ruspini data, the values of the variables are col = 2 and
> row = 75.  I was thinking that the number of pointers and the amount
> of memory were too big.
>
> Thanks All!

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
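Put compactly, the convention at issue (a hypothetical reduced
signature, not Marcelo's full routine): every argument crossing .C
arrives as a pointer, so scalars must be dereferenced before use, and
R's Calloc() raises an R error on failure rather than returning NULL:

  // sketch only: the .C calling convention, hypothetical names
  #include <R.h>

  extern "C" void merge2_sketch(int *nmat, int *nrow, int *ncol)
  {
      int row = *nrow;   // the value - not the pointer itself,
      int col = *ncol;   // which would be an address, a huge number
      Rprintf("allocating %d ints\n", col);
      int *temp = Calloc(col, int);  // error()s on failure, so no
                                     // NULL check or exit(1) needed
      /* ... use temp[0..col-1] and nmat[0..row*col-1] ... */
      Free(temp);
  }

This is consistent with the transcript above: the count 145869080 in
"Calloc could not allocate (145869080 of 4)" looks like an address,
not a column count.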
[Rd] random seed question
Greetings,

I am trying to write a C++ subroutine for my random number generator.
Based on the tutorial in Writing R Extensions, one should call
GetRNGstate() before and PutRNGstate() after calling R's random
variate generation routines.  Now suppose my function generates n
(n > 1) numbers in a loop: do I need to call these two functions at
each iteration?  This certainly increases the computational burden
(although it is indiscernible in my case).  But if I only call them
once (outside the loop), will the quality of my random numbers be
reduced because of serial correlations in the RNG algorithms?  I have
compared the two methods and found no significant difference.  How do
you write your RNG functions?  Any other specific reasons?  Thanks,

--
I am Tib, not Rob.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] random seed question
Just one call to each, enclosing the RNG calls, will do, I believe.

Andy

From: Tib
> I am trying to write a C++ subroutine for my random number
> generator.  Based on the tutorial in Writing R Extensions, one
> should call GetRNGstate() before and PutRNGstate() after calling R's
> random variate generation routines.  Now suppose my function
> generates n (n > 1) numbers in a loop: do I need to call these two
> functions at each iteration?  [...]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Using gcc4 visibility features
> "elijah" == elijah wright <[EMAIL PROTECTED]> > on Thu, 5 Jan 2006 09:13:15 -0600 (CST) writes: >> Subject: [Rd] Using gcc4 visibility features >> >> R-devel now makes use of gcc4's visibility features: for >> an in-depth account see >> >> http://people.redhat.com/drepper/dsohowto.pdf elijah> does this mean that we now have a dependency on elijah> gcc4, or just that it "can" use the feature of gcc4? the latter (of course!) elijah> clarification, please. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Problems with calloc function.
On Tue, 3 Jan 2006, Marcelo Damasceno wrote:

> Sorry about my incautiousness.  I used the tips, but the same
> problems happen.

Not really.  You definitely skipped over the most important one -
don't terminate the host process.

> if(temp4 == NULL){
>   printf("\n\n No Memory4!");
>   exit(1);
> }

should be

  if (temp4 == NULL) {
      error("memory allocation failed for 'temp4'");
      /*NOTREACHED*/
  }

--
SIGSIG -- signature too long (core dumped)

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Using gcc4 visibility features
>>> R-devel now makes use of gcc4's visibility features: for an
>>> in-depth account see

> elijah> does this mean that we now have a dependency on gcc4, or
> elijah> just that it "can" use the feature of gcc4?
>
> the latter (of course!)

that's what i expected, i was just checking :)

--elijah

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] extending lattice to S4 classes
On 10/20/05, ernesto <[EMAIL PROTECTED]> wrote:

[...]

> Hi Deepayan,
>
> I see that there are alternatives; I found one myself that works and
> is transparent for the user.
>
> I don't want to implement solutions that force the user to use
> lattice methods differently from your implementation.
>
> The cleanest solution, as you say, is to add a data argument to
> xyplot, but I can't do it myself, so I would not propose it.

FYI, I have added the 'data' argument to the high-level generics in
lattice_0.13-1, available for r-devel (of course this causes the
current version of FLCore to fail).

Deepayan

--
http://www.stat.wisc.edu/~deepayan/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] random seed question
On 1/5/2006 11:44 AM, Tib wrote:

> Based on the tutorial in Writing R Extensions, one should call
> GetRNGstate() before and PutRNGstate() after calling R's random
> variate generation routines.  Now suppose my function generates
> n (n > 1) numbers in a loop: do I need to call these two functions
> at each iteration?  [...]  But if I only call them once (outside the
> loop), will the quality of my random numbers be reduced because of
> serial correlations in the RNG algorithms?

Call them once, outside the loop.  What they do is move the RNG state
between the R workspace and a place where the C functions can use it.
Wrapping each call in the loop in those calls will just generate a
lot of unnecessary moves.

Not doing the calls, or not pairing them up properly, is the
dangerous thing that would lead to damage to the RNG algorithms,
because the state would not be updated properly.

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
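In code, the pattern described here - one pair bracketing the whole
loop - looks like this (a hypothetical .C routine; the RNG calls are
the standard ones from R's C API):

  // rng_demo.cpp -- sketch only: one GetRNGstate()/PutRNGstate()
  // pair around the whole loop, not one pair per draw.
  #include <R.h>

  extern "C" void rng_demo(double *out, int *n)
  {
      GetRNGstate();              // copy R's RNG state in, once
      for (int i = 0; i < *n; ++i)
          out[i] = norm_rand();   // or unif_rand(), exp_rand(), ...
      PutRNGstate();              // write the advanced state back, once
  }

The state advances through all n draws exactly as it would inside R,
so there is no serial-correlation penalty; what matters for
correctness is only that the pair is properly matched around every
region that draws.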
[Rd] Q: R 2.2.1: Memory Management Issues?
Dear Developers:

I have a question about memory management in R 2.2.1 and am wondering
if you would be kind enough to help me understand what is going on.
(It has been a few years since I have done software development on
Windows, so I apologize in advance if these are easy questions.)

- MY SYSTEM -

I am currently using R (version 2.2.1) on a PC running Windows 2000
(Intel Pentium M) that has 785,328 KB (a little over 766 MB) of
physical RAM.  The R executable resides on the C drive, which is of
NTFS format, has 15.08 GB of free space, and has recently been
defragmented.  The report for that defragmented drive gives:

  Volume (C:):
    Volume size                 = 35,083 MB
    Cluster size                = 512 bytes
    Used space                  = 19,642 MB
    Free space                  = 15,440 MB
    Percent free space          = 44 %
  Volume fragmentation
    Total fragmentation         = 1 %
    File fragmentation          = 2 %
    Free space fragmentation    = 0 %
  File fragmentation
    Total files                 = 121,661
    Average file size           = 193 KB
    Total fragmented files      = 64
    Total excess fragments      = 146
    Average fragments per file  = 1.00
  Pagefile fragmentation
    Pagefile size               = 768 MB
    Total fragments             = 1
  Directory fragmentation
    Total directories           = 7,479
    Fragmented directories      = 2
    Excess directory fragments  = 3
  Master File Table (MFT) fragmentation
    Total MFT size              = 126 MB
    MFT record count            = 129,640
    Percent MFT in use          = 99 %
    Total MFT fragments         = 4

- PROBLEM -

I am trying to run an R script which makes use of the MCLUST package.
The script can successfully read in the approximately 17000 data
points, but then throws an error:

  Error: cannot allocate vector of size 1115070Kb
  In addition: Warning messages:
  1: Reached total allocation of # Mb: see help(memory.size)
  2: Reached total allocation of # Mb: see help(memory.size)
  Execution halted

after attempting the line:

  summary(EMclust(y), y)

which is computationally intensive (it performs a "deconvolution" of
the data into a series of Gaussian peaks), and where # is either 766
or 2048 (depending on the max memory size I set).

The call I make is to Rterm.exe (to try to avoid Windows overhead):

  "C:\Program Files\R\R-2.2.1\bin\Rterm.exe" --no-save --no-restore
    --vanilla --silent --max-mem-size=766M
    < "C:\Program Files\R\R-2.2.1\dTest.R"

(I have also tried it with 2048M, but with the same lack of success.)

- QUESTIONS -

(1) I had initially thought that Windows 2000 should be able to
    allocate up to about 2 GB of memory.  So why is there a problem
    allocating a little over 1 GB on a defragmented disk with over
    15 GB free?  (Is this a pagefile size issue?)

(2) Do you think the origin of the problem is
    (a) the R environment, or
    (b) the function in the MCLUST package using an in-memory instead
        of an on-disk approach?

(3) (a) If the problem originates in the R environment, would
        switching to the Linux version of R solve the problem?
    (b) If the problem originates in the function in the MCLUST
        package, whom do I need to contact to get more information
        about re-writing the source code to handle large datasets?

Information I have located on overcoming Windows 2000 memory
allocation limits
[http://www.rsinc.com/services/techtip.asp?ttid=3346;
http://www.petri.co.il/pagefile_optimization.htm] does not seem to
help me understand this any better.

I had initially upgraded to R version 2.2.1 because I had read
[https://svn.r-project.org/R/trunk/src/gnuwin32/CHANGES/]:

  R 2.2.1
  =======
  Using the latest binutils allows us to distribute RGui.exe and
  Rterm.exe as large-address-aware (see the rw-FAQ Q2.9).

  The maximum C stack size for RGui.exe and Rterm.exe has been
  increased to 10Mb (from 2Mb); this is comparable with the default
  on Linux systems and may allow some larger programs to run without
  crashes.

... and also from the Windows FAQ
[http://cran.r-project.org/bin/windows/base/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021]:

  2.9 There seems to be a limit on the memory it uses!

  Indeed there is.  It is set by the command-line flag --max-mem-size
  (see How do I install R for Windows?) and defaults to the smaller
  of the amount of physical RAM in th
Re: [Rd] Using STL containers in R/C++
Dirk Eddelbuettel wrote:
> Dominick Samperi wrote an Rcpp.{hpp,cpp} class for the C++-to-R
> interface that is used in RQuantLib.  Dominick was musing about
> releasing this stand-alone to CRAN as well, but I don't think it has
> happened.

It just happened.  I uploaded Rcpp to CRAN today.  The package
contains a PDF file, Rcpp.pdf, that describes the package and the
class library.

Dominick

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Q: R 2.2.1: Memory Management Issues?
Karen,

On Jan 5, 2006, at 5:18 PM, <[EMAIL PROTECTED]> wrote:

> I am trying to run an R script which makes use of the MCLUST
> package.  The script can successfully read in the approximately
> 17000 data points, but then throws an error:
>
>   Error: cannot allocate vector of size 1115070Kb

This is 1.1GB of RAM to allocate for one vector alone(!).  As you
stated yourself, the total upper limit is 2GB, so you cannot even fit
two of those in memory anyway - not much you can do with it even if
it is allocated.

> summary(EMclust(y), y)

I suspect that memory is the least of your problems.  Did you even
try to run EMclust on a small subsample?  I suspect that if you did,
you would figure out that what you are trying to do is not likely to
terminate within days...

> (1) I had initially thought that Windows 2000 should be able to
> allocate up to about 2 GB of memory.  So why is there a problem
> allocating a little over 1 GB on a defragmented disk with over
> 15 GB free?  (Is this a pagefile size issue?)

Because that is not the only 1GB vector that is allocated.  Your
"15GB/defragmented" is irrelevant - if anything, look at how much
virtual memory is set up in your system's preferences.

> (2) Do you think the origin of the problem is
>     (a) the R environment, or
>     (b) the function in the MCLUST package using an in-memory
>         instead of an on-disk approach?

Well, a toy example of 17000x2 needs 2.3GB and is unlikely to
terminate anytime soon, so I'd rather call it shooting with the wrong
gun.  Maybe you should consider a different approach to your problem
- possibly ask on the BioConductor list, because people there have
more experience with large data, and this is not really a technical
question about R but rather one of how to apply statistical methods.

> (3) (a) If the problem originates in the R environment, would
>         switching to the Linux version of R solve the problem?

Any reasonable unix will do - technically (64-bit versions
preferably, but in your case even 32-bit would do).  Again, I don't
think memory is your only problem here, though.

Cheers,
Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
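A quick pair-count check supports this reading: the failed allocation
is exactly the size of one double per pair of observations, i.e. the
size a full pairwise dissimilarity structure would take.  Taking
n = 16896 (an inferred value, consistent with the "approximately
17000" points above):

  \frac{n(n-1)}{2} \times 8 \,\text{bytes}
    = \frac{16896 \times 16895}{2} \times 8
    = 1\,141\,831\,680 \,\text{bytes}
    = 1\,115\,070 \,\text{Kb}

which matches the reported "cannot allocate vector of size 1115070Kb"
to the byte, so whatever intermediate object fails here scales as
n^2, not n.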
Re: [Rd] Q: R 2.2.1: Memory Management Issues?
Dear Simon,

Thank you for taking the time to address my questions.

>> summary(EMclust(y), y)
>
> I suspect that memory is the least of your problems.  Did you even
> try to run EMclust on a small subsample?  I suspect that if you did,
> you would figure out that what you are trying to do is not likely to
> terminate within days...

The empirically derived limit on my machine (under R 1.9.1) was
approximately 7500 data points.  I have been able to successfully run
the script that uses package MCLUST on several hundred smaller data
sets.

I even had written a work-around for the case of more than 9600 data
points (the limit when using R 2.2.1).  My work-around first orders
the points by their value and then takes a sample (e.g. every other
point, or 1 point every n points) in order to bring the number under
9600.  No problems with the computations were observed, but you are
correct that a deconvolution of that larger dataset of 9600 takes
almost 30 minutes.  However, for our purposes, we do not have many
datasets over 9600, so the time is not a major constraint.

Unfortunately, my management does not like using a work-around and
really wants to operate on the larger data sets.  I was told to find
a way to make it operate on the larger data sets, or to avoid using R
and find another solution.

From previous programming projects in a different scientific field
long ago, I recall making the trade-off of using temp files instead
of holding data in memory in order to make working with larger data
sets possible.  I am wondering if something like that would be
possible for this situation, but I don't have enough knowledge at
this moment to make this decision.

Karen
---
Karen M. Green, Ph.D.
[EMAIL PROTECTED]
Research Investigator
Drug Design Group
Sanofi Aventis Pharmaceuticals
Tucson, AZ 85737

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Q: R 2.2.1: Memory Management Issues?
On Jan 5, 2006, at 7:33 PM, <[EMAIL PROTECTED]> wrote:

> The empirically derived limit on my machine (under R 1.9.1) was
> approximately 7500 data points.  [...]
>
> Unfortunately, my management does not like using a work-around and
> really wants to operate on the larger data sets.  I was told to find
> a way to make it operate on the larger data sets, or to avoid using
> R and find another solution.

Well, sure, if your only concern is the memory, then moving to unix
will give you several hundred more data points you can use.  I would
recommend a 64-bit unix preferably, because then there is practically
no software limit on the size of virtual memory.  Nevertheless there
is still a limit of ca. 4GB for a single vector, so that should give
you around 32500 rows that mclust can handle as-is (I don't want to
see the runtime, though ;)).  For anything else you'll really have to
think about another approach.

Cheers,
Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
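The 32500 figure follows from the same pair-count arithmetic, under
the assumption (as in the earlier check) that the large vector holds
n(n-1)/2 doubles and a single vector is capped at about 4 GB:

  \frac{n(n-1)}{2} \times 8 \,\text{bytes} \le 2^{32}
    \;\Longrightarrow\; n(n-1) \le 2^{30}
    \;\Longrightarrow\; n \lesssim 2^{15} = 32\,768

which is roughly the "around 32500 rows" quoted above.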
Re: [Rd] Pb with agrep()
Martin Maechler wrote:
> If you specify max$all as "100%", i.e., as 0.99 ('< 1'!), everything
> works as you expect it:
>
> agrep(pattern, subject, max = list(ins=0, del=0, sub=2, all=0.99))
> ## --> 2 3 4

OK, I got it!  Thanks for the explanation.

Cheers,
Hervé

--
Hervé Pagès
E-mail: [EMAIL PROTECTED]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel