Re: [Rd] Pb with agrep()

2006-01-05 Thread Martin Maechler
> "Herve" == Herve Pages <[EMAIL PROTECTED]>
> on Wed, 04 Jan 2006 17:29:35 -0800 writes:

Herve> Happy new year everybody,
Herve> I'm getting the following while trying to use the agrep() function:

>> pattern <- "XXX"
>> subject <- c("oo", "oooXooo", "oooXXooo", "oooXXXooo")
>> max <- list(ins=0, del=0, sub=0) # I want exact matches only
>> agrep(pattern, subject, max=max)
Herve> [1] 4

Herve> OK

>> max$sub <- 1 # One allowed substitution
>> agrep(pattern, subject, max=max)
Herve> [1] 3 4

Herve> OK

>> max$sub <- 2 # Two allowed substitutions
>> agrep(pattern, subject, max=max)
Herve> [1] 3 4

Herve> Wrong!

No. 
You have overlooked the fact that 'max.distance = 0.1' (10%) 
*remains* the default, even when 'max.distance' is specified as
a list as in your example [from  "?agrep" ] :

>> max.distance: Maximum distance allowed for a match.  Expressed either
>>   as integer, or as a fraction of the pattern length (will be
>>   replaced by the smallest integer not less than the
>>   corresponding fraction), or a list with possible components
>> 
>>   'all': maximal (overall) distance
>> 
>>   'insertions': maximum number/fraction of insertions
>> 
>>   'deletions': maximum number/fraction of deletions
>> 
>>   'substitutions': maximum number/fraction of substitutions
>> 
>>   If 'all' is missing, it is set to 10%, the other components
>>   default to 'all'.  The component names can be abbreviated. 

If you specify max$all as "100%", i.e., as 0.  ('< 1' !), everything works
as you expect:

agrep(pattern, subject, max = list(ins=0, del=0, sub= 2, all = 0.))
## --> 2 3 4


Martin Maechler, ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Using gcc4 visibility features

2006-01-05 Thread Prof Brian Ripley
R-devel now makes use of gcc4's visibility features: for an in-depth 
account see

http://people.redhat.com/drepper/dsohowto.pdf

(and note there are older versions of that document around).

Consider for example stats.so.  On a gcc4 Linux system this has just three
entry points

gannet% nm -g stats.so | grep " T "
2720 T R_init_stats
0004a544 T _fini
1f28 T _init

since the only entry point we need is the symbol registration.  This 
results in a smaller DSO and a faster load.  It is only worth doing for
shared objects with many entry points, but this one had 262.

It also gives another reason for the registration of symbols, as this is 
the only way I know to hide Fortran entry points (except to hide them all, 
which will hide them from .Fortran).  Until recently registration was used 
in the standard packages and a handful of others (not including any 
recommended packages).  You can copy the way it is done in package stats 
(see PKG_* in Makefile.in and init.c).
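
For reference, a minimal init.c along these lines might look as follows (the
package name "mypkg" and the routine "mywork" are hypothetical placeholders,
not taken from stats; the visibility flags themselves are presumably set on
the compiler side via the PKG_* settings mentioned above):

#include <R.h>
#include <R_ext/Rdynload.h>

/* hypothetical .C entry point of the package */
void mywork(double *x, int *n);

static const R_CMethodDef CEntries[] = {
    {"mywork", (DL_FUNC) &mywork, 2},
    {NULL, NULL, 0}
};

/* R calls R_init_<pkgname> automatically when the package DSO is loaded */
void R_init_mypkg(DllInfo *dll)
{
    R_registerRoutines(dll, CEntries, NULL, NULL, NULL);
}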

The next step will be to prune libR.so down to something close to the 
documented API (it currently has 1816 exported symbols).

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK    Fax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using STL containers in R/C++

2006-01-05 Thread Dirk Eddelbuettel
Andrew Finley  umn.edu> writes:
> I am in the process of writing an R extension in c++ and am using several
> STL containers (e.g., vector, map, multimap<..., double>).  I make sure to
> clear all these containers at the end of the .Call.  Everything compiles and
> runs just fine, but I'm a bit worried
> since I haven't found any other packages that use STL.

RQuantLib, a wrapper to the QuantLib libraries, has indirect exposure to Boost.
[ QuantLib uses Boost smart pointers, and unit testing. ]  However, I have kept 
the actual R-to-QuantLib interface very 'plain C' to keep it simple and sane. 

Dominick Samperi wrote an Rcpp.{hpp,cpp} class for the C++-to-R interface that
is used in RQuantLib. Dominick was musing about releasing this stand-alone to
CRAN as well, but I don't think it has happened.

> So, my question: is it OK to use the STL containers in my code?  Will the

In my book, yes.  It would be nice to have a few good, documented examples.

> use of this library somehow limit portability of this package? 

I don't see why. Effectively, R goes where gcc/g++ go, so you should be fine as
long as you stay within the bounds of g++. It will create an external
dependency, and those do confuse users from time to time (cf. the r-help
archives).

> Or cause memory management problems?

If you have a bug, yes. If you don't, you don't. The R Extensions manual has a 
few tips on the matter.

Cheers, Dirk

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using gcc4 visibility features

2006-01-05 Thread elijah wright

> Subject: [Rd] Using gcc4 visibility features
> 
> R-devel now makes use of gcc4's visibility features: for an in-depth 
> account see
>
> http://people.redhat.com/drepper/dsohowto.pdf


does this mean that we now have a dependency on gcc4, or just that it 
"can" use the feature of gcc4?

clarification, please.

thanks,

--elijah

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using STL containers in R/C++

2006-01-05 Thread Seth Falcon
You might want to take a look at the Bioconductor package RBGL which
provides an R interface to the BGL which is C++ STL heavy, I believe.

+ seth

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problems with calloc function.

2006-01-05 Thread Hin-Tak Leung
Hi Marcelo,

You need to read the R extension manual more carefully...
Basically you haven't dereferenced the pointers. You think you
were allocating, say, col = 2 ints, but instead you were allocating
&col ints (the address of col), which is a large number since
user-land memory addresses start at a large offset (0x40000000 =
1 GB or 0x80000000 = 2 GB?),
so after 4 large allocations, you run out of memory.

Everything through the .C interface is passed by pointer, as in the
Fortran convention.
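
As a minimal illustration of that convention (the names are hypothetical, not
Marcelo's code; it also uses Rprintf(), as suggested below):

#include <R.h>

/* every argument of a .C routine arrives as a pointer to the data */
void scale_vec(double *x, int *n, double *factor)
{
    int i;
    Rprintf("scaling %d values by %f\n", *n, *factor);  /* note *n, not n */
    for (i = 0; i < *n; i++)
        x[i] *= *factor;
}

/* called from R as, e.g.:
 *   .C("scale_vec", x = as.double(x), n = as.integer(length(x)),
 *      factor = as.double(2), PACKAGE = "mypkg")
 */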

BTW, you should use Rprintf() instead of printf(). Details below.

Hin-Tak Leung

Marcelo Damasceno wrote:
> Hello all and Prof. Brian Ripley ,
> 
> Sorry about my incautiousness, I use the tips, but is happen same problems.
> Below the R code.
> 
> rspcplot <- function(file1="dg_01.lab.txt"){
>   if(is.null(n))
> stop("You need first run the function cluster")
> file<-"dg_01.lab.txt"
> aux<-file1
> file1<-pmatch(file1,file)
>   if(is.na(file1)){
>   matrix=loadMatrix(file,n)
>   }
>   else{
>   matrix=loadMatrix(aux,n)
>   }
>   matrixc<-correct(matrix)
>   #merge2(matrixc)
>   nrow=nrow(matrixc)
>   ncol=ncol(matrixc)
>   ntemp=getTemp()
>   out <- .C("merge2",matrixc,nrow, ncol,ntemp,outMerge=as.integer
> (0),outHeight=as.integer(0),PACKAGE="rspc")
> ##
> Below the C code.
> ##
> void merge2(int *nmat,int nrow,int ncol, int *ntemp,int ntam, int *out, int
> *height){

Here you should have "*ncol" instead of "ncol" (I am only correcting
this one; you can change the others), like this:

void merge2(int *nmat,int nrow,int *ncol, int *ntemp,int ntam,  int 
*out, int *height){

> int row,col,*temp,i,j,k,n3,tam,x,aux2,n1;
> row = nrow;
> col = ncol;

You should use here:

int col = *ncol;

> 
> int *temp1,*temp2,*temp3,*temp4;
> 
> temp1 = (int*)Calloc(col,int);


inserting here:

Rprintf("I am trying to allocate col = %d\n", col);

would have told you what's wrong with your code...

> printf("OK1 \n");


> temp2 = (int*)Calloc(col,int);
> printf("OK2 \n");
> temp3 = (int *)Calloc(col,int);
> printf("OK3 \n");
> temp4 = (int *)Calloc(col,int);
> if(temp4 == NULL){
> printf("\n\n No Memory4!");
> exit(1);
> }
> printf("OK4\n");
> int *cvector;
> cvector = (int *)Calloc(col,int);
> if(cvector == NULL){
> printf("\n\n No Memory5!");
> exit(1);
> }
> printf("OK5\n");
> tam=ntam;
> ###
> Output of Work Space:
> 
> 
>>rspcplot()
> 
> Read 525 items
> Read 101 items
> OK1
> OK2
> OK3
> OK4
> Error in rspcplot() : Calloc could not allocate (145869080 of 4) memory
> 
> 
> Using the Ruspini data, the values of variables col = 2 and row = 75. I was
> thinking that the number of pointers and space of memory are too big.
> 
> 
> Thanks All !


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] random seed question

2006-01-05 Thread Tib
Greetings,

I am trying to write a C++ subroutine of my random number generator. Based
on tutorial in Writing R Extensions, one should call GetRNGstate() before
and PutRNGstate() after calling  R's random variate generation routines. Now
suppose my function would generate n(n>1) numbers by a loop, do I need to
call these two functions at each iteration? This certainly increases
computation burden (although indiscernible in my case). But if I only call
them once (outside the loop), will the quality of my random numbers be
reduced because of serial correlations in RNG algorithms? I do comparisons
between two methods, no significant difference is found. How do you guys
write RNG functions? Any other specific reasons? Thanks,

--
I am Tib, not Rob.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] random seed question

2006-01-05 Thread Liaw, Andy
Just one call to each, enclosing all the RNG calls, will do, I believe.

Andy

From: Tib
> 
> Greetings,
> 
> I am trying to write a C++ subroutine of my random number 
> generator. Based
> on tutorial in Writing R Extensions, one should call 
> GetRNGstate() before
> and PutRNGstate() after calling  R's random variate 
> generation routines. Now
> suppose my function would generate n(n>1) numbers by a loop, 
> do I need to
> call these two functions at each iteration? This certainly increases
> computation burden (although indiscernible in my case). But 
> if I only call
> them once (outside the loop), will the quality of my random numbers be
> reduced because of serial correlations in RNG algorithms? I 
> do comparisons
> between two methods, no significant difference is found. How 
> do you guys
> write RNG functions? Any other specific reasons? Thanks,
> 
> --
> I am Tib, not Rob.
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using gcc4 visibility features

2006-01-05 Thread Martin Maechler
> "elijah" == elijah wright <[EMAIL PROTECTED]>
> on Thu, 5 Jan 2006 09:13:15 -0600 (CST) writes:

>> Subject: [Rd] Using gcc4 visibility features
>> 
>> R-devel now makes use of gcc4's visibility features: for
>> an in-depth account see
>> 
>> http://people.redhat.com/drepper/dsohowto.pdf


elijah> does this mean that we now have a dependency on
elijah> gcc4, or just that it "can" use the feature of gcc4?

the latter (of course!)

elijah> clarification, please.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problems with calloc function.

2006-01-05 Thread Paul Roebuck
On Tue, 3 Jan 2006, Marcelo Damasceno wrote:

> Sorry about my incautiousness, I use the tips, but is
> happen same problems.

Not really. You definitely skipped over the most important
one - don't terminate the host process.

> if(temp4 == NULL){
> printf("\n\n No Memory4!");
> exit(1);
> }

if (temp4 == NULL) {
    error("memory allocation failed for 'temp4'");
    /*NOTREACHED*/
}

--
SIGSIG -- signature too long (core dumped)

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Using gcc4 visibility features

2006-01-05 Thread elijah wright

>>> R-devel now makes use of gcc4's visibility features: for
>>> an in-depth account see
>
>elijah> does this mean that we now have a dependency on
>elijah> gcc4, or just that it "can" use the feature of gcc4?
>
> the latter (of course!)


that's what i expected, i was just checking :)

--elijah

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] extending lattice to S4 classes

2006-01-05 Thread Deepayan Sarkar
On 10/20/05, ernesto <[EMAIL PROTECTED]> wrote:

[...]

> Hi Deepayan,
>
> I see that there are alternatives, I found one my self that works and
> it's transparent for the user.
>
> I don't want to implement solutions that force the user to use lattice
> methods differently from your implementation.
>
> The cleanest solution as you say is to add a data argument to xyplot but
> I can't do it so I would not propose it.

FYI, I have added the 'data' argument to high level generics in
lattice_0.13-1, available for r-devel (of course this causes the
current version of FLCore to fail).

Deepayan
--
http://www.stat.wisc.edu/~deepayan/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] random seed question

2006-01-05 Thread Duncan Murdoch
On 1/5/2006 11:44 AM, Tib wrote:
> Greetings,
> 
> I am trying to write a C++ subroutine of my random number generator. Based
> on tutorial in Writing R Extensions, one should call GetRNGstate() before
> and PutRNGstate() after calling  R's random variate generation routines. Now
> suppose my function would generate n(n>1) numbers by a loop, do I need to
> call these two functions at each iteration? This certainly increases
> computation burden (although indiscernible in my case). But if I only call
> them once (outside the loop), will the quality of my random numbers be
> reduced because of serial correlations in RNG algorithms? I do comparisons
> between two methods, no significant difference is found. How do you guys
> write RNG functions? Any other specific reasons? Thanks,

Call them once, outside the loop.

What they do is move the RNG state between the R workspace and a place
where the C functions can use it.  Wrapping each call inside the loop with
that pair will just generate a lot of unnecessary moves.  Not doing the
calls at all, or not pairing them up properly, is the dangerous thing that
would damage the random number stream, because the state would not be
updated properly.
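
For illustration, a minimal sketch of that pattern (the routine name is made
up; norm_rand() is one of the C-level generators declared in Rmath.h):

#include <R.h>
#include <Rmath.h>

/* hypothetical .C routine: fill x[0..*n-1] with standard normal draws */
void my_rnorm_vec(double *x, int *n)
{
    int i;
    GetRNGstate();            /* read .Random.seed into the C-level state, once */
    for (i = 0; i < *n; i++)
        x[i] = norm_rand();   /* or unif_rand(), exp_rand(), rnorm(mu, sd), ... */
    PutRNGstate();            /* write the advanced state back to R, once */
}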

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Q: R 2.2.1: Memory Management Issues?

2006-01-05 Thread Karen.Green
Dear Developers:

I have a question about memory management in R 2.2.1 and am wondering if you 
would be kind enough to help me understand what is going on.
(It has been a few years since I have done software development on Windows, so 
I apologize in advance if these are easy questions.)

-
MY SYSTEM
-

I am currently using R (version 2.2.1) on a PC running Windows 2000 (Intel 
Pentium M) that has 785,328 KB (a little over 766 MB) of physical RAM.
The R executable resides on the C drive, which is of NTFS format, says it has 
15.08 GB free space and has recently been defragmented.

The report of that defragmented drive gives:

Volume (C:):
Volume size =   35,083 MB
Cluster size=   512 bytes
Used space  =   19,642 MB
Free space  =   15,440 MB
Percent free space  =   44 %

Volume fragmentation
Total fragmentation =   1 %
File fragmentation  =   2 %
Free space fragmentation=   0 %

File fragmentation
Total files =   121,661
Average file size   =   193 KB
Total fragmented files  =   64
Total excess fragments  =   146
Average fragments per file  =   1.00

Pagefile fragmentation
Pagefile size   =   768 MB
Total fragments =   1

Directory fragmentation
Total directories   =   7,479
Fragmented directories  =   2
Excess directory fragments  =   3

Master File Table (MFT) fragmentation
Total MFT size  =   126 MB
MFT record count=   129,640
Percent MFT in use  =   99 %
Total MFT fragments =   4
--
PROBLEM
-

I am trying to run a R script which makes use of the MCLUST package.
The script can successfully read in the approximately 17000 data points ok, but 
then throws an error:

Error:  cannot allocate vector of size 1115070Kb
In addition:  Warning messages:
1:  Reached total allocation of # Mb:  see help(memory.size)
2:  Reached total allocation of # Mb:  see help(memory.size)
Execution halted

after attempting line:
summary(EMclust(y),y)
which is computationally intensive (performs a "deconvolution" of the data into 
a series of Gaussian peaks)

and where # is either 766Mb or 2048Mb (depending on the max memory size I set).

The call I make is to Rterm.exe (to try to avoid Windows overhead):
"C:\Program Files\R\R-2.2.1\bin\Rterm.exe" --no-save --no-restore --vanilla 
--silent --max-mem-size=766M < 
"C:\Program Files\R\R-2.2.1\dTest.R"

(I have also tried it with 2048M but with same lack of success.)


QUESTIONS  


(1) I had initially thought that Windows 2000 should be able to allocate up to 
about 2 GB memory.  So, why is there a problem to allocate a little over 1GB on 
a defragmented disk with over 15 GB free?  (Is this a pagefile size issue?)

(2) Do you think the origin of the problem is 
(a) the R environment, or 
(b) the function in the MCLUST package using an in-memory instead of an 
on-disk approach?

(3)
(a) If the problem originates in the R environment, would switching to the 
Linux version of R solve the problem? 
(b) If the problem originates in the function in the MCLUST package, whom 
do I need to contact to get more information about re-writing the source code 
to handle large datasets?


Information I have located on overcoming Windows2000 memory allocation limits 
[http://www.rsinc.com/services/techtip.asp?ttid=3346; 
http://www.petri.co.il/pagefile_optimization.htm] does not seem to help me 
understand this any better.

I had initially upgraded to R version 2.2.1 because I had read 
[https://svn.r-project.org/R/trunk/src/gnuwin32/CHANGES/]:

R 2.2.1
===
Using the latest binutils allows us to distribute RGui.exe and Rterm.exe
as large-address-aware (see the rw-FAQ Q2.9).

The maximum C stack size for RGui.exe and Rterm.exe has been increased
to 10Mb (from 2Mb); this is comparable with the default on Linux systems
and may allow some larger programs to run without crashes.  ... 

and also from the Windows FAQ 
[http://cran.r-project.org/bin/windows/base/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021]:

2.9 There seems to be a limit on the memory it uses!
Indeed there is. It is set by the command-line flag --max-mem-size (see How do 
I install R for Windows?) and defaults to the smaller of the amount of physical 
RAM in th

Re: [Rd] Using STL containers in R/C++

2006-01-05 Thread Dominick Samperi
Dirk Eddelbuettel wrote:
> Dominick Samperi wrote a Rcpp.{hpp,cpp} class for C++ to R interface that is 
> used in RQuantLib. Dominick was musing about releasing this stand-alone to 
> CRAN
> as well, but I don't think it has happened.
>   
It just happened. I uploaded Rcpp to CRAN today. The package contains a 
PDF file
Rcpp.pdf that describes the package and the class library.

Dominick

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Q: R 2.2.1: Memory Management Issues?

2006-01-05 Thread Simon Urbanek
Karen,

On Jan 5, 2006, at 5:18 PM, <[EMAIL PROTECTED]>  
<[EMAIL PROTECTED]> wrote:

> I am trying to run a R script which makes use of the MCLUST package.
> The script can successfully read in the approximately 17000 data  
> points ok, but then throws an error:
> 
> Error:  cannot allocate vector of size 1115070Kb

This is 1.1GB of RAM to allocate alone for one vector(!). As you  
stated yourself the total upper limit is 2GB, so you cannot even fit  
two of those in memory anyway - not much you can do with it even if  
it is allocated.

> summary(EMclust(y),y)

I suspect that memory is your least problem. Did you even try to run  
EMclust on a small subsample? I suspect that if you did, you would  
figure out that what you are trying to do is not likely to terminate  
within days...

> (1) I had initially thought that Windows 2000 should be able to  
> allocate up to about 2 GB memory.  So, why is there a problem to  
> allocate a little over 1GB on a defragmented disk with over 15 GB  
> free?  (Is this a pagefile size issue?)

Because that is not the only 1GB vector that is allocated. Your "15GB/
defragmented" figures are irrelevant; if anything, look at how much virtual
memory is set up in your system's preferences.

> (2) Do you think the origin of the problem is
> (a) the R environment, or
> (b) the function in the MCLUST package using an in-memory  
> instead of an on-disk approach?

Well, a toy example of 17000x2 needs 2.3GB and it's unlikely to
terminate anytime soon, so I'd rather call it shooting with the wrong
gun. Maybe you should consider a different approach to your problem;
possibly ask on the BioConductor list, because people there have more
experience with large data, and this is not really a technical
question about R but rather one of how to apply statistical methods.
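
For scale, a back-of-the-envelope under the assumption (not stated in the
thread) that the clustering builds an n x n matrix of doubles:

  17000 * 17000 * 8 bytes  ~  2.3e9 bytes  ~  2.3 GB

so the memory cost grows quadratically in the number of points, not linearly.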

> (3)
> (a) If the problem originates in the R environment, would  
> switching to the Linux version of R solve the problem?

Any reasonable unix will do - technically (64-bit versions  
preferably, but in your case even 32-bit would do). Again, I don't  
think memory is your only problem here, though.

Cheers,
Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Q: R 2.2.1: Memory Management Issues?

2006-01-05 Thread Karen.Green
Dear Simon,

Thank you for taking time to address my questions.

>> summary(EMclust(y),y)
>
>I suspect that memory is your least problem. Did you even try to run  
>EMclust on a small subsample? I suspect that if you did, you would  
>figure out that what you are trying to do is not likely to terminate  
>within days...

The empirically derived limit on my machine (under R 1.9.1) was approximately 
7500 data points.
I have been able to successfully run the script that uses package MCLUST on 
several hundred smaller data sets.

I had even written a work-around for the case of more than 9600 data points
(the limit when using R 2.2.1).  My work-around first orders the points by
their value and then takes a sample (e.g. every other point, or 1 point every
n points) in order to bring the number under 9600.  No problems with the
computations were observed, but you are correct that a deconvolution on that
larger dataset of 9600 takes almost 30 minutes.  However, for our purposes we
do not have many datasets over 9600, so the time is not a major constraint.

Unfortunately, my management does not like using a work-around and really wants 
to operate on the larger data sets.
I was told to find a way to make it operate on the larger data sets or avoid 
using R and find another solution.  

From previous programming projects in a different scientific field long ago, I
recall making a trade-off of using temp files instead of holding data in
memory in order to make working with larger data sets possible.  I am
wondering if something like that would be possible for this situation, but I
don't have enough knowledge at this moment to make this decision.

Karen
---
Karen M. Green, Ph.D.
[EMAIL PROTECTED]
Research Investigator
Drug Design Group
Sanofi Aventis Pharmaceuticals
Tucson, AZ  85737

-Original Message-
From: Simon Urbanek [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 5:13 PM
To: Green, Karen M. PH/US
Cc: R-devel@stat.math.ethz.ch
Subject: Re: [Rd] Q: R 2.2.1: Memory Management Issues?
Importance: High


Karen,

On Jan 5, 2006, at 5:18 PM, <[EMAIL PROTECTED]>  
<[EMAIL PROTECTED]> wrote:

> I am trying to run a R script which makes use of the MCLUST package.
> The script can successfully read in the approximately 17000 data  
> points ok, but then throws an error:
> 
> Error:  cannot allocate vector of size 1115070Kb

This is 1.1GB of RAM to allocate alone for one vector(!). As you  
stated yourself the total upper limit is 2GB, so you cannot even fit  
two of those in memory anyway - not much you can do with it even if  
it is allocated.

> summary(EMclust(y),y)

I suspect that memory is your least problem. Did you even try to run  
EMclust on a small subsample? I suspect that if you did, you would  
figure out that what you are trying to do is not likely to terminate  
within days...

> (1) I had initially thought that Windows 2000 should be able to  
> allocate up to about 2 GB memory.  So, why is there a problem to  
> allocate a little over 1GB on a defragmented disk with over 15 GB  
> free?  (Is this a pagefile size issue?)

Because that is not the only 1GB vector that is allocated. Your "15GB/
defragmented" figures are irrelevant; if anything, look at how much virtual
memory is set up in your system's preferences.

> (2) Do you think the origin of the problem is
> (a) the R environment, or
> (b) the function in the MCLUST package using an in-memory  
> instead of an on-disk approach?

Well, a toy example of 17000x2 needs 2.3GB and it's unlikely to
terminate anytime soon, so I'd rather call it shooting with the wrong
gun. Maybe you should consider a different approach to your problem;
possibly ask on the BioConductor list, because people there have more
experience with large data, and this is not really a technical
question about R but rather one of how to apply statistical methods.

> (3)
> (a) If the problem originates in the R environment, would  
> switching to the Linux version of R solve the problem?

Any reasonable unix will do - technically (64-bit versions  
preferably, but in your case even 32-bit would do). Again, I don't  
think memory is your only problem here, though.

Cheers,
Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Q: R 2.2.1: Memory Management Issues?

2006-01-05 Thread Simon Urbanek
On Jan 5, 2006, at 7:33 PM, <[EMAIL PROTECTED]>  
<[EMAIL PROTECTED]> wrote:

> The empirically derived limit on my machine (under R 1.9.1) was  
> approximately 7500 data points.
> I have been able to successfully run the script that uses package  
> MCLUST on several hundred smaller data sets.
>
> I even had written a work-around for the case of greater than 9600  
> data points.  My work-around first orders the
> points by their value then takes a sample (e.g. every other point  
> or 1 point every n points) in order to bring the number under  
> 9600.  No problems with the computations were observed, but you are  
> correct that a deconvolution on that larger dataset of 9600 takes  
> almost 30 minutes.  However, for our purposes, we do not have many  
> datasets over 9600 so the time is not a major constraint.
>
> Unfortunately, my management does not like using a work-around and  
> really wants to operate on the larger data sets.
> I was told to find a way to make it operate on the larger data sets  
> or avoid using R and find another solution.

Well, sure, if your only concern is the memory, then moving to unix
will give you several hundred more data points you can use. I would
recommend a 64-bit unix, preferably, because then there is
practically no software limit on the size of virtual memory.
Nevertheless there is still a limit of ca. 4GB for a single vector,
so that should give you around 32500 rows that mclust can handle
as-is (I don't want to see the runtime, though ;)). For anything else
you'll really have to think about another approach...

Cheers,
Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Pb with agrep()

2006-01-05 Thread Herve Pages
Martin Maechler wrote:

>If you specify max$all as "100%", i.e., as 0.  ('< 1' !), everything works
>as you expect:
>
>agrep(pattern, subject, max = list(ins=0, del=0, sub= 2, all = 0.))
>## --> 2 3 4
>  
>
OK I got it! Thanks for the explanation.
Cheers,

Hervé

-- 

Hervé Pagès
E-mail: [EMAIL PROTECTED]
 Phone: (206) 667-5791
   Fax: (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel