Re: [Rd] SEXP i/o, .Call(), and garbage collection.
On Thursday 01 February 2007 2:01 pm, Hin-Tak Leung wrote:
> One possible reason for such problems is if you copy the pointers
> for, say, attributes, classes, names, rather than duplicating them.
> With very few exceptions, mostly in classes, no two R objects of
> the sort you normally encounter/create/play-with should share *any*
> part of their data-structure. e.g. such a problem can result if you
> assign the row names of the input to the output (even if both have
> the same row names).

Hmm.. I thought that using setAttrib() would automatically increase the
reference count, right ?

In particular, I quite often use "pseudo-factor" string vectors - where
the string objects are passed through a cache and reused when forming a
string vector. The result is a true character() vector, but with
considerable memory savings. The downside is that the R reference count
field is usually saturated.

best

Vladimir Dergachev
Re: [Rd] xlsReadWrite Pro and embedding objects and files in Excel worksheets
On Thursday 08 February 2007 2:09 pm, tshort wrote:
> I don't know of an R package that has a function to encode files as a
> multipart mime, but the link above is a good start.

Tcllib has a mime encoding module - one could load it from within R with
.Tcl("package require mime")

best

Vladimir Dergachev

> - Tom
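For instance, base64-encoding a single file from R via Tcl could look
roughly like this (a sketch only - it assumes tcllib's base64 module is
installed, "report.xls" is a made-up file name, and error handling is
omitted):

library(tcltk)
.Tcl("package require base64")   # base64 module shipped with tcllib
.Tcl("set fd [open report.xls r]; fconfigure $fd -translation binary")
enc <- tclvalue(.Tcl("base64::encode [read $fd]"))   # encoded text, back in R
.Tcl("close $fd")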
Re: [Rd] RODBC problems with unixodbc
On Tuesday 20 February 2007 1:51 pm, Sebastian P. Luque wrote:
> Hi,
>
> I noticed that if a column is named "end" in a data frame (table.df
> below), it leads to errors when trying to sqlSave() it to a postgresql
> connection:
>
> ---<---cut here---start-->---
> con <- odbcConnect("PostgreSQL-DB", uid="user", pwd="password",
>                    case="postgresql")
> R> sqlSave(con, table.df)
> Error in sqlSave(con, table.df) :
>   [RODBC] ERROR: Could not SQLExecDirect
>   42601 7 [unixODBC]Error while executing the query;
>   ERROR: syntax error at or near "end" at character 140
> ---<---cut here---end>---
>
> If I rename the column to something else (e.g. "ending"), this proceeds
> without problems. What could the problem be here? Thanks.

It is likely because "end" is a reserved word in SQL.

best

Vladimir Dergachev

> Cheers,
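Renaming the column before saving is the simplest workaround (a sketch;
PostgreSQL would also accept the identifier if it were double-quoted in
the generated SQL, but sqlSave() composes the query itself):

names(table.df)[names(table.df) == "end"] <- "end_time"   # any non-reserved name
sqlSave(con, table.df)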
[Rd] JIT compiler library
Since this escaped my notice before, I thought it useful to post a link
here - in case you have not seen it either:

   http://www.gnu.org/software/lightning/lightning.html

This is a portable JIT compiler library with a fairly easy syntax (one
syntax - many cpus).

best

Vladimir Dergachev
[Rd] as.Date nuance
Hi,

I have encountered a nuance in as.Date() behaviour that is not altogether
obvious - not sure whether this is intended or not:

> as.Date("2001-01-01error")
[1] "2001-01-01"

I.e. it ignores the rest of the characters. This happens in both the
2.3.1 and 2.4.1 versions.

This also happens with an explicit format specification:

> as.Date("2006-01-01error", format="%Y-%m-%d")
[1] "2006-01-01"

thank you

Vladimir Dergachev
Re: [Rd] as.Date nuance
On Saturday 24 March 2007 6:21 am, Prof Brian Ripley wrote:
> This is how strptime() works: it processes the input to match the format.

Except that the format does not match the string - there are leftover
characters. Even by R's own definition:

> match("a", "ab")
[1] NA

as, of course, is reasonable. Is there some way to make sure there is an
exact match ?

thank you !

Vladimir Dergachev

> On Fri, 23 Mar 2007, Vladimir Dergachev wrote:
> > I have encountered a nuance in as.Date() behaviour that is not
> > altogether obvious - not sure whether this is intended or not:
> >
> > > as.Date("2001-01-01error")
> > [1] "2001-01-01"
> >
> > I.e. it ignores the rest of the characters. This happens both in 2.3.1
> > and 2.4.1 versions.
>
> It has always occurred.
>
> > This also happens with explicit format specification:
> >
> > > as.Date("2006-01-01error", format="%Y-%m-%d")
> > [1] "2006-01-01"
> >
> > thank you
> >
> > Vladimir Dergachev
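One way to enforce an exact match with only base functions (a sketch -
strict.as.Date is a hypothetical name, and note it also rejects inputs
that parse but do not round-trip through format(), e.g. "2006-1-1"):

strict.as.Date <- function(x, format = "%Y-%m-%d") {
    d <- as.Date(x, format = format)
    bad <- !is.na(d) & format(d, format) != x   # leftover or non-canonical text
    d[bad] <- NA
    d
}
strict.as.Date(c("2006-01-01", "2006-01-01error"))
# [1] "2006-01-01" NA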
Re: [Rd] as.Date nuance
On Saturday 24 March 2007 12:12 pm, Gabor Grothendieck wrote:
> It matches in the sense of grep or regexpr
>
> grep("a", "ab") > 0
> regexpr("a", "ab") > 0
>
> Try this:
>
> x <- c("2006-01-01error", "2006-01-01")
> as.Date(x, "%Y-%m-%d") + ifelse(regexpr("^....-..-..$", x) > 0, 0, NA)

Well, still I would have expected as.Date() to do the same thing
as.integer() or as.numeric() do - return NA and produce a warning.

After poking in the code I also noticed that the format guess is done
using the first element only:

> as.Date(c("2006", "2006-01-01"))
Error in fromchar(x) : character string is not in a standard unambiguous format
> as.Date(c("2006-01-01", "2006"))
[1] "2006-01-01" NA

I attached a patch that changes do_strptime to behave like
coerceToInteger - please let me know if it is reasonable, and I'll then
see about getting as.Date() to work correctly..

thank you

Vladimir Dergachev

Index: src/main/datetime.c
===================================================================
--- src/main/datetime.c	(revision 40895)
+++ src/main/datetime.c	(working copy)
@@ -818,9 +818,9 @@
 SEXP attribute_hidden do_strptime(SEXP call, SEXP op, SEXP args, SEXP env)
 {
     SEXP x, sformat, ans, ansnames, klass, stz, tzone;
-    int i, n, m, N, invalid, isgmt = 0, settz = 0;
+    int i, n, m, N, invalid, isgmt = 0, settz = 0, warn = 0;
     struct tm tm, tm2;
-    char *tz = NULL, oldtz[20] = "";
+    char *tz = NULL, oldtz[20] = "", *p;
     double psecs = 0.0;
 
     checkArity(op, args);
@@ -859,10 +859,15 @@
 	tm.tm_year = tm.tm_mon = tm.tm_mday = tm.tm_yday = tm.tm_wday = NA_INTEGER;
 	tm.tm_isdst = -1;
-	invalid = STRING_ELT(x, i%n) == NA_STRING ||
-	    !R_strptime(CHAR(STRING_ELT(x, i%n)),
-			CHAR(STRING_ELT(sformat, i%m)), &tm, &psecs);
+	invalid = STRING_ELT(x, i%n) == NA_STRING;
 	if(!invalid) {
+	    invalid = !(p = R_strptime(CHAR(STRING_ELT(x, i%n)),
+			CHAR(STRING_ELT(sformat, i%m)), &tm, &psecs)) ||
+		(*p);
+	    warn |= invalid;
+	}
+
+	if(!invalid) {
 	    /* Solaris sets missing fields to 0 */
 	    if(tm.tm_mday == 0) tm.tm_mday = NA_INTEGER;
 	    if(tm.tm_mon == NA_INTEGER || tm.tm_mday == NA_INTEGER
@@ -901,6 +906,8 @@
     }
     if(settz) reset_tz(oldtz);
 
+    if(warn) warning(_("NAs introduced by coercion"));
+
     UNPROTECT(3);
     return ans;
 }
Re: [Rd] inline C/C++ in R: question and suggestion
On Tuesday 22 May 2007 3:52 pm, Duncan Murdoch wrote:
> On 5/22/2007 1:59 PM, Oleg Sklyar wrote:
>
> One suggestion that probably doesn't affect your package: It would be
> even nicer if R incorporated something that Duncan Temple Lang suggested
> last year, namely a new kind of quoting that didn't need escapes in the
> string. He suggested borrowing triple quotes from Python; I suggested
> something more like heredocs as in shells or Perl, or like \verb in TeX,
> in case you wanted triple quotes in your C function. It would be nice
> to settle on something, so that instead of

I second that. My favorite implementation of this is in Tcl, where curly
braces {} mean that the text they enclose is unmodified. Since language
constructs using them are normally balanced, this is not an impediment.

One extremely useful application of this (aside from long strings) is
specifying inline data frames - I don't know how to do this otherwise.
I.e. something like:

A <- scan.string({#
Id  Value  Mark
1   a      3
2   b      4
# })

best

Vladimir Dergachev
Re: [Rd] Quoting (was: inline C/C++ in R: question and suggestion)
On Tuesday 22 May 2007 4:58 pm, Duncan Murdoch wrote:
> On 22/05/2007 4:01 PM, Vladimir Dergachev wrote:
> > On Tuesday 22 May 2007 3:52 pm, Duncan Murdoch wrote:
> > > On 5/22/2007 1:59 PM, Oleg Sklyar wrote:
> >
> > I second that. My favorite implementation of this is in Tcl, where
> > curly braces {} mean that the text they enclose is unmodified. Since
> > language constructs using them are normally balanced this is not an
> > impediment.
>
> That wouldn't work in R, because the parser couldn't tell whether
>
> { a }

One easy workaround is to have a string{ ... } construct - it should be
very easy to parse string{ differently from { alone.

> was a block of code or a quoted string.
>
> > One extremely useful application of this (aside from long strings) is
> > specifying inline data frames - I don't know how to do this otherwise.
> >
> > I.e. something like:
> >
> > A <- scan.string({#
> > Id  Value  Mark
> > 1   a      3
> > 2   b      4
> > # })
>
> When your data doesn't contain quote marks, you can just use regular
> quotes to do that. I don't know of a scan.string function, but this
> works:
>
> A <- read.table(textConnection("#
> Id  Value  Mark
> 1   a      3
> 2   b      4
> #"), head = TRUE)

Cool, thank you !

> I think DTL's suggestion would be most useful when putting a lot of code
> in a string, where the escapes make the code harder to read. For
> example, just about any function using a complicated regular expression.

Also anything using .Tcl(). Quotes in a data frame definition are useful
because they could be employed to delimit text fields with spaces in them.

best

Vladimir Dergachev
Re: [Rd] Quoting
On Tuesday 22 May 2007 7:05 pm, Peter Dalgaard wrote:
> Vladimir Dergachev wrote:
> > > I think DTL's suggestion would be most useful when putting a lot of
> > > code in a string, where the escapes make the code harder to read.
> > > For example, just about any function using a complicated regular
> > > expression.
> >
> > Also anything using .Tcl(). Quotes in data frame definition are useful
> > because they could be employed to delimit text fields with spaces in
> > them.
>
> .Tcl() is usually the wrong solution anyway, you really should use tcl()
> unless absolutely necessary.

Actually I could not figure out how to use tcl() - it seems to work only
for calling a single Tcl/Tk command. I mostly use .Tcl() to create guis
along the lines of:

foreach {control desc var value} {
    label "Just a description" title1 0
    entry "Edit some text" text_var {Hello there}
} {
    switch -exact -- $control {
        label {
            label .l$var -text $desc
            grid .l$var - -sticky news
        }
        entry {
            label .l$var -text $desc
            entry .e$var -textvariable $var
            grid .l$var .e$var -sticky news
            global $var
            set $var $value
        }
        # other control types follow
    }
}

this can get pretty versatile and works for plots and other things..

best

Vladimir Dergachev
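For comparison, the tcl()/tk* route Peter refers to builds the same kind
of widget one call at a time; a minimal sketch using only documented
tcltk functions:

library(tcltk)
tt  <- tktoplevel()
lbl <- tklabel(tt, text = "Edit some text")
var <- tclVar("Hello there")              # Tcl variable visible from R
ent <- tkentry(tt, textvariable = var)
tkgrid(lbl, ent, sticky = "news")
tclvalue(var)                             # read the entry's contents back into R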
[Rd] is.finite confusion
I have recently made a silly screwup by applying is.finite() to a
character vector:

> is.finite(c("a", "b"))
[1] FALSE FALSE

This does work with factors of course (as they are integers underneath).

I wonder if a fix could be put in so that it either reports an error when
applied to a character vector - or, perhaps better, acts as is.na()

thank you

Vladimir Dergachev

PS tested on R 2.5.0, 2.3.1
Re: [Rd] is.finite confusion
On Wednesday 23 May 2007 1:29 pm, Prof Brian Ripley wrote:
> No, because it is carefully documented to do this, and people rely on it
> working as documented. (Did you do the homework the posting guide asked
> for?) What harm came out of learning that the values were not finite?

I read the manpage, if that is what you are talking about.

The particular thing I was attempting to do is to convert all entries
that are not values to NULL before storing the result in the database.
From my point of view a string value was perfectly finite, and my code
worked with a data.frame I had because it happened to have factors in it.

Yes, I easily concede that since I know about it now I am not likely to
make the same mistake again. I was just trying (politely) to be of help
to other users.

best

Vladimir Dergachev

> On Wed, 23 May 2007, Vladimir Dergachev wrote:
> > I have recently made a silly screwup by applying is.finite() to a
> > character vector:
> >
> > > is.finite(c("a", "b"))
> > [1] FALSE FALSE
> >
> > This does work with factors of course (as they are integer underneath)
> >
> > I wonder if a fix could be put in so that it either reports an error
> > when applied to a character vector - or, perhaps better, act as
> > is.na()
>
> What way is that? It acts in the same way, as I understand the help
> pages.
>
> > thank you
> >
> > Vladimir Dergachev
> > PS test on R 2.5.0, 2.3.1
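A type-guarded version of that cleanup avoids the trap (a sketch; df and
to.db.na are made-up names, and character/factor columns are left alone
since for them NA is the only "non-value"):

to.db.na <- function(x) {
    if (is.numeric(x)) x[!is.finite(x)] <- NA   # NaN, +/-Inf -> NA (SQL NULL)
    x
}
df[] <- lapply(df, to.db.na)   # column by column, preserving the data.frame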
Re: [Rd] R scripts slowing down after repeated called to compiled code
On Friday 25 May 2007 7:12 pm, Michael Braun wrote:
> Thanks in advance to anyone that might be able to help me with this.
>
> Also, it is not just the compiled call that slows down. EVERYTHING
> slows down, even those that consist only of standard R functions. The
> time for each of these function calls is roughly proportional to the
> time of the .Call to the C function.
>
> Another observation is that when I terminate the algorithm, do a
> rm(list=ls()), and then a gc(), not all of the memory is returned to the
> OS. It is not until I terminate the R session that I get all of the
> memory back. In my C code, I am not doing anything to de-allocate the
> SEXPs I create, relying on the PROTECT/UNPROTECT mechanism instead (is
> this right?).
>
> I spent most of the day thinking I have a memory leak, but that no
> longer appears to be the case. I tried using Rprof(), but that only
> gives me the aggregated relative time spent in each function (more than
> 80% of the time, it's in the .Call).

One possibility is that you are somehow creating a lot of R objects (say
by calling assign() or by missing an UNPROTECT()) and this slows the
garbage collector down. The garbage collector's running time grows with
the number of objects you have - their total size does not have to be
large.

Could you try printing the numbers from a gc() call and checking whether
the number of allocated objects grows a lot ?

best

Vladimir Dergachev

> So I'm stuck. Can anyone help?
>
> Thanks,
>
> Michael
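Something along these lines makes the growth easy to see (a sketch -
"my_entry_point" and "input" stand in for the real .Call entry point and
its arguments):

before <- gc()
for (k in 1:100) ans <- .Call("my_entry_point", input)
after <- gc()
after[, "used"] - before[, "used"]   # Ncells/Vcells still alive after the calls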
Re: [Rd] data messed up by read.table ? (PR#9779)
On Thursday 05 July 2007 7:00:46 am [EMAIL PROTECTED] wrote:
> Full_Name: Joerg Rauh
> Version: 2.5.0
> OS: Windows 2000
> Submission from: (NULL) (84.168.226.163)
>
> Following Michael J. Crawley "Statistical Computing" on page 9 the
> worms.txt is required. After downloading it from the book's supporting
> website, which is http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/
> I visually check the data against the book and they look identical.
> Then I do a read.table as suggested:
> worms<-read.table("C:/Programme/R/R-2.5.0/Data/Worms.txt", header = T).

I see the same effect on 2.5.0 and 2.5.1 running on Linux. However, the
following line reads the data correctly:

read.table('worms.txt', header=TRUE, quote="\"")

Thus the problem is likely caused by single quotes in the Field.Name
column; perhaps the single quote character was added to the list of
default quote characters after the book was released.

best

Vladimir Dergachev
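The default is visible in the function's formals - read.table() treats
both quote characters as quoting unless told otherwise, which is why
restricting it to double quotes sidesteps the apostrophes in the data:

formals(read.table)$quote
# [1] "\"'"     - both double and single quotes quote by default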
Re: [Rd] S4 slot with NA default
On Wednesday 26 March 2008 12:04:11 pm Robin Hankin wrote:
> Hi
>
> How do I specify an S4 class with a slot that is potentially numeric,
> but NA by default? I want the slot to be NA until I calculate its value
> (an expensive operation, not needed for all applications). When its
> value is known, I will create a new object with the correct value
> inserted in the slot.
>
> I want "NA" to signify "not known".
>
> My attempt fails because NA is not numeric:

Try as.numeric(NA) - by default, plain NA is of type "logical".

best

Vladimir Dergachev

> --
> Robin Hankin
> Uncertainty Analyst and Neutral Theorist,
> National Oceanography Centre, Southampton
> European Way, Southampton SO14 3ZH, UK
> tel 023-8059-7743

--
Vladimir Dergachev
RCG Ardis Capital LLC
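Concretely, a sketch of such a class ("Expensive" and its slot name are
made up for illustration):

setClass("Expensive",
         representation(value = "numeric"),
         prototype(value = as.numeric(NA)))   # numeric NA, not logical NA

obj <- new("Expensive")
is.na(obj@value)   # TRUE until the expensive value is filled in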
[Rd] Some R questions
Hi all,

I am working with some large data sets (1-4 GB) and have some questions
that I hope someone can help me with:

1. Is there a way to turn off the garbage collector from within the C
interface ? What I am trying to do is suck data from mysql (using my own
C functions), and I see that allocating each column (with about 1-4
million items) takes between 0.5 and 1 seconds. My first thought was that
it would be nice to turn off the garbage collector, allocate all the
data, copy the values and then turn the garbage collector back on.

2. For creating a STRSXP should I be using mkChar() or mkString() to
create element values ? Is there a way to do it without allocating a cons
cell ? (otherwise a single STRSXP of length 1e6 slows down the garbage
collector)

3. Is the "row.names" attribute required for data frames and, if so, can
I use some other type besides STRSXP ?

4. While poking around to find out why some of my code is excessively
slow I have come upon the definition of `[.data.frame` - the subscription
operator for data frames - which appears to be written in R. I am
wondering whether I am looking at the right place, and whether anyone
would be interested in a piece of C code optimizing it - in particular,
extraction of a single element is quite slow (i.e. calls like T[i, j]).

thank you very much !

Vladimir Dergachev
[Rd] Fwd: Re: Some R questions
A correction to my previous post: after running the examples A[,1] and
A[[1]] repeatedly, the running times decrease, so that eventually A[[1]]
takes 0.025 seconds (according to system.time()) and A[,1] takes 1.8
seconds.

The ratio of the times is still 2-digit but, apparently, the garbage
collector is a good deal faster when memory is already available.

best

Vladimir Dergachev
Re: [Rd] Some R questions
On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> Hi,
>
> Had experience with this on doing SQLiteDF...
>
> On 11/1/06, Vladimir Dergachev <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I am working with some large data sets (1-4 GB) and have some
> > questions that I hope someone can help me with:
> >
> > 1. Is there a way to turn off garbage collector from within C
> > interface ? what I am trying to do is suck data from mysql (using my
> > own C functions) and I see that allocating each column (with about 1-4
> > million items) takes between 0.5 and 1 seconds. My first thought was
> > that it would be nice to turn off the garbage collector, allocate all
> > the data, copy values and then turn the garbage collector back on.
>
> I believe not. FWIW a numeric() vector is a chunk of memory with a
> VECTOR_SEXP header and then your data contiguously allocated. If you
> are desperate enough and assuming the garbage collector is indeed the
> culprit, you may want to implement your own lightweight allocVector
> (the function expanded to by NEW_NUMERIC(), etc.)

Thank you very much for the suggestion !

After looking around in the code I realized that what I really wanted was
R_gc_internal() - as then I can tell the garbage collector in advance
that I will require that much heap, and it does not need to go and
allocate it each time I ask (btw I would have expected it to double the
heap each time it runs out, but this is not what goes on, at least in
R 2.3.1).

After some mucking around, here is a poor man's substitute which might be
useful:

void fault_mem_region(long size)
{
	long chunk;
	int max = (1<<30) / sizeof(int);
	int block_count = 0;
	SEXP block;

	while(size > 0) {
		chunk = size;
		if(chunk > max) chunk = max;

		PROTECT(block = allocVector(INTSXP, chunk));
		block_count++;
		size -= chunk;
	}
	UNPROTECT(block_count);
}

On a 48 column data frame (with 1.2e6 rows) the call

	fault_mem_region(ncol + nrow*11 + ncol*nrow)

shaved 5 seconds off a 33 second running time (which includes running the
mysql query). It is not perfect however, as I could see the last columns
allocating slower than the initial ones.

Also, while looking around in allocVector I saw that after running the
garbage collector it simply calls malloc, and if malloc fails it calls
the garbage collector again. What would be nice is the ability to bypass
that first garbage collector call when allocating large nodes.

> > 2. For creating STRSXP should I be using mkChar() or mkString() to
> > create element values ? Is there a way to do it without allocating a
> > cons cell ? (otherwise a single STRSXP with 1e6 length slows down
> > garbage collector)
>
> A string vector (STRSXP) is composed of CHARSXP's. mkChar makes a
> CHARSXP, and mkString makes a STRSXP with 1 CHARSXP, more like a
> shorthand for
>
> SEXP str = NEW_CHARACTER(1);
> SET_STRING_ELT(str, 0, mkChar("foo"));

Makes sense - thank you !

> > 3. Is "row.names" attribute required for data frames and, if so, can
> > I use some other type besides STRSXP ?
>
> It is required. It can be integers, for 2.4.0+

Great !

> > 4. While poking around to find out why some of my code is
> > excessively slow I have come upon definition of `[.data.frame` -
> > subscription operator for data frames, which appears to be written in
> > R. I am wondering whether I am looking at the right place and whether
> > anyone would be interested in a piece of C code optimizing it - in
> > particular extraction of single element is quite slow (i.e. calls
> > like T[i, j]).
>
> [.data.frame is such a pain to implement because there are just too
> many ways to index a data frame. You may want to do a specialized
> index-er that just considers the index-ing styles you use. But I think
> you are just not vectorizing enough. If you have to access your data
> frames like that then it must be inside some loop, which would kill
> your social life.

Hmm, I thought to implement subscription with integer or logical vectors,
and then some hash-based lookup for column and (possibly) row names.

The slowness manifests itself for vectorized code as well. I believe it
is due to the code mucking about with the row.names attribute, which
introduces a penalty on any [,] operation - a penalty that grows linearly
with the number of rows. Thus for large data frames A[,1] is slower than
A[[1]].

For example, for the data frame I mentioned above, E<-A[[1]] took 0.46
seconds (way too much in my opinion), but E<-A[,1] took 62.45 seconds -
more than a minute, and more than twice the time it took to load the
entire thing into memory. Silly, isn't it ?

Also, there are good reasons to want to add
[Rd] allocVector bug ?
Hi all,

I was looking at the following piece of code in src/main/memory.c,
function allocVector :

    if (size <= NodeClassSize[1]) {
	node_class = 1;
	alloc_size = NodeClassSize[1];
    }
    else {
	node_class = LARGE_NODE_CLASS;
	alloc_size = size;
	for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
	    if (size <= NodeClassSize[i]) {
		node_class = i;
		alloc_size = NodeClassSize[i];
		break;
	    }
	}
    }

It appears that for LARGE_NODE_CLASS the variable alloc_size should not
be size, but something far less, as we are not using the vector heap but
rather calling malloc directly in the code below (and from discussions I
read on this mailing list I think that these two are different - please
let me know if I am wrong).

So when allocating a large vector the garbage collector goes nuts trying
to find all that space, which is not going to be needed after all.

I made an experiment and replaced the line alloc_size=size with
alloc_size=0. R compiled fine (both 2.4.0 and 2.3.1) and passed make
check with no issues (it all printed OK). Furthermore, all allocVector
calls completed in no time and my test case ran very fast (22 seconds, as
opposed to minutes). In addition, attach() was instantaneous, which was
wonderful.

Could anyone with deeper knowledge of R internals comment on whether this
makes any sense ?

thank you very much !

Vladimir Dergachev
Re: [Rd] allocVector bug ?
Hi Luke,

Thank you for the patient reply ! I have looked into the issue a little
deeper, comments below:

On Thursday 02 November 2006 11:26 pm, Luke Tierney wrote:
> On Wed, 1 Nov 2006, Vladimir Dergachev wrote:
> > Hi all,
> >
> > I was looking at the following piece of code in src/main/memory.c,
> > function allocVector :
> >
> >    if (size <= NodeClassSize[1]) {
> >        node_class = 1;
> >        alloc_size = NodeClassSize[1];
> >    }
> >    else {
> >        node_class = LARGE_NODE_CLASS;
> >        alloc_size = size;
> >        for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
> >            if (size <= NodeClassSize[i]) {
> >                node_class = i;
> >                alloc_size = NodeClassSize[i];
> >                break;
> >            }
> >        }
> >    }
> >
> > It appears that for LARGE_NODE_CLASS the variable alloc_size should
> > not be size, but something far less as we are not using vector heap,
> > but rather calling malloc directly in the code below (and from
> > discussions I read on this mailing list I think that these two are
> > different - please let me know if I am wrong).
> >
> > So when allocating a large vector the garbage collector goes nuts
> > trying to find all that space which is not going to be needed after
> > all.
>
> This is as intended, not a bug. The garbage collector does not "go
> nuts" -- it is doing a garbage collection that may release memory in
> advance of making a large allocation. The size of the current
> allocation request is used as part of the process of deciding when to
> satisfy an allocation by malloc (of a single large node or a page) and
> when to first do a gc. It is essential to do this for large
> allocations as well to keep the memory footprint down and help reduce
> fragmentation.

I generally agree with this, however I believe that the current logic
breaks down for large allocation sizes, and my code ends up spending 70%
(and up) of computer time spinning inside the garbage collector (I ran
oprofile to observe what is going on).

I do realize that garbage collection is not an easy problem and that
hardware and software environments change - my desire is simply to have a
version of R that is usable for the problems I am dealing with as, aside
from the slowdown with large vector sizes, I find R a very capable tool.

I would greatly appreciate it if you could comment on the following
observations:

1. The time spent during a single garbage collector run grows with the
number of nodes - from looking at the code I believe it is linear, but I
am not certain.

2. In my case the data.frame contains a few string vectors. These
allocate lots of CHARSXPs, which are the main cause of the slowdown of
each garbage collector run. Would you have any suggestions on optimizing
this particular situation ?

3. Any time a data.frame is created, or one performs an attach()
operation, there is a series of allocations - and if one of them causes
memory to expand, all the rest will too.

I put in an fprintf() statement to show alloc_size, VHEAP_FREE and
R_VSize when allocVector is called (this is done only for node_class ==
LARGE_NODE_CLASS). The first output snippet is from the time the script
starts and tries to create the data.frame:

alloc_size=128 VHEAP_FREE=604182 R_VSize=786432
alloc_size=88 VHEAP_FREE=660051 R_VSize=786432
alloc_size=88 VHEAP_FREE=659963 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=659874 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=4465461
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=8544282
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=12623103
...
alloc_size=4078820 VHEAP_FREE=260677 R_VSize=271628325
alloc_size=4078820 VHEAP_FREE=260677 R_VSize=275707147

As you can see, VHEAP_FREE() stays far below alloc_size, so every large
allocation triggers a gc. Next, attach(B):

alloc_size=4078820 VHEAP_FREE=1274112 R_VSize=294022636
alloc_size=4078820 VHEAP_FREE=499351 R_VSize=297325768
...
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=568670030
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=572748850
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=576827670
alloc_size=88 VHEAP_FREE=602082 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601915 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601798 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601678 R_VSize=580906490
...
alloc_size=44 VHEAP_FREE=591581 R_VSize=580906490
alloc_size=88 VHEAP_FREE=591323 R_VSize=580906490
alloc_size=44 VHEAP_FREE=591220 R_VSize=580906490

So we have the same behaviour as before - the garbage collector gets run
every time attach creates a new large vector, but functions perfectly for
smaller vector sizes.

Next, I did detach(B) (which freed up memory) followed by "F<-B[,1]":

alloc_size=113 VHEAP_FREE=588448 R_VSize=580906490
alloc_size=618 VHEAP_FREE=588335 R_VSize=580906490
alloc_size=618 VHEAP_FREE=587717 R_VSize=
Re: [Rd] gc()$Vcells < 0 (PR#9345)
On Monday 06 November 2006 6:12 pm, [EMAIL PROTECTED] wrote:
> version.string Version 2.3.0 (2006-04-24)
>
> > x<-matrix(nrow=44000,ncol=48000)
> > y<-matrix(nrow=44000,ncol=48000)
> > z<-matrix(nrow=44000,ncol=48000)
> > gc()
>
>               used    (Mb) gc trigger    (Mb) max used    (Mb)
> Ncells      177801     9.5     407500    21.8            3518.7
> Vcells -1126881981 24170.6         NA 24173.4       NA 24170.6

This happens to me with versions 2.4.0 and 2.3.1. The culprit is this
line in src/main/memory.c:

    INTEGER(value)[1] = R_VSize - VHEAP_FREE();

Since the amount used is greater than 4G and INTEGER is 32 bits long
(even on 64 bit machines), this returns (harmless) nonsense. The megabyte
value nearby is correct, and the gc trigger and max used fields are
marked as NA already.

best

Vladimir Dergachev
[Rd] data frame subscription operator
Hi all,

I was looking at the data frame subscription operator (attached at the
end of this e-mail) and got puzzled by the following line:

    class(x) <- attr(x, "row.names") <- NULL

This appears to set the class and row.names attributes of the incoming
data frame to NULL. So far I was not able to figure out why this is
necessary - could anyone help ?

The reason I am looking at it is that changing attributes forces
duplication of the data frame, and this is the largest cause of slowness
of data.frames in general.

thank you very much !

Vladimir Dergachev

> `[.data.frame`
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1)
{
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    if (Narg < 3) {
        if (!mdrop)
            warning("drop argument will be ignored")
        if (missing(i))
            return(x)
        if (is.matrix(i))
            return(as.matrix(x)[i])
        y <- NextMethod("[")
        nm <- names(y)
        if (!is.null(nm) && any(is.na(nm)))
            stop("undefined columns selected")
        if (any(duplicated(nm)))
            names(y) <- make.unique(nm)
        return(structure(y, class = oldClass(x), row.names = attr(x,
            "row.names")))
    }
    rows <- attr(x, "row.names")
    cols <- names(x)
    cl <- oldClass(x)
    class(x) <- attr(x, "row.names") <- NULL
    if (missing(i)) {
        if (!missing(j))
            x <- x[j]
        cols <- names(x)
        if (any(is.na(cols)))
            stop("undefined columns selected")
    }
    else {
        if (is.character(i))
            i <- pmatch(i, as.character(rows), duplicates.ok = TRUE)
        rows <- rows[i]
        if (!missing(j)) {
            x <- x[j]
            cols <- names(x)
            if (any(is.na(cols)))
                stop("undefined columns selected")
        }
        for (j in seq_along(x)) {
            xj <- x[[j]]
            x[[j]] <- if (length(dim(xj)) != 2)
                xj[i]
            else xj[i, , drop = FALSE]
        }
    }
    if (drop) {
        drop <- FALSE
        n <- length(x)
        if (n == 1) {
            x <- x[[1]]
            drop <- TRUE
        }
        else if (n > 1) {
            xj <- x[[1]]
            nrow <- if (length(dim(xj)) == 2)
                dim(xj)[1]
            else length(xj)
            if (!mdrop && nrow == 1) {
                drop <- TRUE
                names(x) <- cols
                attr(x, "row.names") <- NULL
            }
        }
    }
    if (!drop) {
        names(x) <- cols
        if (any(is.na(rows) | duplicated(rows))) {
            rows[is.na(rows)] <- "NA"
            rows <- make.unique(rows)
        }
        if (any(duplicated(nm <- names(x))))
            names(x) <- make.unique(nm)
        attr(x, "row.names") <- rows
        class(x) <- cl
    }
    x
}
Re: [Rd] gc()$Vcells < 0 (PR#9345)
On Tuesday 07 November 2006 6:28 am, Prof Brian Ripley wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > On Monday 06 November 2006 6:12 pm, [EMAIL PROTECTED] wrote:
> > > version.string Version 2.3.0 (2006-04-24)
> > >
> > > > x<-matrix(nrow=44000,ncol=48000)
> > > > y<-matrix(nrow=44000,ncol=48000)
> > > > z<-matrix(nrow=44000,ncol=48000)
> > > > gc()
> > >
> > >               used    (Mb) gc trigger    (Mb) max used    (Mb)
> > > Ncells      177801     9.5     407500    21.8            3518.7
> > > Vcells -1126881981 24170.6         NA 24173.4       NA 24170.6
> >
> > Happens to me with versions 2.4.0 and 2.3.1. The culprit is this line
> > in src/main/memory.c:
> >
> >    INTEGER(value)[1] = R_VSize - VHEAP_FREE();
> >
> > Since the amount used is greater than 4G and INTEGER is 32bit long
> > (even on 64 bit machines) this returns (harmless) nonsense.
>
> That's not quite correct. The units here are Vcells (8 bytes), and
> integer() is signed, so this can happen only if more than 16Gb of heap
> is allocated.

I see - thank you for the explanation !

> We are aware that we begin to hit problems at 16Gb: it is for example
> the maximum size of an R vector. Those objects are logical and so about
> 7.8Gb each: their length as vectors is 98% of the maximum possible.
> However, the first time we discussed it we thought it would be about 5
> years before those limits would become important -- I think three of
> those years have since passed.
>
> > The megabyte value nearby is correct and gc trigger and max used
> > fields are marked as NA already.
>
> and now 'used' is also marked as NA in 2.4.0 patched.

Great, thank you !

> This is only a reporting issue. When I first used R it reported only
> numbers, and I added the Mb as a more comprehensible figure (especially
> for Ncells). I think it would be sensible now to only report these
> figures in Mb or Gb (and also the reports for gcinfo(TRUE)).

Why not use KB ? This still preserves information about small
allocations, and raises the limit to 2 TB (2^31 KB) - surely at least 5
years off ! :)

Alternatively, doubles would be able to hold the entire number, but this
would require changes to how the information is displayed.

> The model behind the report actually pre-dates the GC change in 1.2.0.
> The 'Vcells' are nowadays the sum of all the allocations from VECSXPs
> (which include their headers), rather than the 'vector heap' (although
> some of the earlier terminology persists).

I see.

thank you !

Vladimir Dergachev
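For reference, the arithmetic behind those limits (a quick check from R):

.Machine$integer.max                 # 2147483647
.Machine$integer.max * 8 / 2^30      # ~16 GB: max heap countable in 8-byte Vcells
.Machine$integer.max * 1024 / 2^40   # ~2 TB: max heap countable in KB units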
Re: [Rd] variable problem
On Tuesday 07 November 2006 3:28 pm, Tom McCallum wrote:
> Hi everyone,

Hi Tom,

Would this snippet work:

for(i in 1:length(mylist)) do.call(f, mylist[i])

On the other hand, it is not easy to see why you would want to call the
same function with differently named arguments - perhaps what you are
really trying to do has a different (and better) solution ?

best

Vladimir Dergachev

> I am not sure this is possible so I would be interested in your
> responses. Say I have a variable 'v' with the string "myargument" in it
> and I have a function 'f' that takes this argument as follows;
>
> f <- function( myargument=5 ) {
>     ... does something...
> }
>
> Is there anyway I can say something like;
>
> f( v=10 ) such that it will be evaluated as f( myargument=10 ).
>
> I presume there may be some use of eval and substitute but if someone
> could point me in the right direction then that would be great.
>
> The end idea is to have a list of m items, declared somewhere else,
> which can be evaluated as particular arguments named after their list
> names
>
> e.g
>
> mylist <- list( "a"=1, "b"=2, "c"=3 )
>
> which can be passed to a function taking arguments a, b, or c and it
> will be able to evaluate them accordingly:
>
> long hand this would evaluate to something like
> f( a=mylist[["a"]] );
> f( b=mylist[["b"]] );
> f( c=mylist[["c"]] );
>
> but I would have actually rewritten something like
> for ( myvar in names( mylist ) ) {
>     f( some_clever_substitution_to_act_as_argument(myvar) =
>        mylist[[ myvar ]] );
> }
>
> I hope I have explained myself clearly enough, if not please say so and
> I will try and give a better example.
>
> Many thanks for your help
>
> Tom
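The trick is that single-bracket indexing keeps the element's name, so
do.call() sees a named argument list of length one. For example:

f <- function(a = 1, b = 2, c = 3) c(a = a, b = b, c = c)
mylist <- list(a = 10, b = 20, c = 30)

do.call(f, mylist["b"])   # mylist["b"] is list(b = 20), so this runs f(b = 20)
#  a  b  c
#  1 20  3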
Re: [Rd] data frame subscription operator
On Wednesday 08 November 2006 3:21 am, Prof Brian Ripley wrote:
> > So far I was not able to figure out why this is necessary -
> > could anyone help ?
>
> You need to remove the class to avoid recursion: a few lines later x[i]
> needs to be a call to the primitive and not the data frame method.

I see. Is there a way to get at the primitive directly, i.e. something
like `[.list`(x, i) ?

> > The reason I am looking at it is that changing attributes forces
> > duplication of the data frame and this is the largest cause of
> > slowness of data.frames in general.
>
> Do you have evidence of that? R has facilities to profile its code, and
> I have never seen [.data.frame taking a significant proportion of the
> total time. If it does for your application, consider if a data frame
> is an appropriate way to store your data. I am not sure we would accept
> that data frames do have 'slowness in general', but their generality
> does make them slower than alternatives where the generality is not
> needed.

Evidence:

# this can be copy'n'pasted directly into an R session

# small N - both system calls return small, but comparable running times
N<-10
A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
system.time(B<-A[,1])
system.time(B<-A[1,1])

# larger N - both times are larger and still comparable
N<-100
A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
system.time(B<-A[,1])
system.time(B<-A[1,1])

The running times would also grow with the number of columns.

Also, I have modified the 2.4.0 version of R to print out large
allocations, and I get the impression that the data frame is being
duplicated. The same happens for `[<-.data.frame` - but that function has
much more complex code, and I have not looked through it yet.

Of course, getting a small portion (i.e. A[1:5,]) also takes a lot of
time - but the examples shown above should be O(1).

My data is the result of a data base query - it naturally has columns of
different types and the columns are named (no row.names though) - which
is why I used data.frames. What would you suggest ?

thank you very much !

Vladimir Dergachev
Re: [Rd] allocVector bug ?
On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > Hi Luke,
> >
> > I generally agree with this, however I believe that current logic
> > breaks down for large allocation sizes and my code ends up spending
> > 70% (and up) of computer time spinning inside garbage collector (I run
> > oprofile to observe what is going on).
>
> Again please be careful about these sorts of statements. I am sure
> there are bugs in the memory manager and places where things "break
> down" but this isn't one of them. The memory manager is quite
> deliberately biased towards keeping the total allocation low, if
> necessary at the expense of some extra gc overhead. This is needed if
> we want to use the same settings across a wide range of
> configurations, some of which have relatively little memory available
> (think student labs). The memory manager does try to learn about the
> needs of a session, and as a result triggering values get adjusted. It
> is not true that every large allocation causes a gc. This may be true
> _initially_, but once total memory usage stabilizes at a particular
> level it is no longer true (look at the way the heap limits are
> adjusted).
>
> This approach of adjusting based on usage within a session is
> reasonable and works well for longer sessions. It may not work well
> for short scripts that need large allocations. I doubt that any
> automated setting can work well in that situation while at the same
> time keeping memory usage in other settings low. So it may be useful
> to find ways of specifying a collection strategy appropriate for these
> situations. If you can send me a simplified version of your usage
> scenario then I will give this some thought and see if we can come up
> with some reasonable ways of allowing user code to tweak gc behavior
> for these situations.

Hi Luke,

Yes, I gladly concede the point that for a heuristic algorithm the notion
of what is a "bug" is murky (besides crashes, etc, which is not what I am
talking about).

Here is why I called this a bug:

1. My understanding is that each time gc() needs to increase memory it
performs a full garbage collection run. Right ?

2. This is not a problem with small memory sizes, as they imply
(presumably) small numbers of objects.

3. However, if one wants to allocate many objects (say columns in a data
frame, or just vectors) this results in a large penalty.

Example 1: This simulates allocation of a data.frame with some character
columns which are assumed to be factors. On my system the first
assignment is nearly instantaneous, while subsequent assignments take
slightly less than 0.1 seconds each.

L<-list()
Chars<-as.character(1:10)
for(i in 1:100) L[[i]]<-system.time(assign(paste("test", i), 1:100))
Times<-do.call(rbind, L)

Example 2: Same as example 1, but we first grow the memory with a fake
allocation:

L<-list()
Chars<-as.character(1:10)
Data<-1:1
rm(Data)
for(i in 1:100) L[[i]]<-system.time(assign(paste("test", i), 1:100))
Times<-do.call(rbind, L)

In this case the first 20 or so allocations are very quick (faster than
0.02 sec) and then the garbage collector kicks in and the time rises to
0.08 seconds each - still less than in Example 1.

This example is relevant because this sequence of allocations is exactly
what happens when one uses read.table or scan (or a database query) to
load data. What is more, if the user then manipulates the loaded data by
creating columns that are combinations of existing ones, then this is
very slow as well.

I looked more carefully at your code in src/main/memory.c, function
AdjustHeapSize:

    R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
	R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
	if (R_MaxVSize - R_VSize >= change)
	    R_VSize += change;
    }

Could it be that R_NSize should be R_VSize ? This would explain why I see
a problem in the case R_VSize >> R_NSize.

thank you very much !

Vladimir Dergachev
Re: [Rd] data frame subscription operator
On Wednesday 08 November 2006 11:41 am, Gabor Grothendieck wrote:
> .subset and .subset2 are equivalent to [ and [[ except that
> dispatch does not take place. See ?.subset

Thank you Gabor !

I made an experiment and got rid of

    class(x) <- attr(x, "row.names") <- NULL

while replacing all occurrences of x[ and x[[ with .subset and .subset2.

Results:

X<-A[,1] is now instantaneous, as it should be.

X<-A[1,1] is faster for data frames with many columns, but still appears
to make a copy of A[,1] before indexing. Not sure why..

thank you

Vladimir Dergachev
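For anyone trying this at the prompt, the dispatch-free accessors behave
like this (both are standard base functions):

A <- data.frame(X = 1:3, Y = c("a", "b", "c"))
.subset2(A, 1)   # column 1 as a vector, like A[[1]] but with no method dispatch
.subset(A, 2)    # a plain list holding column 2, like unclass(A)[2]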
Re: [Rd] allocVector bug ?
On Thursday 09 November 2006 12:21 pm, Luke Tierney wrote:
> On Wed, 8 Nov 2006, Vladimir Dergachev wrote:
> > On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> > > On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> >
> > Hi Luke,
> >
> > Yes, I gladly concede the point that for a heuristic algorithm the
> > notion of what is a "bug" is murky (besides crashes, etc, which is not
> > what I am talking about).
> >
> > Here is why I called this a bug:
> >
> > 1. My understanding is that each time gc() needs to increase memory
> > it performs a full garbage collection run. Right ?
>
> The allocation process does not call gc before every call to malloc.
> It only calls gc if the allocation would cross a threshold level.
> Those threshold levels are adjusted in an effort to compromise between
> keeping memory footprint low and not calling gc too often. The code
> you quote below is part of this adjustment process. If this process
> is working properly then as memory use grows there will initially be
> more gc activity and then less as the thresholds adjust.

Well, I was seeing it call gc for every large vector. This probably
happens only for those larger than R_VGrowIncrFrac * R_NSize. On my
system R_NSize is never more than 1e6, so this would explain the problems
when using 1e6 (and larger) vectors.

> > 2. This is not a problem with small memory sizes as they imply
> > (presumably) small number of objects.
> >
> > 3. However, if one wants to allocate many objects (say columns in a
> > data frame or just vectors) this results in large penalty
> >
> > Example 1: This simulates allocation of a data.frame with some
> > character columns which are assumed to be factors. On my system first
> > assignment is nearly instantaneous, why subsequent assignments take
> > slightly less than 0.1 seconds each.
>
> I'm not sure these are quite doing what you intend. You define Chars
> but don't use it. Also, system.time by default calls gc() before
> doing the evaluation. Giving FALSE as the second argument may give you
> a more realistic picture.

The Chars are defined to create lots of ncells and make the gc() run time
more realistic. It also mimics having a data.frame with a few factor
columns.

As for system.time - thank you, I missed that ! Setting gcFirst=FALSE
changes the behavior in the first example to be 2 times faster, and makes
all the allocations in the second example faster. I guess that extra call
to gc() caused R_VSize to shrink too fast.

> > I looked more carefully at your code in src/main/memory.c, function
> > AdjustHeapSize:
> >
> >    R_VSize = VNeeded;
> >    if (vect_occup > R_VGrowFrac) {
> >        R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
> >        if (R_MaxVSize - R_VSize >= change)
> >            R_VSize += change;
> >    }
> >
> > Could it be that R_NSize should be R_VSize ? This would explain why I
> > see a problem in case R_VSize>>R_NSize.
>
> That does indeed look like a bug and that R_NSize should be R_VSize --
> well spotted, thanks. I will need to experiment with this a bit more
> to see if it can safely be changed. It will increase the memory
> footprint a bit. Probably not by enough to matter, but if it does we
> may need to adjust some of the tuning constants.

Would there be something I can help you with ? Is there a script to run
through common usage patterns ?

thank you !

Vladimir Dergachev

> Best,
>
> luke
Re: [Rd] String to list and vice versa
On Tuesday 14 November 2006 12:00 pm, Tom McCallum wrote:
> Hi,
>
> I need to collapse a list into a string and then reparse it back into
> the list. Normally when I need to do this I simply use write.csv and
> read.csv, but I need to do this in memory within R rather than writing
> out to file. Are there any bespoke commands that anyone knows of that do
> something like this, or any tips that anyone can suggest? I basically
> don't care about the string representation, only that I can manipulate
> the list as a string and then reparse it back to a valid list object.

# List -> string:
#
# Put whatever you want into collapse to separate list entries
#
paste(unlist(L), collapse=",")

# String -> list
strsplit(S, ",")

best

Vladimir Dergachev

> Many thanks for your help,
>
> Tom
Re: [Rd] String to list and vice versa
On Tuesday 14 November 2006 12:28 pm, Prof Brian Ripley wrote:
> This approach won't work in very many cases (but then nor will
> write.csv).
>
> The safest way I know is to use serialize() and unserialize(). Next to
> that, deparse(control="all") and parse(text=) are quite good and give a
> human-readable character representation.
>
> If fidelity is not the main issue, as.character and toString spring to
> mind. unlist is recursive, and is not going to come close to being
> faithful for other than very simple lists. And what if ',' is a
> character in one of the list elements?

Yes, but then one can replace ',' with something rarely used, like \007.
I picked ',' because write.csv/read.csv worked before.

You are right that for storage serialize/unserialize seem best; however,
for manipulation one would usually prefer a well-defined format.

best

Vladimir Dergachev
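A quick illustration of the deparse/parse route mentioned above, which
stays faithful even with embedded commas and quotes:

L  <- list(a = 1:3, b = "x, \"y\"")
s  <- paste(deparse(L, control = "all"), collapse = " ")  # one string
L2 <- eval(parse(text = s))
identical(L, L2)   # TRUE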
[Rd] base-Ex.R make check failure
Hi all,

make check fails for me with the latest SVN code, in file base-Ex.R:

> sw[1,]  # a one-row data frame
Warning in format.data.frame(x, digits = digits, na.encode = FALSE) :
	corrupt data frame: columns will be truncated or padded with NAs
           Fertility Agriculture Examination Education
Courtelary      80.2          17          15        12
> sw[1,, drop=TRUE]  # a list
Warning in format.data.frame(x, digits = digits, na.encode = FALSE) :
	corrupt data frame: columns will be truncated or padded with NAs
           Fertility Agriculture Examination Education
Courtelary      80.2          17          15        12
>
> swiss[ c(1, 1:2), ]  # duplicate row, unique row names are created
Error in `[[<-.data.frame`(`*tmp*`, j, value = c(80.2, 80.2, 83.1)) :
	replacement has 3 rows, data has 47
Execution halted

R-2.4.0 runs through the same test just fine. Does anyone else see the
same thing ?

thank you !

Vladimir Dergachev
[Rd] data frame subset patch
Hi all,

Here is a patch that significantly speeds up the `[.data.frame` operator.
It applies cleanly to both 2.4.0 and svn trunk. Make check was OK for
2.4.0 (for svn trunk it fails even without this patch..).

What it does: we get rid of the class and attr statements that modify the
incoming data frame, and use explicit calls to .subset and .subset2
instead.

Test case:

N<-10
T<-data.frame(a=1:N, b=rnorm(N), c=as.character(round(runif(N)*10)))
system.time({X<-0 ; for(i in 1:1000) X<-X+T[i,2]})

Without the patch the output on my system is:

[1]  8.488  2.436 10.926  0.000  0.000

With this patch the output is:

[1] 1.084 0.624 1.707 0.000 0.000

thank you !

Vladimir Dergachev
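For readers following along, the shape of the change is roughly this (a
sketch, not the patch itself):

## before: stripping attributes to avoid dispatch modifies x,
## which forces a duplication of the whole data frame
cl <- oldClass(x)
class(x) <- attr(x, "row.names") <- NULL
xj <- x[[j]]

## after: dispatch-free accessors read from x without touching it
xj <- .subset2(x, j)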
[Rd] data frame subset patch, take 2
Hi Robert,

Here is the second iteration of the data frame subset patch. It now
passes make check on both 2.4.0 and 2.5.0 (svn as of a few days ago).
Same speedup as before.

Changes:

* Introduced two new functions .subassign2 and .subassign that are
  complementary to .subset2 and .subset.

* Changed the x[[j]]<- assignment to x<-.subassign2(x, j, ..) to fix the
  problem with the previous patch.

thank you !

Vladimir Dergachev
[Rd] empty pages in xyplot (2.4.0)
In 2.4.0 (and SVN) I am seeing xyplot creating empty pages for high page
counts in layout - contrary to the manual, which says high page counts
should not matter. Everything works fine in 2.3.1.

library("lattice")
A<-data.frame(x=1:10, y=sin(1:10), z=round(1:10/3))
xyplot(x~y|z, A, layout=c(1,1,10))

The snippet above produces a valid plot in R 2.3.1, while in 2.4.0 and
later I see a blank page with "x" and "y" letters on it.

Can anyone else reproduce this ?

thank you very much !

Vladimir Dergachev
Re: [Rd] data frame subset patch, take 2
On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote:
> - Vladimir, have you verified your 'take2' against recent versions
>   of R-devel?

Yes.

> - If they still work, could you re-post them to R-devel, this
>   time using a proper MIME type, i.e. most probably one of
>       application/x-tar
>       application/x-compressed-tar
>       application/x-gzip
>
> In case you don't know how to achieve this,
> I'd be interested to get it by "private" e-mail.

No problem. The old e-mail did have a mime type: "text/x-diff". I am
resending the patch - now compressed; hopefully it will get past whatever
filters are in place.

With regard to speedups in R, here is my wish list - I would greatly
appreciate comments on what makes sense here or not, etc:

1. I greatly miss equivalents of the Tcl append and lappend commands -
not the function performed by these commands, but their efficiency (they
are O(1) on average; see the sketch at the end of this message). Tcl
easily handles lists with 1e6 components and strings 10s of megabytes in
length.

2. It would be nice to have true hashed arrays in R (i.e. O(1) access
times). So far I have used named lists for this, but they are O(n):

> L<-list(); system.time(for(i in 1:1)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000
> L<-list(); system.time(for(i in 1:2)L[[paste(i)]]<-i);
[1] 11.789 0.216 12.004 0.000 0.000

3. Efficient manipulation of large numbers of strings. The big reason
character row.names are slow is that they require a large number of
string objects, which slow down the garbage collector. This is possibly
not a problem that has an easy solution; here are a couple of approaches
I have considered:

a) Inline strings - use a structure like

union {
	struct {
		unsigned char size;
		char body[15];
	} inlined_string;   /* use this when size < 16 */
	struct {
		unsigned char flag;
		char reserved[7];   /* for 64 bit */
		CHARSXP ptr;
	} indirect_string;  /* use this when flag == 255 */
}

This basically turns small strings into an enum-like type stored within a
128-bit integer. This would greatly decrease the required number of
CHARSXPs in many common cases (in particular for many rownames). The
biggest disadvantage is more complicated access to string data. Also,
this does not solve the issue of how to deal with strings 1e6 characters
long - though I feel 15 characters should be good enough for most uses.

b) CHARSXPs are always leaf nodes. One could implement true reference
counting and create a separate garbage collector pool for them. This way
one can rely on reference counting to free string objects during normal
operation, but also keep track of the number of referenced strings during
garbage collector passes - and trigger string garbage collection passes
(with a warning) when the number of referenced strings is much smaller
than the number of objects in the string pool. This gets rid of the
overhead that strings impose on the garbage collector.

The disadvantage is very large changes to R code.

best

Vladimir Dergachev

subset.patch.2.diff.gz
Description: GNU Zip compressed data
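An amortized-O(1) append can be approximated in R by growing a list
geometrically and tracking the fill level by hand (a sketch - buf, n and
append1 are made-up names, and R's copy-on-assign semantics still add
some cost on each write):

buf <- vector("list", 16); n <- 0
append1 <- function(buf, n, x) {
    if (n + 1 > length(buf)) length(buf) <- 2 * length(buf)  # double when full
    buf[[n + 1]] <- x
    buf
}
for (x in 1:1000) { buf <- append1(buf, n, x); n <- n + 1 }
result <- buf[seq_len(n)]   # trim the unused tail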
Re: [Rd] data frame subset patch, take 2
On Wednesday 13 December 2006 1:23 pm, Marcus G. Daniels wrote:
> Vladimir Dergachev wrote:
> > 2. It would be nice to have true hashed arrays in R (i.e. O(1) access
> > times). So far I have used named lists for this, but they are O(n):
>
> new.env(hash=TRUE) with get/assign/exists works ok. But I suspect it's
> just that named lists are too easy to use, and that has bad performance
> ramifications for user code (perhaps the R developers are more vigilant
> about this in the R code itself).

Cool, thank you !

I wonder whether environments could be extended to allow names() to work
(although I see that ls() does the same thing) and to allow for(i in E)
loops.

thank you

Vladimir Dergachev
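For reference, the environment-as-hash idiom looks like this (all
standard base functions):

E <- new.env(hash = TRUE)
assign("key1", 42, envir = E)
exists("key1", envir = E)    # TRUE - O(1) lookup
get("key1", envir = E)       # 42
for (nm in ls(E))            # ls() stands in for names()
    print(get(nm, envir = E))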
Re: [Rd] data frame subset patch, take 2
On Saturday 16 December 2006 4:41 pm, Martin Maechler wrote:
> Correction: the problems show on both platforms;
>
> one is in mgcv, gam(), an error in [[<- -- pretty clearly linked to
> your changes, but not reproducible when tried in isolation
> interactively,
>
> the other one is a seg.fault "memory not mapped" when running the
> example
>
> > nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
> +      fit.pred=new.fit, x.pred=new.data)
>
> MM> My guess: typically when dealing with model.frames (which
> MM> internally are "just" data frames with a particular "terms"
> MM> attribute) but the problems are not reproducible when run
> MM> interactively. It may really be that .subset() and .subset2() are
> MM> sometimes used in cases they should not be in your new code; or
> MM> they even have a bug that is not triggered unless by using them in
> MM> the new context of [.data.frame
>
> MM> So I'm sorry, but we might have to wait for a "take 3"
> MM> or rather try to find the problem with your patch.
> MM> Maybe you can try yourself?

Hi Martin,

thank you very much for the feedback ! Of course, there is going to be a
take 3 :)

I have reproduced your tests with slightly different results:
boot.Rcheck fails, stats.Rcheck segfaults, cluster.Rcheck fails.

More importantly, I was able to reproduce the problem interactively with
boot.Rcheck. When interactive, I found that the issue has random outcomes
- sometimes it segfaults and sometimes it produces this:

1) boot.Rcheck fails with

> nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
+      fit.pred=new.fit, x.pred=new.data)
Error: incompatible types (from NULL to list) in [[ assignment
Execution halted

but does not segfault. Other times it errors out in different places or
goes through fine. On one occasion I observed a very interesting
behaviour - the R console looked like it was completely confused about
which functions were being called and about the arguments passed to
them.

After some tinkering, I realized that, perhaps, the problem was with me
adding the .subassign and .subassign2 functions and this somehow
interfering with saved workspaces. So I did make clean (after updating
SVN) and the problem appears to be gone.

Could you try doing make clean && make on your installation and reporting
the results ?

thank you very much !

Vladimir Dergachev
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Wednesday 03 January 2007 3:47 am, Erik van Zijst wrote:
> Hi All,
> My problem is about parallel execution of R-scripts. My platform is linux.
> A program that is written in C needs to execute multiple R-scripts
> simultaneously. The C program makes use of multi-threading. Each thread
> must initiate the execution of one script. Performance is very important.
> Apparently the R C-API does not provide a mechanism for parallel
> execution.
> It is preferred that the solution is not based on multi-processing (like
> C/S), because that would introduce IPC overhead.

One thing to keep in mind is that IPC is very fast on Linux, so unless you are making lots of calls to really tiny functions this should not be an issue. What can be an issue is the overhead of starting a new R process. In that case you can create helper processes that do the same work you wanted from multiple threads, and simply pass the data around.

best

Vladimir Dergachev

> Hopefully some thread-safe (single-process) solution is readily
> available, written in C.
> What is the best solution to do this?
> (If there is no single-process solution, what is the alternative?)
> Regards,
> Erik.
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
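A minimal sketch of the helper-process idea on a Unix-like system: keep one long-lived slave R process and feed it work through a pipe, avoiding repeated R start-up costs (the file name helper.out is made up for the example):

    helper <- pipe("R --vanilla --slave > helper.out 2>&1", open = "w")
    writeLines("x <- rnorm(1e6); cat(mean(x), fill = TRUE)", helper)
    writeLines("y <- rnorm(1e6); cat(sd(y), fill = TRUE)", helper)
    close(helper)             # flushes the commands; helper evaluates and exits
    readLines("helper.out")   # collect the printed results

A real system would keep the helper alive and read results back through a second channel (a fifo or socket) as they are produced; the one-way pipe above is just the simplest runnable form.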
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Thursday 04 January 2007 4:54 am, Erik van Zijst wrote:
> Vladimir Dergachev wrote:
> > On Wednesday 03 January 2007 3:47 am, Erik van Zijst wrote:
> >> Apparently the R C-API does not provide a mechanism for parallel
> >> execution.
> >> It is preferred that the solution is not based on multi-processing
> >> (like C/S), because that would introduce IPC overhead.
> > One thing to keep in mind is that IPC is very fast on Linux. So unless
> > you are making lots of calls to really tiny functions this should not
> > be an issue.
> Using pipes or shared memory to pass things around to other processes on
> the same box is very fast indeed, but if we base our design around
> something like RServe which uses TCP it could be significantly slower.
> Our R-based system will be running scripts in response to high-volume
> real-time stock exchange data, so we expect lots of calls to many tiny
> functions indeed.

Very interesting :)

If you are running RServe on another box you will need to send the data over Ethernet anyway (and will probably use TCP). If it is on the same box and you use "localhost", the packets go over the loopback interface - which is significantly faster.

At some point (years ago) there was even an argument on some mailing list (xfree86-devel ?) about whether the X server should support shared memory, as the unix socket was "fast enough" - with the other side arguing that when you pass megabyte images around (as in DVD playback) there is non-negligible overhead.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
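To put a number on the loopback overhead, a hedged sketch using the Rserve client functions (RSconnect, RSeval, RSclose); it assumes an Rserve daemon already listening on the default port 6311, and the exact RSeval call form may differ between Rserve versions:

    library(Rserve)
    conn <- RSconnect(host = "localhost", port = 6311)
    # 1000 round trips of a trivial expression: with an evaluation this
    # cheap, nearly all of the measured time is transport overhead
    st <- system.time(for (i in 1:1000) RSeval(conn, "1 + 1"))
    print(st[3] / 1000)   # elapsed seconds per call over loopback
    RSclose(conn)

Comparing this per-call cost against the evaluation time of the real functions gives the evaluation-time versus overhead ratio directly.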
Re: [Rd] help for memory problem with 64-bit machines
On Friday 05 January 2007 12:10 pm, Peter Dalgaard wrote:
> Hin-Tak Leung wrote:
> > I got the same error with 64-bit R 2.4.1 on FC6 x86_64, and 32-bit
> > R 2.4.1 on the same machine is okay. There is definitely something wrong
> > with your code.
> > I would suggest fixing all the compiler warnings - there are piles of
> > them about uninitialized variables, and about doing comparison
> > between signed and unsigned expressions, etc first. Put -Wall in
> > CFLAGS, CXXFLAGS and FFLAGS and you'll see.

Also, the issue I most commonly see is the difference in size of the "long" data type. On 32-bit platforms sizeof(long) == sizeof(int) == 4 bytes, but on 64-bit platforms sizeof(long) == 8 while sizeof(int) == 4. This breaks the formerly safe practice of using long to get a 32-bit integer (which also ensured the code compiled correctly on 16-bit machines).

best

Vladimir Dergachev

> > good luck.
> > Hin-Tak Leung
> Good advice. Also, the most common culprit for 64/32 problems is pointers
> stored as integers, so watch out for any of those. And notice that you
> can set a breakpoint at randsk1_ and start poking around to see what is
> inside various variables, and single-step to the point of the crash (it's
> a bit painful and confusing in Fortran code, though.)
> -pd
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
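From within R one can check the relevant sizes directly, which is a quick way to tell whether a build is affected:

    .Machine$sizeof.long      # 4 on a 32-bit build, 8 on a 64-bit build
    .Machine$sizeof.pointer   # likewise 4 vs 8 - pointers stored in ints break here
    # sizeof(int) stays 4 on both, hence the long/int mismatch on 64-bit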
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Monday 08 January 2007 6:36 am, Hin-Tak Leung wrote:
> Erik van Zijst wrote:
> > Vladimir Dergachev wrote:
> >> At some point (years ago) there was even an argument on some mailing
> >> list (xfree86-devel ?) about whether Xserver should support shared
> >> memory as unix socket was "fast enough" - with the other side arguing
> >> that when you pass megabyte images around (as in DVD playback) there is
> >> non-negligible overhead.
> > We're currently doing performance tests with the RServe-approach where
> > we measure the actual evaluation time of a function. I'm interested in
> > the evaluation-time versus overhead ratio. Loopback TCP might work as
> > long as this ratio is sufficiently high.
> Slightly off-topic, but Vladimir sounded as if there were still an argument
> about supporting shared memory in X... AFAIK, the shared memory extension
> *is* part of Xorg!
> $ grep 'MIT-SHM' /var/log/Xorg.0.log
> (II) Initializing built-in extension MIT-SHM

It is - and it was when the discussion happened (several years ago). The issue was whether to introduce shared memory support for the Xv extension. (And yes, it was introduced.)

best

Vladimir Dergachev

> - the shared memory extension is also crucial for
> client-side font rendering (xft/freetype), a.k.a. all those
> nicely anti-aliased texts in firefox and openoffice, besides
> DVD playback.
> HTL
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] C vs. C++ as learning and development tool for R
On Friday 19 January 2007 1:29 pm, Gabor Grothendieck wrote:
> > If you decide to use C++ with R you should check out the documentation
> > that comes with the package RcppTemplate, and the sample code that
> > comes with that package. In my experience C++ (or C or FORTRAN) is
> > needed for many compute intensive tasks, and the R framework provides
> > a nice front-end with its extensive collection of visualization and
> > statistical analysis tools.
> Actually I have found the opposite. I have never found C/C++ to be
> necessary. I have always been able to optimize the R code itself to get
> it to run sufficiently fast for my purposes.

The nice thing about being able to use C code is the confidence it provides: however slowly your R script runs right now, you know you will be able to make it faster - no matter what. On quite a few occasions I have started writing C code and, after thinking about how I would structure it, realized that I could do the same thing in R and still get 50% of the speed improvement I would get from C.

Also, I am not sure whether this is mentioned anywhere, but I found it more convenient to use dyn.load directly instead of creating a full-blown R package - the edit-compile-test cycle is much shorter that way.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
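A minimal sketch of that dyn.load workflow; the file conv.c and the symbol vec_scale are invented for the example:

    # conv.c is assumed to contain a .C-callable routine, e.g.:
    #   void vec_scale(double *x, int *n, double *s) {
    #       int i;
    #       for (i = 0; i < *n; i++) x[i] *= *s;
    #   }
    system("R CMD SHLIB conv.c")   # edit ... compile to conv.so
    dyn.load("conv.so")            # ... load (conv.dll on Windows)
    x <- as.double(1:10)
    out <- .C("vec_scale", x = x, as.integer(length(x)), as.double(2.5))$x
    dyn.unload("conv.so")          # unload so the next compile can be reloaded

Since .C copies its arguments and returns them as a list, the scaled vector comes back as the named component x, leaving the original untouched.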
Re: [Rd] C vs. C++ as learning and development tool for R
On Friday 19 January 2007 6:46 pm, Ross Boylan wrote:
> On Fri, Jan 19, 2007 at 03:55:30AM -0500, Kimpel, Mark William wrote:
> I can't say much about "libraries already on other machines", but the
> C runtime is probably the one you can count on being there the most.

Well, I don't think it is there on Windows machines - and it is specific to the compiler. Visual C has shipped several different versions, Borland had its own, and there have been several major releases of the GNU C library.

My preference is that on Windows one distributes only static binaries or a small loadable object (i.e. a DLL) loaded from Tcl/Tk or R. On Linux I found it is best to link the C and X11/GL libraries dynamically (as older versions are usually available) and link everything else statically. Major exception: Condor-linked binaries are static.

Caveat - I have only distributed GPL/LGPL code, so making static binaries was not an issue. If you have a closed-source application, then any LGPL libraries you use must be linked dynamically, and you cannot use GPL code at all.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel