Re: [Rd] SEXP i/o, .Call(), and garbage collection.
On Thursday 01 February 2007 2:01 pm, Hin-Tak Leung wrote:
> One possible reason for such problems is if you copy the pointers
> for, say, attributes, classes, names, rather than duplicating them.
> With very few exceptions, mostly in classes, no two R objects of
> the sort you normally encounter/create/play-with should share *any*
> part of their data-structure. e.g. such a problem can result if you
> assign the row names of the input to the output (even if both have
> the same row names).

Hmm.. I thought that using setAttrib() would automatically increase the
reference count, right ?

In particular, I quite often use "pseudo-factor" string vectors - where
the string objects are passed through a cache and reused when forming a
string vector. The result is a true character() vector, but with
considerable memory savings. The downside is that the R reference count
field is usually saturated.

best

Vladimir Dergachev
Re: [Rd] xlsReadWrite Pro and embedding objects and files in Excel worksheets
On Thursday 08 February 2007 2:09 pm, tshort wrote:
> I don't know of an R package that has a function to encode files as a
> multipart mime, but the link above is a good start.

Tcllib has a mime encoding module - one could load it from within R with
.Tcl("package require mime")

best

Vladimir Dergachev

> - Tom
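For instance, base64-encoding a single file from R via Tcl could look
roughly like this (a sketch only - it assumes tcllib's base64 module is
installed, "report.xls" is a made-up file name, and error handling is
omitted):

library(tcltk)
.Tcl("package require base64")   # base64 module shipped with tcllib
.Tcl("set fd [open report.xls r]; fconfigure $fd -translation binary")
enc <- tclvalue(.Tcl("base64::encode [read $fd]"))   # encoded text, back in R
.Tcl("close $fd")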
Re: [Rd] RODBC problems with unixodbc
On Tuesday 20 February 2007 1:51 pm, Sebastian P. Luque wrote:
> Hi,
>
> I noticed that if a column is named "end" in a data frame (table.df
> below), it leads to errors when trying to sqlSave() it to a postgresql
> connection:
>
> ---<---cut here---start-->---
> con <- odbcConnect("PostgreSQL-DB", uid="user", pwd="password",
>                    case="postgresql")
> R> sqlSave(con, table.df)
> Error in sqlSave(con, table.df) :
>   [RODBC] ERROR: Could not SQLExecDirect
>   42601 7 [unixODBC]Error while executing the query;
>   ERROR: syntax error at or near "end" at character 140
> ---<---cut here---end>---
>
> If I rename the column to something else (e.g. "ending"), this proceeds
> without problems. What could the problem be here? Thanks.

It is likely because "end" is a reserved word in SQL.

best

Vladimir Dergachev

> Cheers,
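Renaming the column before saving is the simplest workaround (a sketch;
PostgreSQL would also accept the identifier if it were double-quoted in
the generated SQL, but sqlSave() composes the query itself):

names(table.df)[names(table.df) == "end"] <- "end_time"   # any non-reserved name
sqlSave(con, table.df)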
[Rd] JIT compiler library
Since this escaped my notice before, I thought it useful to post a link
here - in case you have not seen it either:

   http://www.gnu.org/software/lightning/lightning.html

This is a portable JIT compiler library with a fairly easy syntax (one
syntax - many cpus).

best

Vladimir Dergachev
[Rd] as.Date nuance
Hi,

I have encountered a nuance in as.Date() behaviour that is not altogether
obvious - not sure whether this is intended or not:

> as.Date("2001-01-01error")
[1] "2001-01-01"

I.e. it ignores the rest of the characters. This happens in both the
2.3.1 and 2.4.1 versions.

This also happens with an explicit format specification:

> as.Date("2006-01-01error", format="%Y-%m-%d")
[1] "2006-01-01"

thank you

Vladimir Dergachev
Re: [Rd] as.Date nuance
On Saturday 24 March 2007 6:21 am, Prof Brian Ripley wrote:
> This is how strptime() works: it processes the input to match the format.

Except that the format does not match the string - there are leftover
characters. Even by R's own definition:

> match("a", "ab")
[1] NA

as, of course, is reasonable. Is there some way to make sure there is an
exact match ?

thank you !

Vladimir Dergachev

> On Fri, 23 Mar 2007, Vladimir Dergachev wrote:
> > I have encountered a nuance in as.Date() behaviour that is not
> > altogether obvious - not sure whether this is intended or not:
> >
> > > as.Date("2001-01-01error")
> > [1] "2001-01-01"
> >
> > I.e. it ignores the rest of the characters. This happens both in 2.3.1
> > and 2.4.1 versions.
>
> It has always occurred.
>
> > This also happens with explicit format specification:
> >
> > > as.Date("2006-01-01error", format="%Y-%m-%d")
> > [1] "2006-01-01"
> >
> > thank you
> >
> > Vladimir Dergachev
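One way to enforce an exact match with only base functions (a sketch -
strict.as.Date is a hypothetical name, and note it also rejects inputs
that parse but do not round-trip through format(), e.g. "2006-1-1"):

strict.as.Date <- function(x, format = "%Y-%m-%d") {
    d <- as.Date(x, format = format)
    bad <- !is.na(d) & format(d, format) != x   # leftover or non-canonical text
    d[bad] <- NA
    d
}
strict.as.Date(c("2006-01-01", "2006-01-01error"))
# [1] "2006-01-01" NA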
Re: [Rd] as.Date nuance
On Saturday 24 March 2007 12:12 pm, Gabor Grothendieck wrote:
> It matches in the sense of grep or regexpr
>
> grep("a", "ab") > 0
> regexpr("a", "ab") > 0
>
> Try this:
>
> x <- c("2006-01-01error", "2006-01-01")
> as.Date(x, "%Y-%m-%d") + ifelse(regexpr("^....-..-..$", x) > 0, 0, NA)

Well, still I would have expected as.Date() to do the same thing
as.integer() or as.numeric() do - return NA and produce a warning.

After poking in the code I also noticed that the format guess is done
using the first element only:

> as.Date(c("2006", "2006-01-01"))
Error in fromchar(x) : character string is not in a standard unambiguous format
> as.Date(c("2006-01-01", "2006"))
[1] "2006-01-01" NA

I attached a patch that changes do_strptime to behave like
coerceToInteger - please let me know if it is reasonable, and I'll then
see about getting as.Date() to work correctly..

thank you

Vladimir Dergachev

Index: src/main/datetime.c
===================================================================
--- src/main/datetime.c	(revision 40895)
+++ src/main/datetime.c	(working copy)
@@ -818,9 +818,9 @@
 SEXP attribute_hidden do_strptime(SEXP call, SEXP op, SEXP args, SEXP env)
 {
     SEXP x, sformat, ans, ansnames, klass, stz, tzone;
-    int i, n, m, N, invalid, isgmt = 0, settz = 0;
+    int i, n, m, N, invalid, isgmt = 0, settz = 0, warn = 0;
     struct tm tm, tm2;
-    char *tz = NULL, oldtz[20] = "";
+    char *tz = NULL, oldtz[20] = "", *p;
     double psecs = 0.0;
 
     checkArity(op, args);
@@ -859,10 +859,15 @@
 	tm.tm_year = tm.tm_mon = tm.tm_mday = tm.tm_yday = tm.tm_wday = NA_INTEGER;
 	tm.tm_isdst = -1;
-	invalid = STRING_ELT(x, i%n) == NA_STRING ||
-	    !R_strptime(CHAR(STRING_ELT(x, i%n)),
-			CHAR(STRING_ELT(sformat, i%m)), &tm, &psecs);
+	invalid = STRING_ELT(x, i%n) == NA_STRING;
 	if(!invalid) {
+	    invalid = !(p = R_strptime(CHAR(STRING_ELT(x, i%n)),
+			CHAR(STRING_ELT(sformat, i%m)), &tm, &psecs)) ||
+		(*p);
+	    warn |= invalid;
+	}
+
+	if(!invalid) {
 	    /* Solaris sets missing fields to 0 */
 	    if(tm.tm_mday == 0) tm.tm_mday = NA_INTEGER;
 	    if(tm.tm_mon == NA_INTEGER || tm.tm_mday == NA_INTEGER
@@ -901,6 +906,8 @@
     }
     if(settz) reset_tz(oldtz);
 
+    if(warn) warning(_("NAs introduced by coercion"));
+
     UNPROTECT(3);
     return ans;
 }
Re: [Rd] inline C/C++ in R: question and suggestion
On Tuesday 22 May 2007 3:52 pm, Duncan Murdoch wrote:
> On 5/22/2007 1:59 PM, Oleg Sklyar wrote:
>
> One suggestion that probably doesn't affect your package: It would be
> even nicer if R incorporated something that Duncan Temple Lang suggested
> last year, namely a new kind of quoting that didn't need escapes in the
> string. He suggested borrowing triple quotes from Python; I suggested
> something more like heredocs as in shells or Perl, or like \verb in TeX,
> in case you wanted triple quotes in your C function. It would be nice
> to settle on something, so that instead of

I second that. My favorite implementation of this is in Tcl, where curly
braces {} mean that the text they enclose is unmodified. Since language
constructs using them are normally balanced, this is not an impediment.

One extremely useful application of this (aside from long strings) is
specifying inline data frames - I don't know how to do this otherwise.
I.e. something like:

A <- scan.string({#
Id  Value  Mark
1   a      3
2   b      4
# })

best

Vladimir Dergachev
Re: [Rd] Quoting (was: inline C/C++ in R: question and suggestion)
On Tuesday 22 May 2007 4:58 pm, Duncan Murdoch wrote:
> On 22/05/2007 4:01 PM, Vladimir Dergachev wrote:
> > On Tuesday 22 May 2007 3:52 pm, Duncan Murdoch wrote:
> > > On 5/22/2007 1:59 PM, Oleg Sklyar wrote:
> >
> > I second that. My favorite implementation of this is in Tcl, where
> > curly braces {} mean that the text they enclose is unmodified. Since
> > language constructs using them are normally balanced this is not an
> > impediment.
>
> That wouldn't work in R, because the parser couldn't tell whether
>
> { a }

One easy workaround is to have a string{ ... } construct - it should be
very easy to parse string{ differently from { alone.

> was a block of code or a quoted string.
>
> > One extremely useful application of this (aside from long strings) is
> > specifying inline data frames - I don't know how to do this otherwise.
> >
> > I.e. something like:
> >
> > A <- scan.string({#
> > Id  Value  Mark
> > 1   a      3
> > 2   b      4
> > # })
>
> When your data doesn't contain quote marks, you can just use regular
> quotes to do that. I don't know of a scan.string function, but this
> works:
>
> A <- read.table(textConnection("#
> Id  Value  Mark
> 1   a      3
> 2   b      4
> #"), head = TRUE)

Cool, thank you !

> I think DTL's suggestion would be most useful when putting a lot of code
> in a string, where the escapes make the code harder to read. For
> example, just about any function using a complicated regular expression.

Also anything using .Tcl(). Quotes in a data frame definition are useful
because they could be employed to delimit text fields with spaces in them.

best

Vladimir Dergachev
Re: [Rd] Quoting
On Tuesday 22 May 2007 7:05 pm, Peter Dalgaard wrote:
> Vladimir Dergachev wrote:
> > > I think DTL's suggestion would be most useful when putting a lot of
> > > code in a string, where the escapes make the code harder to read.
> > > For example, just about any function using a complicated regular
> > > expression.
> >
> > Also anything using .Tcl(). Quotes in data frame definition are useful
> > because they could be employed to delimit text fields with spaces in
> > them.
>
> .Tcl() is usually the wrong solution anyway, you really should use tcl()
> unless absolutely necessary.

Actually I could not figure out how to use tcl() - it seems to work only
for calling a single Tcl/Tk command. I mostly use .Tcl() to create guis
along the lines of:

foreach {control desc var value} {
    label "Just a description" title1 0
    entry "Edit some text" text_var {Hello there}
} {
    switch -exact -- $control {
        label {
            label .l$var -text $desc
            grid .l$var - -sticky news
        }
        entry {
            label .l$var -text $desc
            entry .e$var -textvariable $var
            grid .l$var .e$var -sticky news
            global $var
            set $var $value
        }
        # other control types follow
    }
}

this can get pretty versatile and works for plots and other things..

best

Vladimir Dergachev
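For comparison, the tcl()/tk* route Peter refers to builds the same kind
of widget one call at a time; a minimal sketch using only documented
tcltk functions:

library(tcltk)
tt  <- tktoplevel()
lbl <- tklabel(tt, text = "Edit some text")
var <- tclVar("Hello there")              # Tcl variable visible from R
ent <- tkentry(tt, textvariable = var)
tkgrid(lbl, ent, sticky = "news")
tclvalue(var)                             # read the entry's contents back into R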
[Rd] is.finite confusion
I have recently made a silly screwup by applying is.finite() to a
character vector:

> is.finite(c("a", "b"))
[1] FALSE FALSE

This does work with factors of course (as they are integers underneath).

I wonder if a fix could be put in so that it either reports an error when
applied to a character vector - or, perhaps better, acts as is.na()

thank you

Vladimir Dergachev

PS tested on R 2.5.0, 2.3.1
Re: [Rd] is.finite confusion
On Wednesday 23 May 2007 1:29 pm, Prof Brian Ripley wrote:
> No, because it is carefully documented to do this, and people rely on it
> working as documented. (Did you do the homework the posting guide asked
> for?) What harm came out of learning that the values were not finite?

I read the manpage, if that is what you are talking about.

The particular thing I was attempting to do is to convert all entries
that are not values to NULL before storing the result in the database.
From my point of view a string value was perfectly finite, and my code
worked with a data.frame I had because it happened to have factors in it.

Yes, I easily concede that since I know about it now I am not likely to
make the same mistake again. I was just trying (politely) to be of help
to other users.

best

Vladimir Dergachev

> On Wed, 23 May 2007, Vladimir Dergachev wrote:
> > I have recently made a silly screwup by applying is.finite() to a
> > character vector:
> >
> > > is.finite(c("a", "b"))
> > [1] FALSE FALSE
> >
> > This does work with factors of course (as they are integer underneath)
> >
> > I wonder if a fix could be put in so that it either reports an error
> > when applied to a character vector - or, perhaps better, act as
> > is.na()
>
> What way is that? It acts in the same way, as I understand the help
> pages.
>
> > thank you
> >
> > Vladimir Dergachev
> > PS test on R 2.5.0, 2.3.1
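A type-guarded version of that cleanup avoids the trap (a sketch; df and
to.db.na are made-up names, and character/factor columns are left alone
since for them NA is the only "non-value"):

to.db.na <- function(x) {
    if (is.numeric(x)) x[!is.finite(x)] <- NA   # NaN, +/-Inf -> NA (SQL NULL)
    x
}
df[] <- lapply(df, to.db.na)   # column by column, preserving the data.frame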
Re: [Rd] R scripts slowing down after repeated called to compiled code
On Friday 25 May 2007 7:12 pm, Michael Braun wrote:
> Thanks in advance to anyone that might be able to help me with this.
>
> Also, it is not just the compiled call that slows down. EVERYTHING
> slows down, even those that consist only of standard R functions. The
> time for each of these function calls is roughly proportional to the
> time of the .Call to the C function.
>
> Another observation is that when I terminate the algorithm, do a
> rm(list=ls()), and then a gc(), not all of the memory is returned to the
> OS. It is not until I terminate the R session that I get all of the
> memory back. In my C code, I am not doing anything to de-allocate the
> SEXPs I create, relying on the PROTECT/UNPROTECT mechanism instead (is
> this right?).
>
> I spent most of the day thinking I have a memory leak, but that no
> longer appears to be the case. I tried using Rprof(), but that only
> gives me the aggregated relative time spent in each function (more than
> 80% of the time, it's in the .Call).

One possibility is that you are somehow creating a lot of R objects (say
by calling assign() or by missing an UNPROTECT()) and this slows the
garbage collector down. The garbage collector's running time grows with
the number of objects you have - their total size does not have to be
large.

Could you try printing the numbers from a gc() call and checking whether
the number of allocated objects grows a lot ?

best

Vladimir Dergachev

> So I'm stuck. Can anyone help?
>
> Thanks,
>
> Michael
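Something along these lines makes the growth easy to see (a sketch -
"my_entry_point" and "input" stand in for the real .Call entry point and
its arguments):

before <- gc()
for (k in 1:100) ans <- .Call("my_entry_point", input)
after <- gc()
after[, "used"] - before[, "used"]   # Ncells/Vcells still alive after the calls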
Re: [Rd] data messed up by read.table ? (PR#9779)
On Thursday 05 July 2007 7:00:46 am [EMAIL PROTECTED] wrote:
> Full_Name: Joerg Rauh
> Version: 2.5.0
> OS: Windows 2000
> Submission from: (NULL) (84.168.226.163)
>
> Following Michael J. Crawley "Statistical Computing" on page 9 the
> worms.txt is required. After downloading it from the book's supporting
> website, which is http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/
> I visually check the data against the book and they look identical.
> Then I do a read.table as suggested:
> worms<-read.table("C:/Programme/R/R-2.5.0/Data/Worms.txt", header = T).

I see the same effect on 2.5.0 and 2.5.1 running on Linux. However, the
following line reads the data correctly:

read.table('worms.txt', header=TRUE, quote="\"")

Thus the problem is likely caused by single quotes in the Field.Name
column; perhaps the single quote character was added to the list of
default quote characters after the book was released.

best

Vladimir Dergachev
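The default is visible in the function's formals - read.table() treats
both quote characters as quoting unless told otherwise, which is why
restricting it to double quotes sidesteps the apostrophes in the data:

formals(read.table)$quote
# [1] "\"'"     - both double and single quotes quote by default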
Re: [Rd] S4 slot with NA default
On Wednesday 26 March 2008 12:04:11 pm Robin Hankin wrote:
> Hi
>
> How do I specify an S4 class with a slot that is potentially numeric,
> but NA by default? I want the slot to be NA until I calculate its value
> (an expensive operation, not needed for all applications). When its
> value is known, I will create a new object with the correct value
> inserted in the slot.
>
> I want "NA" to signify "not known".
>
> My attempt fails because NA is not numeric:

Try as.numeric(NA) - by default, plain NA is of type "logical".

best

Vladimir Dergachev

> --
> Robin Hankin
> Uncertainty Analyst and Neutral Theorist,
> National Oceanography Centre, Southampton
> European Way, Southampton SO14 3ZH, UK
> tel 023-8059-7743

--
Vladimir Dergachev
RCG Ardis Capital LLC
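Concretely, a sketch of such a class ("Expensive" and its slot name are
made up for illustration):

setClass("Expensive",
         representation(value = "numeric"),
         prototype(value = as.numeric(NA)))   # numeric NA, not logical NA

obj <- new("Expensive")
is.na(obj@value)   # TRUE until the expensive value is filled in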
[Rd] Some R questions
Hi all,

I am working with some large data sets (1-4 GB) and have some questions
that I hope someone can help me with:

1. Is there a way to turn off the garbage collector from within the C
interface ? What I am trying to do is suck data from mysql (using my own
C functions), and I see that allocating each column (with about 1-4
million items) takes between 0.5 and 1 seconds. My first thought was that
it would be nice to turn off the garbage collector, allocate all the
data, copy the values and then turn the garbage collector back on.

2. For creating a STRSXP should I be using mkChar() or mkString() to
create element values ? Is there a way to do it without allocating a cons
cell ? (otherwise a single STRSXP of length 1e6 slows down the garbage
collector)

3. Is the "row.names" attribute required for data frames and, if so, can
I use some other type besides STRSXP ?

4. While poking around to find out why some of my code is excessively
slow I have come upon the definition of `[.data.frame` - the subscription
operator for data frames - which appears to be written in R. I am
wondering whether I am looking at the right place, and whether anyone
would be interested in a piece of C code optimizing it - in particular,
extraction of a single element is quite slow (i.e. calls like T[i, j]).

thank you very much !

Vladimir Dergachev
[Rd] Fwd: Re: Some R questions
A correction to my previous post: after running the examples A[,1] and
A[[1]] repeatedly, the running times decrease, so that eventually A[[1]]
takes 0.025 seconds (according to system.time()) and A[,1] takes 1.8
seconds.

The ratio of the times is still 2-digit but, apparently, the garbage
collector is a good deal faster when memory is already available.

best

Vladimir Dergachev
Re: [Rd] Some R questions
On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> Hi,
>
> Had experience with this on doing SQLiteDF...
>
> On 11/1/06, Vladimir Dergachev <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I am working with some large data sets (1-4 GB) and have some
> > questions that I hope someone can help me with:
> >
> > 1. Is there a way to turn off garbage collector from within C
> > interface ? what I am trying to do is suck data from mysql (using my
> > own C functions) and I see that allocating each column (with about 1-4
> > million items) takes between 0.5 and 1 seconds. My first thought was
> > that it would be nice to turn off the garbage collector, allocate all
> > the data, copy values and then turn the garbage collector back on.
>
> I believe not. FWIW a numeric() vector is a chunk of memory with a
> VECTOR_SEXP header and then your data contiguously allocated. If you
> are desperate enough and assuming the garbage collector is indeed the
> culprit, you may want to implement your own lightweight allocVector
> (the function expanded to by NEW_NUMERIC(), etc.)

Thank you very much for the suggestion !

After looking around in the code I realized that what I really wanted was
R_gc_internal() - as then I can tell the garbage collector in advance
that I will require that much heap, and it does not need to go and
allocate it each time I ask (btw I would have expected it to double the
heap each time it runs out, but this is not what goes on, at least in
R 2.3.1).

After some mucking around, here is a poor man's substitute which might be
useful:

void fault_mem_region(long size)
{
	long chunk;
	int max = (1<<30) / sizeof(int);
	int block_count = 0;
	SEXP block;

	while(size > 0) {
		chunk = size;
		if(chunk > max) chunk = max;

		PROTECT(block = allocVector(INTSXP, chunk));
		block_count++;
		size -= chunk;
	}
	UNPROTECT(block_count);
}

On a 48 column data frame (with 1.2e6 rows) the call

	fault_mem_region(ncol + nrow*11 + ncol*nrow)

shaved 5 seconds off a 33 second running time (which includes running the
mysql query). It is not perfect however, as I could see the last columns
allocating slower than the initial ones.

Also, while looking around in allocVector I saw that after running the
garbage collector it simply calls malloc, and if malloc fails it calls
the garbage collector again. What would be nice is the ability to bypass
that first garbage collector call when allocating large nodes.

> > 2. For creating STRSXP should I be using mkChar() or mkString() to
> > create element values ? Is there a way to do it without allocating a
> > cons cell ? (otherwise a single STRSXP with 1e6 length slows down
> > garbage collector)
>
> A string vector (STRSXP) is composed of CHARSXP's. mkChar makes a
> CHARSXP, and mkString makes a STRSXP with 1 CHARSXP, more like a
> shorthand for
>
> SEXP str = NEW_CHARACTER(1);
> SET_STRING_ELT(str, 0, mkChar("foo"));

Makes sense - thank you !

> > 3. Is "row.names" attribute required for data frames and, if so, can
> > I use some other type besides STRSXP ?
>
> It is required. It can be integers, for 2.4.0+

Great !

> > 4. While poking around to find out why some of my code is
> > excessively slow I have come upon definition of `[.data.frame` -
> > subscription operator for data frames, which appears to be written in
> > R. I am wondering whether I am looking at the right place and whether
> > anyone would be interested in a piece of C code optimizing it - in
> > particular extraction of single element is quite slow (i.e. calls
> > like T[i, j]).
>
> [.data.frame is such a pain to implement because there are just too
> many ways to index a data frame. You may want to do a specialized
> index-er that just considers the index-ing styles you use. But I think
> you are just not vectorizing enough. If you have to access your data
> frames like that then it must be inside some loop, which would kill
> your social life.

Hmm, I thought to implement subscription with integer or logical vectors,
and then some hash-based lookup for column and (possibly) row names.

The slowness manifests itself for vectorized code as well. I believe it
is due to the code mucking about with the row.names attribute, which
introduces a penalty on any [,] operation - a penalty that grows linearly
with the number of rows. Thus for large data frames A[,1] is slower than
A[[1]].

For example, for the data frame I mentioned above, E<-A[[1]] took 0.46
seconds (way too much in my opinion), but E<-A[,1] took 62.45 seconds -
more than a minute, and more than twice the time it took to load the
entire thing into memory. Silly, isn't it ?

Also, there are good reasons to want to add
[Rd] allocVector bug ?
Hi all,

I was looking at the following piece of code in src/main/memory.c,
function allocVector :

    if (size <= NodeClassSize[1]) {
	node_class = 1;
	alloc_size = NodeClassSize[1];
    }
    else {
	node_class = LARGE_NODE_CLASS;
	alloc_size = size;
	for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
	    if (size <= NodeClassSize[i]) {
		node_class = i;
		alloc_size = NodeClassSize[i];
		break;
	    }
	}
    }

It appears that for LARGE_NODE_CLASS the variable alloc_size should not
be size, but something far less, as we are not using the vector heap but
rather calling malloc directly in the code below (and from discussions I
read on this mailing list I think that these two are different - please
let me know if I am wrong).

So when allocating a large vector the garbage collector goes nuts trying
to find all that space, which is not going to be needed after all.

I made an experiment and replaced the line alloc_size=size with
alloc_size=0. R compiled fine (both 2.4.0 and 2.3.1) and passed make
check with no issues (it all printed OK). Furthermore, all allocVector
calls completed in no time and my test case ran very fast (22 seconds, as
opposed to minutes). In addition, attach() was instantaneous, which was
wonderful.

Could anyone with deeper knowledge of R internals comment on whether this
makes any sense ?

thank you very much !

Vladimir Dergachev
Re: [Rd] allocVector bug ?
Hi Luke,

Thank you for the patient reply ! I have looked into the issue a little
deeper, comments below:

On Thursday 02 November 2006 11:26 pm, Luke Tierney wrote:
> On Wed, 1 Nov 2006, Vladimir Dergachev wrote:
> > Hi all,
> >
> > I was looking at the following piece of code in src/main/memory.c,
> > function allocVector :
> >
> >    if (size <= NodeClassSize[1]) {
> >        node_class = 1;
> >        alloc_size = NodeClassSize[1];
> >    }
> >    else {
> >        node_class = LARGE_NODE_CLASS;
> >        alloc_size = size;
> >        for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
> >            if (size <= NodeClassSize[i]) {
> >                node_class = i;
> >                alloc_size = NodeClassSize[i];
> >                break;
> >            }
> >        }
> >    }
> >
> > It appears that for LARGE_NODE_CLASS the variable alloc_size should
> > not be size, but something far less as we are not using vector heap,
> > but rather calling malloc directly in the code below (and from
> > discussions I read on this mailing list I think that these two are
> > different - please let me know if I am wrong).
> >
> > So when allocating a large vector the garbage collector goes nuts
> > trying to find all that space which is not going to be needed after
> > all.
>
> This is as intended, not a bug. The garbage collector does not "go
> nuts" -- it is doing a garbage collection that may release memory in
> advance of making a large allocation. The size of the current
> allocation request is used as part of the process of deciding when to
> satisfy an allocation by malloc (of a single large node or a page) and
> when to first do a gc. It is essential to do this for large
> allocations as well to keep the memory footprint down and help reduce
> fragmentation.

I generally agree with this, however I believe that the current logic
breaks down for large allocation sizes, and my code ends up spending 70%
(and up) of computer time spinning inside the garbage collector (I ran
oprofile to observe what is going on).

I do realize that garbage collection is not an easy problem and that
hardware and software environments change - my desire is simply to have a
version of R that is usable for the problems I am dealing with as, aside
from the slowdown with large vector sizes, I find R a very capable tool.

I would greatly appreciate it if you could comment on the following
observations:

1. The time spent during a single garbage collector run grows with the
number of nodes - from looking at the code I believe it is linear, but I
am not certain.

2. In my case the data.frame contains a few string vectors. These
allocate lots of CHARSXPs, which are the main cause of the slowdown of
each garbage collector run. Would you have any suggestions on optimizing
this particular situation ?

3. Any time a data.frame is created, or one performs an attach()
operation, there is a series of allocations - and if one of them causes
memory to expand, all the rest will too.

I put in an fprintf() statement to show alloc_size, VHEAP_FREE and
R_VSize when allocVector is called (this is done only for node_class ==
LARGE_NODE_CLASS). The first output snippet is from the time the script
starts and tries to create the data.frame:

alloc_size=128 VHEAP_FREE=604182 R_VSize=786432
alloc_size=88 VHEAP_FREE=660051 R_VSize=786432
alloc_size=88 VHEAP_FREE=659963 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=659874 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=4465461
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=8544282
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=12623103
...
alloc_size=4078820 VHEAP_FREE=260677 R_VSize=271628325
alloc_size=4078820 VHEAP_FREE=260677 R_VSize=275707147

As you can see, VHEAP_FREE() stays far below alloc_size, so every large
allocation triggers a gc. Next, attach(B):

alloc_size=4078820 VHEAP_FREE=1274112 R_VSize=294022636
alloc_size=4078820 VHEAP_FREE=499351 R_VSize=297325768
...
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=568670030
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=572748850
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=576827670
alloc_size=88 VHEAP_FREE=602082 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601915 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601798 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601678 R_VSize=580906490
...
alloc_size=44 VHEAP_FREE=591581 R_VSize=580906490
alloc_size=88 VHEAP_FREE=591323 R_VSize=580906490
alloc_size=44 VHEAP_FREE=591220 R_VSize=580906490

So we have the same behaviour as before - the garbage collector gets run
every time attach creates a new large vector, but functions perfectly for
smaller vector sizes.

Next, I did detach(B) (which freed up memory) followed by "F<-B[,1]":

alloc_size=113 VHEAP_FREE=588448 R_VSize=580906490
alloc_size=618 VHEAP_FREE=588335 R_VSize=580906490
alloc_size=618 VHEAP_FREE=587717 R_VSize=
Re: [Rd] gc()$Vcells < 0 (PR#9345)
On Monday 06 November 2006 6:12 pm, [EMAIL PROTECTED] wrote:
> version.string Version 2.3.0 (2006-04-24)
>
> > x<-matrix(nrow=44000,ncol=48000)
> > y<-matrix(nrow=44000,ncol=48000)
> > z<-matrix(nrow=44000,ncol=48000)
> > gc()
>
>               used    (Mb) gc trigger    (Mb) max used    (Mb)
> Ncells      177801     9.5     407500    21.8            3518.7
> Vcells -1126881981 24170.6         NA 24173.4       NA 24170.6

This happens to me with versions 2.4.0 and 2.3.1. The culprit is this
line in src/main/memory.c:

    INTEGER(value)[1] = R_VSize - VHEAP_FREE();

Since the amount used is greater than 4G and INTEGER is 32 bits long
(even on 64 bit machines), this returns (harmless) nonsense. The megabyte
value nearby is correct, and the gc trigger and max used fields are
marked as NA already.

best

Vladimir Dergachev
[Rd] data frame subscription operator
Hi all,

I was looking at the data frame subscription operator (attached at the
end of this e-mail) and got puzzled by the following line:

    class(x) <- attr(x, "row.names") <- NULL

This appears to set the class and row.names attributes of the incoming
data frame to NULL. So far I was not able to figure out why this is
necessary - could anyone help ?

The reason I am looking at it is that changing attributes forces
duplication of the data frame, and this is the largest cause of slowness
of data.frames in general.

thank you very much !

Vladimir Dergachev

> `[.data.frame`
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1)
{
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    if (Narg < 3) {
        if (!mdrop)
            warning("drop argument will be ignored")
        if (missing(i))
            return(x)
        if (is.matrix(i))
            return(as.matrix(x)[i])
        y <- NextMethod("[")
        nm <- names(y)
        if (!is.null(nm) && any(is.na(nm)))
            stop("undefined columns selected")
        if (any(duplicated(nm)))
            names(y) <- make.unique(nm)
        return(structure(y, class = oldClass(x), row.names = attr(x,
            "row.names")))
    }
    rows <- attr(x, "row.names")
    cols <- names(x)
    cl <- oldClass(x)
    class(x) <- attr(x, "row.names") <- NULL
    if (missing(i)) {
        if (!missing(j))
            x <- x[j]
        cols <- names(x)
        if (any(is.na(cols)))
            stop("undefined columns selected")
    }
    else {
        if (is.character(i))
            i <- pmatch(i, as.character(rows), duplicates.ok = TRUE)
        rows <- rows[i]
        if (!missing(j)) {
            x <- x[j]
            cols <- names(x)
            if (any(is.na(cols)))
                stop("undefined columns selected")
        }
        for (j in seq_along(x)) {
            xj <- x[[j]]
            x[[j]] <- if (length(dim(xj)) != 2)
                xj[i]
            else xj[i, , drop = FALSE]
        }
    }
    if (drop) {
        drop <- FALSE
        n <- length(x)
        if (n == 1) {
            x <- x[[1]]
            drop <- TRUE
        }
        else if (n > 1) {
            xj <- x[[1]]
            nrow <- if (length(dim(xj)) == 2)
                dim(xj)[1]
            else length(xj)
            if (!mdrop && nrow == 1) {
                drop <- TRUE
                names(x) <- cols
                attr(x, "row.names") <- NULL
            }
        }
    }
    if (!drop) {
        names(x) <- cols
        if (any(is.na(rows) | duplicated(rows))) {
            rows[is.na(rows)] <- "NA"
            rows <- make.unique(rows)
        }
        if (any(duplicated(nm <- names(x))))
            names(x) <- make.unique(nm)
        attr(x, "row.names") <- rows
        class(x) <- cl
    }
    x
}
Re: [Rd] gc()$Vcells < 0 (PR#9345)
On Tuesday 07 November 2006 6:28 am, Prof Brian Ripley wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > On Monday 06 November 2006 6:12 pm, [EMAIL PROTECTED] wrote:
> > > version.string Version 2.3.0 (2006-04-24)
> > >
> > > > x<-matrix(nrow=44000,ncol=48000)
> > > > y<-matrix(nrow=44000,ncol=48000)
> > > > z<-matrix(nrow=44000,ncol=48000)
> > > > gc()
> > >
> > >               used    (Mb) gc trigger    (Mb) max used    (Mb)
> > > Ncells      177801     9.5     407500    21.8            3518.7
> > > Vcells -1126881981 24170.6         NA 24173.4       NA 24170.6
> >
> > Happens to me with versions 2.4.0 and 2.3.1. The culprit is this line
> > in src/main/memory.c:
> >
> >    INTEGER(value)[1] = R_VSize - VHEAP_FREE();
> >
> > Since the amount used is greater than 4G and INTEGER is 32bit long
> > (even on 64 bit machines) this returns (harmless) nonsense.
>
> That's not quite correct. The units here are Vcells (8 bytes), and
> integer() is signed, so this can happen only if more than 16Gb of heap
> is allocated.

I see - thank you for the explanation !

> We are aware that we begin to hit problems at 16Gb: it is for example
> the maximum size of an R vector. Those objects are logical and so about
> 7.8Gb each: their length as vectors is 98% of the maximum possible.
> However, the first time we discussed it we thought it would be about 5
> years before those limits would become important -- I think three of
> those years have since passed.
>
> > The megabyte value nearby is correct and gc trigger and max used
> > fields are marked as NA already.
>
> and now 'used' is also marked as NA in 2.4.0 patched.

Great, thank you !

> This is only a reporting issue. When I first used R it reported only
> numbers, and I added the Mb as a more comprehensible figure (especially
> for Ncells). I think it would be sensible now to only report these
> figures in Mb or Gb (and also the reports for gcinfo(TRUE)).

Why not use KB ? This still preserves information about small
allocations, and raises the limit to 2 TB (2^31 KB) - surely at least 5
years off ! :)

Alternatively, doubles would be able to hold the entire number, but this
would require changes to how the information is displayed.

> The model behind the report actually pre-dates the GC change in 1.2.0.
> The 'Vcells' are nowadays the sum of all the allocations from VECSXPs
> (which include their headers), rather than the 'vector heap' (although
> some of the earlier terminology persists).

I see.

thank you !

Vladimir Dergachev
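For reference, the arithmetic behind those limits (a quick check from R):

.Machine$integer.max                 # 2147483647
.Machine$integer.max * 8 / 2^30      # ~16 GB: max heap countable in 8-byte Vcells
.Machine$integer.max * 1024 / 2^40   # ~2 TB: max heap countable in KB units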
Re: [Rd] variable problem
On Tuesday 07 November 2006 3:28 pm, Tom McCallum wrote:
> Hi everyone,

Hi Tom,

Would this snippet work:

for(i in 1:length(mylist)) do.call(f, mylist[i])

On the other hand, it is not easy to see why you would want to call the
same function with differently named arguments - perhaps what you are
really trying to do has a different (and better) solution ?

best

Vladimir Dergachev

> I am not sure this is possible so I would be interested in your
> responses. Say I have a variable 'v' with the string "myargument" in it
> and I have a function 'f' that takes this argument as follows;
>
> f <- function( myargument=5 ) {
>     ... does something...
> }
>
> Is there anyway I can say something like;
>
> f( v=10 ) such that it will be evaluated as f( myargument=10 ).
>
> I presume there may be some use of eval and substitute but if someone
> could point me in the right direction then that would be great.
>
> The end idea is to have a list of m items, declared somewhere else,
> which can be evaluated as particular arguments named after their list
> names
>
> e.g
>
> mylist <- list( "a"=1, "b"=2, "c"=3 )
>
> which can be passed to a function taking arguments a, b, or c and it
> will be able to evaluate them accordingly:
>
> long hand this would evaluate to something like
> f( a=mylist[["a"]] );
> f( b=mylist[["b"]] );
> f( c=mylist[["c"]] );
>
> but I would have actually rewritten something like
> for ( myvar in names( mylist ) ) {
>     f( some_clever_substitution_to_act_as_argument(myvar) =
>        mylist[[ myvar ]] );
> }
>
> I hope I have explained myself clearly enough, if not please say so and
> I will try and give a better example.
>
> Many thanks for your help
>
> Tom
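The trick is that single-bracket indexing keeps the element's name, so
do.call() sees a named argument list of length one. For example:

f <- function(a = 1, b = 2, c = 3) c(a = a, b = b, c = c)
mylist <- list(a = 10, b = 20, c = 30)

do.call(f, mylist["b"])   # mylist["b"] is list(b = 20), so this runs f(b = 20)
#  a  b  c
#  1 20  3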
Re: [Rd] data frame subscription operator
On Wednesday 08 November 2006 3:21 am, Prof Brian Ripley wrote:
> > So far I was not able to figure out why this is necessary -
> > could anyone help ?
>
> You need to remove the class to avoid recursion: a few lines later x[i]
> needs to be a call to the primitive and not the data frame method.

I see. Is there a way to get at the primitive directly, i.e. something
like `[.list`(x, i) ?

> > The reason I am looking at it is that changing attributes forces
> > duplication of the data frame and this is the largest cause of
> > slowness of data.frames in general.
>
> Do you have evidence of that? R has facilities to profile its code, and
> I have never seen [.data.frame taking a significant proportion of the
> total time. If it does for your application, consider if a data frame
> is an appropriate way to store your data. I am not sure we would accept
> that data frames do have 'slowness in general', but their generality
> does make them slower than alternatives where the generality is not
> needed.

Evidence:

# this can be copy'n'pasted directly into an R session

# small N - both system calls return small, but comparable running times
N<-10
A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
system.time(B<-A[,1])
system.time(B<-A[1,1])

# larger N - both times are larger and still comparable
N<-100
A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
system.time(B<-A[,1])
system.time(B<-A[1,1])

The running times would also grow with the number of columns.

Also, I have modified the 2.4.0 version of R to print out large
allocations, and I get the impression that the data frame is being
duplicated. The same happens for `[<-.data.frame` - but that function has
much more complex code, and I have not looked through it yet.

Of course, getting a small portion (i.e. A[1:5,]) also takes a lot of
time - but the examples shown above should be O(1).

My data is the result of a data base query - it naturally has columns of
different types and the columns are named (no row.names though) - which
is why I used data.frames. What would you suggest ?

thank you very much !

Vladimir Dergachev
Re: [Rd] allocVector bug ?
On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > Hi Luke,
> >
> > I generally agree with this, however I believe that current logic
> > breaks down for large allocation sizes and my code ends up spending
> > 70% (and up) of computer time spinning inside garbage collector (I run
> > oprofile to observe what is going on).
>
> Again please be careful about these sorts of statements. I am sure
> there are bugs in the memory manager and places where things "break
> down" but this isn't one of them. The memory manager is quite
> deliberately biased towards keeping the total allocation low, if
> necessary at the expense of some extra gc overhead. This is needed if
> we want to use the same settings across a wide range of
> configurations, some of which have relatively little memory available
> (think student labs). The memory manager does try to learn about the
> needs of a session, and as a result triggering values get adjusted. It
> is not true that every large allocation causes a gc. This may be true
> _initially_, but once total memory usage stabilizes at a particular
> level it is no longer true (look at the way the heap limits are
> adjusted).
>
> This approach of adjusting based on usage within a session is
> reasonable and works well for longer sessions. It may not work well
> for short scripts that need large allocations. I doubt that any
> automated setting can work well in that situation while at the same
> time keeping memory usage in other settings low. So it may be useful
> to find ways of specifying a collection strategy appropriate for these
> situations. If you can send me a simplified version of your usage
> scenario then I will give this some thought and see if we can come up
> with some reasonable ways of allowing user code to tweak gc behavior
> for these situations.

Hi Luke,

Yes, I gladly concede the point that for a heuristic algorithm the notion
of what is a "bug" is murky (besides crashes, etc, which is not what I am
talking about).

Here is why I called this a bug:

1. My understanding is that each time gc() needs to increase memory it
performs a full garbage collection run. Right ?

2. This is not a problem with small memory sizes, as they imply
(presumably) small numbers of objects.

3. However, if one wants to allocate many objects (say columns in a data
frame, or just vectors) this results in a large penalty.

Example 1: This simulates allocation of a data.frame with some character
columns which are assumed to be factors. On my system the first
assignment is nearly instantaneous, while subsequent assignments take
slightly less than 0.1 seconds each.

L<-list()
Chars<-as.character(1:10)
for(i in 1:100) L[[i]]<-system.time(assign(paste("test", i), 1:100))
Times<-do.call(rbind, L)

Example 2: Same as example 1, but we first grow the memory with a fake
allocation:

L<-list()
Chars<-as.character(1:10)
Data<-1:1
rm(Data)
for(i in 1:100) L[[i]]<-system.time(assign(paste("test", i), 1:100))
Times<-do.call(rbind, L)

In this case the first 20 or so allocations are very quick (faster than
0.02 sec) and then the garbage collector kicks in and the time rises to
0.08 seconds each - still less than in Example 1.

This example is relevant because this sequence of allocations is exactly
what happens when one uses read.table or scan (or a database query) to
load data. What is more, if the user then manipulates the loaded data by
creating columns that are combinations of existing ones, then this is
very slow as well.

I looked more carefully at your code in src/main/memory.c, function
AdjustHeapSize:

    R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
	R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
	if (R_MaxVSize - R_VSize >= change)
	    R_VSize += change;
    }

Could it be that R_NSize should be R_VSize ? This would explain why I see
a problem in the case R_VSize >> R_NSize.

thank you very much !

Vladimir Dergachev
Re: [Rd] data frame subscription operator
On Wednesday 08 November 2006 11:41 am, Gabor Grothendieck wrote:
> .subset and .subset2 are equivalent to [ and [[ except that
> dispatch does not take place. See ?.subset

Thank you Gabor !

I made an experiment and got rid of

    class(x) <- attr(x, "row.names") <- NULL

while replacing all occurrences of x[ and x[[ with .subset and .subset2.

Results:

X<-A[,1] is now instantaneous, as it should be.

X<-A[1,1] is faster for data frames with many columns, but still appears
to make a copy of A[,1] before indexing. Not sure why..

thank you

Vladimir Dergachev
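For anyone trying this at the prompt, the dispatch-free accessors behave
like this (both are standard base functions):

A <- data.frame(X = 1:3, Y = c("a", "b", "c"))
.subset2(A, 1)   # column 1 as a vector, like A[[1]] but with no method dispatch
.subset(A, 2)    # a plain list holding column 2, like unclass(A)[2]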
Re: [Rd] allocVector bug ?
On Thursday 09 November 2006 12:21 pm, Luke Tierney wrote:
> On Wed, 8 Nov 2006, Vladimir Dergachev wrote:
> > On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> > > On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> >
> > Hi Luke,
> >
> > Yes, I gladly concede the point that for a heuristic algorithm the
> > notion of what is a "bug" is murky (besides crashes, etc, which is not
> > what I am talking about).
> >
> > Here is why I called this a bug:
> >
> > 1. My understanding is that each time gc() needs to increase memory
> > it performs a full garbage collection run. Right ?
>
> The allocation process does not call gc before every call to malloc.
> It only calls gc if the allocation would cross a threshold level.
> Those threshold levels are adjusted in an effort to compromise between
> keeping memory footprint low and not calling gc too often. The code
> you quote below is part of this adjustment process. If this process
> is working properly then as memory use grows there will initially be
> more gc activity and then less as the thresholds adjust.

Well, I was seeing it call gc for every large vector. This probably
happens only for those larger than R_VGrowIncrFrac * R_NSize. On my
system R_NSize is never more than 1e6, so this would explain the problems
when using 1e6 (and larger) vectors.

> > 2. This is not a problem with small memory sizes as they imply
> > (presumably) small number of objects.
> >
> > 3. However, if one wants to allocate many objects (say columns in a
> > data frame or just vectors) this results in large penalty
> >
> > Example 1: This simulates allocation of a data.frame with some
> > character columns which are assumed to be factors. On my system first
> > assignment is nearly instantaneous, why subsequent assignments take
> > slightly less than 0.1 seconds each.
>
> I'm not sure these are quite doing what you intend. You define Chars
> but don't use it. Also, system.time by default calls gc() before
> doing the evaluation. Giving FALSE as the second argument may give you
> a more realistic picture.

The Chars are defined to create lots of ncells and make the gc() run time
more realistic. It also mimics having a data.frame with a few factor
columns.

As for system.time - thank you, I missed that ! Setting gcFirst=FALSE
changes the behavior in the first example to be 2 times faster, and makes
all the allocations in the second example faster. I guess that extra call
to gc() caused R_VSize to shrink too fast.

> > I looked more carefully at your code in src/main/memory.c, function
> > AdjustHeapSize:
> >
> >    R_VSize = VNeeded;
> >    if (vect_occup > R_VGrowFrac) {
> >        R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
> >        if (R_MaxVSize - R_VSize >= change)
> >            R_VSize += change;
> >    }
> >
> > Could it be that R_NSize should be R_VSize ? This would explain why I
> > see a problem in case R_VSize>>R_NSize.
>
> That does indeed look like a bug and that R_NSize should be R_VSize --
> well spotted, thanks. I will need to experiment with this a bit more
> to see if it can safely be changed. It will increase the memory
> footprint a bit. Probably not by enough to matter, but if it does we
> may need to adjust some of the tuning constants.

Would there be something I can help you with ? Is there a script to run
through common usage patterns ?

thank you !

Vladimir Dergachev

> Best,
>
> luke
Re: [Rd] String to list and vice versa
On Tuesday 14 November 2006 12:00 pm, Tom McCallum wrote:
> Hi,
>
> I need to collapse a list into a string and then reparse it back into
> the list. Normally when I need to do this I simply use write.csv and
> read.csv, but I need to do this in memory within R rather than writing
> out to file. Are there any bespoke commands that anyone knows of that do
> something like this, or any tips that anyone can suggest? I basically
> don't care about the string representation, only that I can manipulate
> the list as a string and then reparse it back to a valid list object.

# List -> string:
#
# Put whatever you want into collapse to separate list entries
#
paste(unlist(L), collapse=",")

# String -> list
strsplit(S, ",")

best

Vladimir Dergachev

> Many thanks for your help,
>
> Tom
Re: [Rd] String to list and vice versa
On Tuesday 14 November 2006 12:28 pm, Prof Brian Ripley wrote:
> This approach won't work in very many cases (but then nor will
> write.csv).
>
> The safest way I know is to use serialize() and unserialize(). Next to
> that, deparse(control="all") and parse(text=) are quite good and give a
> human-readable character representation.
>
> If fidelity is not the main issue, as.character and toString spring to
> mind. unlist is recursive, and is not going to come close to being
> faithful for other than very simple lists. And what if ',' is a
> character in one of the list elements?

Yes, but then one can replace ',' with something rarely used, like \007.
I picked ',' because write.csv/read.csv worked before.

You are right that for storage serialize/unserialize seem best; however,
for manipulation one would usually prefer a well-defined format.

best

Vladimir Dergachev
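A quick illustration of the deparse/parse route mentioned above, which
stays faithful even with embedded commas and quotes:

L  <- list(a = 1:3, b = "x, \"y\"")
s  <- paste(deparse(L, control = "all"), collapse = " ")  # one string
L2 <- eval(parse(text = s))
identical(L, L2)   # TRUE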
[Rd] base-Ex.R make check failure
Hi all,

make check fails for me with the latest SVN code, in file base-Ex.R:

> sw[1,]  # a one-row data frame
Warning in format.data.frame(x, digits = digits, na.encode = FALSE) :
	corrupt data frame: columns will be truncated or padded with NAs
           Fertility Agriculture Examination Education
Courtelary      80.2          17          15        12
> sw[1,, drop=TRUE]  # a list
Warning in format.data.frame(x, digits = digits, na.encode = FALSE) :
	corrupt data frame: columns will be truncated or padded with NAs
           Fertility Agriculture Examination Education
Courtelary      80.2          17          15        12
>
> swiss[ c(1, 1:2), ]  # duplicate row, unique row names are created
Error in `[[<-.data.frame`(`*tmp*`, j, value = c(80.2, 80.2, 83.1)) :
	replacement has 3 rows, data has 47
Execution halted

R-2.4.0 runs through the same test just fine. Does anyone else see the
same thing ?

thank you !

Vladimir Dergachev
[Rd] data frame subset patch
Hi all,

Here is a patch that significantly speeds up the `[.data.frame` operator.
It applies cleanly to both 2.4.0 and svn trunk. Make check was OK for
2.4.0 (for svn trunk it fails even without this patch..).

What it does: we get rid of the class and attr statements that modify the
incoming data frame, and use explicit calls to .subset and .subset2
instead.

Test case:

N<-10
T<-data.frame(a=1:N, b=rnorm(N), c=as.character(round(runif(N)*10)))
system.time({X<-0 ; for(i in 1:1000) X<-X+T[i,2]})

Without the patch the output on my system is:

[1]  8.488  2.436 10.926  0.000  0.000

With this patch the output is:

[1] 1.084 0.624 1.707 0.000 0.000

thank you !

Vladimir Dergachev
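For readers following along, the shape of the change is roughly this (a
sketch, not the patch itself):

## before: stripping attributes to avoid dispatch modifies x,
## which forces a duplication of the whole data frame
cl <- oldClass(x)
class(x) <- attr(x, "row.names") <- NULL
xj <- x[[j]]

## after: dispatch-free accessors read from x without touching it
xj <- .subset2(x, j)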
[Rd] data frame subset patch, take 2
Hi Robert,

Here is the second iteration of the data frame subset patch. It now
passes make check on both 2.4.0 and 2.5.0 (svn as of a few days ago).
Same speedup as before.

Changes:

* Introduced two new functions .subassign2 and .subassign that are
  complementary to .subset2 and .subset.

* Changed the x[[j]]<- assignment to x<-.subassign2(x, j, ..) to fix the
  problem with the previous patch.

thank you !

Vladimir Dergachev
[Rd] empty pages in xyplot (2.4.0)
In 2.4.0 (and SVN) I am seeing xyplot creating empty pages for high page
counts in layout - contrary to the manual, which says high page counts
should not matter. Everything works fine in 2.3.1.

library("lattice")
A<-data.frame(x=1:10, y=sin(1:10), z=round(1:10/3))
xyplot(x~y|z, A, layout=c(1,1,10))

The snippet above produces a valid plot in R 2.3.1, while in 2.4.0 and
later I see a blank page with "x" and "y" letters on it.

Can anyone else reproduce this ?

thank you very much !

Vladimir Dergachev
Re: [Rd] data frame subset patch, take 2
On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote:
> - Vladimir, have you verified your 'take2' against recent versions
>   of R-devel?

Yes.

> - If they still work, could you re-post them to R-devel, this
>   time using a proper MIME type, i.e. most probably one of
>       application/x-tar
>       application/x-compressed-tar
>       application/x-gzip
>
> In case you don't know how to achieve this,
> I'd be interested to get it by "private" e-mail.

No problem. The old e-mail did have a mime type: "text/x-diff". I am
resending the patch - now compressed; hopefully it will get past whatever
filters are in place.

With regard to speedups in R, here is my wish list - I would greatly
appreciate comments on what makes sense here or not, etc:

1. I greatly miss equivalents of the Tcl append and lappend commands -
not the function performed by these commands, but their efficiency (they
are O(1) on average; see the sketch at the end of this message). Tcl
easily handles lists with 1e6 components and strings 10s of megabytes in
length.

2. It would be nice to have true hashed arrays in R (i.e. O(1) access
times). So far I have used named lists for this, but they are O(n):

> L<-list(); system.time(for(i in 1:1)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000
> L<-list(); system.time(for(i in 1:2)L[[paste(i)]]<-i);
[1] 11.789 0.216 12.004 0.000 0.000

3. Efficient manipulation of large numbers of strings. The big reason
character row.names are slow is that they require a large number of
string objects, which slow down the garbage collector. This is possibly
not a problem that has an easy solution; here are a couple of approaches
I have considered:

a) Inline strings - use a structure like

union {
	struct {
		unsigned char size;
		char body[15];
	} inlined_string;   /* use this when size < 16 */
	struct {
		unsigned char flag;
		char reserved[7];   /* for 64 bit */
		CHARSXP ptr;
	} indirect_string;  /* use this when flag == 255 */
}

This basically turns small strings into an enum-like type stored within a
128-bit integer. This would greatly decrease the required number of
CHARSXPs in many common cases (in particular for many rownames). The
biggest disadvantage is more complicated access to string data. Also,
this does not solve the issue of how to deal with strings 1e6 characters
long - though I feel 15 characters should be good enough for most uses.

b) CHARSXPs are always leaf nodes. One could implement true reference
counting and create a separate garbage collector pool for them. This way
one can rely on reference counting to free string objects during normal
operation, but also keep track of the number of referenced strings during
garbage collector passes - and trigger string garbage collection passes
(with a warning) when the number of referenced strings is much smaller
than the number of objects in the string pool. This gets rid of the
overhead that strings impose on the garbage collector.

The disadvantage is very large changes to R code.

best

Vladimir Dergachev

subset.patch.2.diff.gz
Description: GNU Zip compressed data
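An amortized-O(1) append can be approximated in R by growing a list
geometrically and tracking the fill level by hand (a sketch - buf, n and
append1 are made-up names, and R's copy-on-assign semantics still add
some cost on each write):

buf <- vector("list", 16); n <- 0
append1 <- function(buf, n, x) {
    if (n + 1 > length(buf)) length(buf) <- 2 * length(buf)  # double when full
    buf[[n + 1]] <- x
    buf
}
for (x in 1:1000) { buf <- append1(buf, n, x); n <- n + 1 }
result <- buf[seq_len(n)]   # trim the unused tail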
Re: [Rd] data frame subset patch, take 2
On Wednesday 13 December 2006 1:23 pm, Marcus G. Daniels wrote:
> Vladimir Dergachev wrote:
> > 2. It would be nice to have true hashed arrays in R (i.e. O(1) access
> > times). So far I have used named lists for this, but they are O(n):
>
> new.env(hash=TRUE) with get/assign/exists works ok. But I suspect it's
> just that named lists are too easy to use, and that has bad performance
> ramifications for user code (perhaps the R developers are more vigilant
> about this in the R code itself).

Cool, thank you !

I wonder whether environments could be extended to allow names() to work
(although I see that ls() does the same thing) and to allow for(i in E)
loops.

thank you

Vladimir Dergachev
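For reference, the environment-as-hash idiom looks like this (all
standard base functions):

E <- new.env(hash = TRUE)
assign("key1", 42, envir = E)
exists("key1", envir = E)    # TRUE - O(1) lookup
get("key1", envir = E)       # 42
for (nm in ls(E))            # ls() stands in for names()
    print(get(nm, envir = E))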
Re: [Rd] data frame subset patch, take 2
On Saturday 16 December 2006 4:41 pm, Martin Maechler wrote:
> Correction: the problems show on both platforms;
>
> one is in mgcv, gam(), an error in [[<- -- pretty clearly linked to
> your changes, but not reproducible when tried in isolation
> interactively,
>
> the other one is a seg.fault "memory not mapped" when running the
> example
>
> > nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
> +      fit.pred=new.fit, x.pred=new.data)
>
> MM> My guess: typically when dealing with model.frames (which
> MM> internally are "just" data frames with a particular "terms"
> MM> attribute) but the problems are not reproducible when run
> MM> interactively. It may really be that .subset() and .subset2() are
> MM> sometimes used in cases they should not be in your new code; or
> MM> they even have a bug that is not triggered unless by using them in
> MM> the new context of [.data.frame
>
> MM> So I'm sorry, but we might have to wait for a "take 3"
> MM> or rather try to find the problem with your patch.
> MM> Maybe you can try yourself?

Hi Martin,

thank you very much for the feedback ! Of course, there is going to be a
take 3 :)

I have reproduced your tests with slightly different results:
boot.Rcheck fails, stats.Rcheck segfaults, cluster.Rcheck fails.

More importantly, I was able to reproduce the problem interactively with
boot.Rcheck. When interactive, I found that the issue has random outcomes
- sometimes it segfaults and sometimes it produces this:

1) boot.Rcheck fails with

> nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
+      fit.pred=new.fit, x.pred=new.data)
Error: incompatible types (from NULL to list) in [[ assignment
Execution halted

but does not segfault. Other times it errors out in different places or
goes through fine. On one occasion I observed a very interesting
behaviour - the R console looked like it was completely confused about
which functions were being called and about the arguments passed to
them.

After some tinkering, I realized that, perhaps, the problem was with me
adding the .subassign and .subassign2 functions and this somehow
interfering with saved workspaces. So I did make clean (after updating
SVN) and the problem appears to be gone.

Could you try doing make clean && make on your installation and reporting
the results ?

thank you very much !

Vladimir Dergachev
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Wednesday 03 January 2007 3:47 am, Erik van Zijst wrote:
> Hi All,
> My problem is about parallel execution of R-scripts. My platform is linux.
> A program that is written in C needs to execute multiple R-scripts
> simultaneously. The C program makes use of multi-threading. Each thread
> must initiate the execution of one script. Performance is very important.
> Apparently the R C-API does not provide a mechanism for parallel
> execution.
> It is preferred that the solution is not based on multi-processing (like
> C/S), because that would introduce IPC overhead.

One thing to keep in mind is that IPC is very fast on Linux, so unless you are making lots of calls to really tiny functions this should not be an issue. What can be an issue is the overhead of starting a new R process. In that case you can create helper processes that do the same work you wanted from multiple threads, and simply pass the data around.

best

Vladimir Dergachev

> Hopefully some thread-safe (single-process) solution is readily
> available, written in C.
> What is the best solution to do this?
> (If there is no single-process solution, what is the alternative?)
> Regards,
> Erik.
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
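A minimal sketch of the helper-process idea on a Unix-like system: keep one long-lived slave R process and feed it work through a pipe, avoiding repeated R start-up costs (the file name helper.out is made up for the example):

    helper <- pipe("R --vanilla --slave > helper.out 2>&1", open = "w")
    writeLines("x <- rnorm(1e6); cat(mean(x), fill = TRUE)", helper)
    writeLines("y <- rnorm(1e6); cat(sd(y), fill = TRUE)", helper)
    close(helper)             # flushes the commands; helper evaluates and exits
    readLines("helper.out")   # collect the printed results

A real system would keep the helper alive and read results back through a second channel (a fifo or socket) as they are produced; the one-way pipe above is just the simplest runnable form.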
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Thursday 04 January 2007 4:54 am, Erik van Zijst wrote:
> Vladimir Dergachev wrote:
> > On Wednesday 03 January 2007 3:47 am, Erik van Zijst wrote:
> >> Apparently the R C-API does not provide a mechanism for parallel
> >> execution.
> >> It is preferred that the solution is not based on multi-processing
> >> (like C/S), because that would introduce IPC overhead.
> > One thing to keep in mind is that IPC is very fast on Linux. So unless
> > you are making lots of calls to really tiny functions this should not
> > be an issue.
> Using pipes or shared memory to pass things around to other processes on
> the same box is very fast indeed, but if we base our design around
> something like RServe which uses TCP it could be significantly slower.
> Our R-based system will be running scripts in response to high-volume
> real-time stock exchange data, so we expect lots of calls to many tiny
> functions indeed.

Very interesting :)

If you are running RServe on another box you will need to send the data over Ethernet anyway (and will probably use TCP). If it is on the same box and you use "localhost", the packets go over the loopback interface - which is significantly faster.

At some point (years ago) there was even an argument on some mailing list (xfree86-devel ?) about whether the X server should support shared memory, as the unix socket was "fast enough" - with the other side arguing that when you pass megabyte images around (as in DVD playback) there is non-negligible overhead.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
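To put a number on the loopback overhead, a hedged sketch using the Rserve client functions (RSconnect, RSeval, RSclose); it assumes an Rserve daemon already listening on the default port 6311, and the exact RSeval call form may differ between Rserve versions:

    library(Rserve)
    conn <- RSconnect(host = "localhost", port = 6311)
    # 1000 round trips of a trivial expression: with an evaluation this
    # cheap, nearly all of the measured time is transport overhead
    st <- system.time(for (i in 1:1000) RSeval(conn, "1 + 1"))
    print(st[3] / 1000)   # elapsed seconds per call over loopback
    RSclose(conn)

Comparing this per-call cost against the evaluation time of the real functions gives the evaluation-time versus overhead ratio directly.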
Re: [Rd] help for memory problem with 64-bit machines
On Friday 05 January 2007 12:10 pm, Peter Dalgaard wrote:
> Hin-Tak Leung wrote:
> > I got the same error with 64-bit R 2.4.1 on FC6 x86_64, and 32-bit
> > R 2.4.1 on the same machine is okay. There is definitely something wrong
> > with your code.
> > I would suggest fixing all the compiler warnings - there are piles of
> > them about uninitialized variables, and about doing comparison
> > between signed and unsigned expressions, etc first. Put -Wall in
> > CFLAGS, CXXFLAGS and FFLAGS and you'll see.

Also, the issue I most commonly see is the difference in size of the "long" data type. On 32-bit platforms sizeof(long) == sizeof(int) == 4 bytes, but on 64-bit platforms sizeof(long) == 8 while sizeof(int) == 4. This breaks the formerly safe practice of using long to get a 32-bit integer (which also ensured the code compiled correctly on 16-bit machines).

best

Vladimir Dergachev

> > good luck.
> > Hin-Tak Leung
> Good advice. Also, the most common culprit for 64/32 problems is pointers
> stored as integers, so watch out for any of those. And notice that you
> can set a breakpoint at randsk1_ and start poking around to see what is
> inside various variables, and single-step to the point of the crash (it's
> a bit painful and confusing in Fortran code, though.)
> -pd
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
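From within R one can check the relevant sizes directly, which is a quick way to tell whether a build is affected:

    .Machine$sizeof.long      # 4 on a 32-bit build, 8 on a 64-bit build
    .Machine$sizeof.pointer   # likewise 4 vs 8 - pointers stored in ints break here
    # sizeof(int) stays 4 on both, hence the long/int mismatch on 64-bit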
Re: [Rd] How to execute R scripts simultaneously from multiple threads
On Monday 08 January 2007 6:36 am, Hin-Tak Leung wrote:
> Erik van Zijst wrote:
> > Vladimir Dergachev wrote:
> >> At some point (years ago) there was even an argument on some mailing
> >> list (xfree86-devel ?) about whether Xserver should support shared
> >> memory as unix socket was "fast enough" - with the other side arguing
> >> that when you pass megabyte images around (as in DVD playback) there is
> >> non-negligible overhead.
> > We're currently doing performance tests with the RServe-approach where
> > we measure the actual evaluation time of a function. I'm interested in
> > the evaluation-time versus overhead ratio. Loopback TCP might work as
> > long as this ratio is sufficiently high.
> Slightly off-topic, but Vladimir sounded as if there were still an argument
> about supporting shared memory in X... AFAIK, the shared memory extension
> *is* part of Xorg!
> $ grep 'MIT-SHM' /var/log/Xorg.0.log
> (II) Initializing built-in extension MIT-SHM

It is - and it was when the discussion happened (several years ago). The issue was whether to introduce shared memory support for the Xv extension. (And yes, it was introduced.)

best

Vladimir Dergachev

> - the shared memory extension is also crucial for
> client-side font rendering (xft/freetype), a.k.a. all those
> nicely anti-aliased texts in firefox and openoffice, besides
> DVD playback.
> HTL
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] C vs. C++ as learning and development tool for R
On Friday 19 January 2007 1:29 pm, Gabor Grothendieck wrote:
> > If you decide to use C++ with R you should check out the documentation
> > that comes with the package RcppTemplate, and the sample code that
> > comes with that package. In my experience C++ (or C or FORTRAN) is
> > needed for many compute intensive tasks, and the R framework provides
> > a nice front-end with its extensive collection of visualization and
> > statistical analysis tools.
> Actually I have found the opposite. I have never found C/C++ to be
> necessary. I have always been able to optimize the R code itself to get
> it to run sufficiently fast for my purposes.

The nice thing about being able to use C code is the confidence it provides: however slowly your R script runs right now, you know you will be able to make it faster - no matter what. On quite a few occasions I have started writing C code and, after thinking about how I would structure it, realized that I could do the same thing in R and still get 50% of the speed improvement I would get from C.

Also, I am not sure whether this is mentioned anywhere, but I found it more convenient to use dyn.load directly instead of creating a full-blown R package - the edit-compile-test cycle is much shorter that way.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
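A minimal sketch of that dyn.load workflow; the file conv.c and the symbol vec_scale are invented for the example:

    # conv.c is assumed to contain a .C-callable routine, e.g.:
    #   void vec_scale(double *x, int *n, double *s) {
    #       int i;
    #       for (i = 0; i < *n; i++) x[i] *= *s;
    #   }
    system("R CMD SHLIB conv.c")   # edit ... compile to conv.so
    dyn.load("conv.so")            # ... load (conv.dll on Windows)
    x <- as.double(1:10)
    out <- .C("vec_scale", x = x, as.integer(length(x)), as.double(2.5))$x
    dyn.unload("conv.so")          # unload so the next compile can be reloaded

Since .C copies its arguments and returns them as a list, the scaled vector comes back as the named component x, leaving the original untouched.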
Re: [Rd] C vs. C++ as learning and development tool for R
On Friday 19 January 2007 6:46 pm, Ross Boylan wrote:
> On Fri, Jan 19, 2007 at 03:55:30AM -0500, Kimpel, Mark William wrote:
> I can't say much about "libraries already on other machines", but the
> C runtime is probably the one you can count on being there the most.

Well, I don't think it is there on Windows machines - and it is specific to the compiler. Visual C has shipped several different versions, Borland had its own, and there have been several major releases of the GNU C library.

My preference is that on Windows one distributes only static binaries or a small loadable object (i.e. a DLL) loaded from Tcl/Tk or R. On Linux I found it is best to link the C and X11/GL libraries dynamically (as older versions are usually available) and link everything else statically. Major exception: Condor-linked binaries are static.

Caveat - I have only distributed GPL/LGPL code, so making static binaries was not an issue. If you have a closed-source application, then any LGPL libraries you use must be linked dynamically, and you cannot use GPL code at all.

best

Vladimir Dergachev
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel