[Rd] bug and patch: strptime first-of-month error in (possibly unsupported use of) "%j" format (PR#9577)
Full_Name: John Brzustowski Version: R-devel-trunk OS: linux (problem under Windows too) Submission from: (NULL) (74.101.124.238) (This bug was discovered by Phil Taylor, Acadia University.) I'm not sure from reading the documentation whether strptime(x, "%j") is meant to be supported, but if so, there is a bug which prevents it from working on the first day of months after January: > strptime(31:33, "%j") [1] "2007-01-31" NA "2007-02-02" # the full extent of R's taunting strptime(1 + cumsum(c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30)), "%j") [1] NA NA NA NA NA NA NA NA NA NA NA > The problem is an edge-case comparison in datetime.c:glibc_fix(). This generates a date like "Jan 32", which validate_tm() catches and NAs. (Values of field tm->tm_yday start at 0, not 1.) = PATCH: --- R/src/main/datetime.c (revision 40860) +++ R/src/main/datetime.c (working copy) @@ -796,7 +796,7 @@ if(tm->tm_yday != NA_INTEGER) { /* since we have yday, let that take precedence over mon/mday */ int yday = tm->tm_yday, mon = 0; - while(yday > (tmp = days_in_month[mon] + + while(yday >= (tmp = days_in_month[mon] + ((mon==1 && isleap(1900+tm->tm_year))? 1 : 0))) { yday -= tmp; mon++; __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] fix for broken largefile seek() on 32-bit linux (PR#9883)
Full_Name: John Brzustowski Version: R-devel-trunk, R-2.4.0 OS: linux Submission from: (NULL) (206.248.132.197) DESCRIPTION seek() on files larger than 2 gigabytes fails for large values of "where" on i386 linux 2.6.13 (and presumably other 32-bit unix-like platforms). e.g.: > f<-file("3gigabytefile.dat", "rb") > seek(f, 3e9, "start", "r") [1] 0 ## correct > seek(f, NA, "start", "r") [1] 0 ## should be 3e+09 DIAGNOSIS Typo: the compile-time tests for large file support use "HAVE_SEEKO" instead of "HAVE_FSEEKO", and so fail. The same typo appears in one of the extra/zlib files, so I'm fixing it in the patch below, but I haven't tested whether that actually produces a bug. PATCH Index: src/extra/zlib/gzio.c === --- src/extra/zlib/gzio.c (revision 42664) +++ src/extra/zlib/gzio.c (working copy) @@ -25,7 +25,7 @@ #include "zutil.h" /* R ADDITION */ -#if defined(HAVE_OFF_T) && defined(HAVE_SEEKO) +#if defined(HAVE_OFF_T) && defined(HAVE_FSEEKO) #define f_seek fseeko #define f_tell ftello #else Index: src/include/Rconnections.h === --- src/include/Rconnections.h (revision 42664) +++ src/include/Rconnections.h (working copy) @@ -63,7 +63,7 @@ typedef struct fileconn { FILE *fp; -#if defined(HAVE_OFF_T) && defined(HAVE_SEEKO) +#if defined(HAVE_OFF_T) && defined(HAVE_FSEEKO) off_t rpos, wpos; #else #ifdef Win32 Index: src/main/connections.c === --- src/main/connections.c (revision 42664) +++ src/main/connections.c (working copy) @@ -446,7 +446,7 @@ /* --- file connections - */ -#if defined(HAVE_OFF_T) && defined(HAVE_SEEKO) +#if defined(HAVE_OFF_T) && defined(HAVE_FSEEKO) #define f_seek fseeko #define f_tell ftello #else @@ -570,7 +570,7 @@ { Rfileconn this = con->private; FILE *fp = this->fp; -#if defined(HAVE_OFF_T) && defined(HAVE_SEEKO) +#if defined(HAVE_OFF_T) && defined(HAVE_FSEEKO) off_t pos; #else #ifdef Win32 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] "bug" and patch: quadratic running time for strsplit(..., fixed=TRUE) (PR#9902)
Full_Name: John Brzustowski Version: R-devel-trunk, R-2.4.0 OS: linux, gcc 4.0.3 Submission from: (NULL) (206.248.157.184) This isn't a bug, but an easily-remedied performance issue. SYMPTOM > for (i in 1000 * (1:20)) { y <- paste(rep("asdf", times=i), collapse=" ") t <- system.time(strsplit(y, " ", fixed=TRUE)) cat(sprintf("i=%5d time=%5d msec\n",i, round(1000*t[1]))) } i= 1000 time=2 msec i= 2000 time=9 msec i= 3000 time= 20 msec i= 4000 time= 34 msec i= 5000 time= 57 msec i= 6000 time= 77 msec i= 7000 time= 107 msec i= 8000 time= 136 msec i= 9000 time= 177 msec i=1 time= 230 msec i=11000 time= 275 msec i=12000 time= 308 msec i=13000 time= 371 msec i=14000 time= 446 msec i=15000 time= 544 msec i=16000 time= 639 msec i=17000 time= 726 msec i=18000 time= 864 msec i=19000 time= 944 msec i=2 time= 1106 msec DIAGNOSIS strsplit() uses strlen() in the bounds check clause of a for(;;) statement, which forces a full scan of the source string for each character in the source string. Unlike R's LENGTH() macro, strlen for C strings is an expensive operation, and in this case (at least), gcc 4.0.3's -O2 level optimizer is not able to recognize the call as a loop invariant, despite the declaration "const char *buf". REMEDIED BEHAVIOUR i= 1000 time=0 msec i= 2000 time=1 msec i= 3000 time=1 msec i= 4000 time=0 msec i= 5000 time=1 msec i= 6000 time=1 msec i= 7000 time=1 msec i= 8000 time=2 msec i= 9000 time=2 msec i=1 time=2 msec i=11000 time=2 msec i=12000 time=2 msec i=13000 time=2 msec i=14000 time=2 msec i=15000 time=4 msec i=16000 time=3 msec i=17000 time=3 msec i=18000 time=4 msec i=19000 time=3 msec i=2 time=4 msec RELATED ISSUES A simple search turns up other instances of this usage in R's source. For completeness, I'm submitting patches for all of them, but have not tested whether they in fact cause a detectable performance problem. In the case of modules/X11/dataentry.c, the patch also fixes a presumably ineffectual "bug". $ grep -nR "for *([^;]*;[^;]*strlen *(" * main/rlocale.c:137: for (i = 0; i < strlen(lc_str) && i < sizeof(lc_str); i++) main/printutils.c:486: for(j = 0; j < strlen(buf); j++) *q++ = buf[j]; main/sysutils.c:493:for(j = 0; j < strlen(sub); j++) *outbuf++ = sub[j]; modules/X11/rotated.c:608: for(i=0; i 1 && strncmp(bufp, split, slen))) continue; ntok++; @@ -480,7 +481,7 @@ /* This is UTF-8 safe since it compares whole strings, but it would be more efficient to skip along by chars. */ - for(; bufp - buf < strlen(buf); bufp++) { + for(; bufp < ebuf; bufp++) { if((slen == 1 && *bufp != *split) || (slen > 1 && strncmp(bufp, split, slen))) continue; if(slen) { Index: src/main/rlocale.c === --- src/main/rlocale.c (revision 42792) +++ src/main/rlocale.c (working copy) @@ -127,14 +127,14 @@ int Ri18n_wcwidth(wchar_t c) { char lc_str[128]; -unsigned int i; +unsigned int i, j; static char *lc_cache = ""; static int lc = 0; if (0 != strcmp(setlocale(LC_CTYPE, NULL), lc_cache)) { strncpy(lc_str, setlocale(LC_CTYPE, NULL), sizeof(lc_str)); - for (i = 0; i < strlen(lc_str) && i < sizeof(lc_str); i++) + for (i = 0, j = strlen(lc_str); i < j && i < sizeof(lc_str); i++) lc_str[i] = toupper(lc_str[i]); for (i = 0; i < (sizeof(cjk_locale_name)/sizeof(cjk_locale_name_t)); i++) { Index: src/main/printutils.c === --- src/main/printutils.c (revision 42792) +++ src/main/printutils.c (working copy) @@ -483,7 +483,8 @@ else #endif snprintf(buf, 11, "\\u%04x", k); - for(j = 0; j < strlen(buf); j++) *q++ = buf[j]; + memcpy(q, buf, j = strlen(buf)); + q += j; p += res; } i += (res - 1); Index: src/main/sysutils.c === --- src/main/sysutils.c (revision 42792) +++ src/main/sysutils.c (working copy) @@ -490,8 +490,9 @@ R_AllocStringBuffer(2*cbuff.bufsize, &cbuff); goto top_of_loop; } - for(j = 0; j < strlen(sub); j++) *outbuf++ = sub[j]; - outb -= strlen(sub); + memcpy(outbuf, sub, j = strlen(sub)); + outbuf += j; + outb -= j; } inbuf++; inb--; goto next_char; Index: src/modules/
[Rd] minor bug and patch: attr(x, "names")<-y when y is a pairlist (PR#10807)
Full_Name: John Brzustowski Version: R-devel trunk and R 2.4.0 OS: linux Submission from: (NULL) (76.10.152.79) # Bug: > x<-1:3 > attr(x, "names")<-pairlist("a", "b", "c") > x a a a 1 2 3 # Note that the simpler alternative does work: > names(x)<-pairlist("a", "b", "c") > x a b c 1 2 3 # After applying the patch: > x<-1:3 > attr(x, "names")<-pairlist("a", "b", "c") > x a b c 1 2 3 # The problem is in src/main/attrib.c: namesgets(), where # the pointer to the pairlist of names is not advanced. # Here is a simple patch against R-devel trunk. A cleaner # but more complicated approach would be # to refactor namesgets() and do_namesgets(). # # Note that the name list need not be as long as the LHS, # because subsequent code pads the name list with NAs, if needed. Index: trunk/src/main/attrib.c === --- trunk/src/main/attrib.c (revision 44547) +++ trunk/src/main/attrib.c (working copy) @@ -681,7 +681,7 @@ SEXP namesgets(SEXP vec, SEXP val) { int i; -SEXP s, rval; +SEXP s, rval, tval; PROTECT(vec); PROTECT(val); @@ -695,8 +695,8 @@ else { rval = allocVector(STRSXP, length(vec)); PROTECT(rval); - for (i = 0; i < length(vec); i++) { - s = coerceVector(CAR(val), STRSXP); + for (i = 0, tval=val; i < length(vec) && tval != R_NilValue; i++, tval=CDR(tval)) { + s = coerceVector(CAR(tval), STRSXP); SET_STRING_ELT(rval, i, STRING_ELT(s, 0)); } UNPROTECT(1); __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] patch: two minor debugging-related pointer protection stack issues (PR#10832)
Full_Name: John Brzustowski Version: R-devel trunk and R-2.4.0 OS: linux Submission from: (NULL) (76.10.152.79) Here are two minor potential issues in pointer protection stack (PPS) code which could arise in debugging R with valgrind. Only the second could possibly cause a problem in normal use of R, and only under laughably implausible circumstances. (1) valgrind is given a wrong address when it is told to mark the reserved "red zone" portion of the PPS as "no access". This might generate spurious access errors while debugging under valgrind. (2) unprotect_ptr always does one more copy than necessary when collapsing the slot of the pointer it is unprotecting. This is but a trivial performance penalty unless the PPS is full, in which case it causes a read of the first word in the PPS red zone. If issue (1) were corrected, that would trigger a spurious valgrind error. If all of the following were to occur: a) the PPS overflowed b) the cleanup code filled the PPS red zone c) unprotect_ptr was called from the cleanup code while the PPS red zone was full and d) the memory malloc'd for the PPS occupied the very end of a readable segment with fewer than sizeof(SEXP) trailing guard bytes then a segfault would result. (You are laughing at this point, right?) In any case, here is a patch against R-devel trunk: Index: trunk/src/main/memory.c === --- trunk/src/main/memory.c (revision 44589) +++ trunk/src/main/memory.c (working copy) @@ -1557,7 +1557,7 @@ R_Suicide("couldn't allocate memory for pointer stack"); R_PPStackTop = 0; #if VALGRIND_LEVEL > 1 -VALGRIND_MAKE_NOACCESS(R_PPStackTop+R_PPStackSize,PP_REDZONE_SIZE); +VALGRIND_MAKE_NOACCESS(R_PPStack+R_PPStackSize,PP_REDZONE_SIZE); #endif vsfac = sizeof(VECREC); R_VSize = (((R_VSize + 1)/ vsfac)); @@ -2329,11 +2329,11 @@ } while ( R_PPStack[--i] != s ); /* OK, got it, and i is indexing its location */ -/* Now drop stack above it */ +/* Now drop stack above it, if any */ -do { - R_PPStack[i] = R_PPStack[i + 1]; -} while ( i++ < R_PPStackTop ); +while (++i < R_PPStackTop) { + R_PPStack[i - 1] = R_PPStack[i]; +} R_PPStackTop--; } __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] bug/suggestion: debugger should respect option "deparse.max.lines" when printing the call (PR#13647)
Full_Name: John Brzustowski Version: 2.8.1 OS: linux Submission from: (NULL) (67.71.250.146) When entering a debug()'ed function, the call printout is not limited by options()$deparse.max.lines as it is when one uses browser() or trace(). Should it be? If so, here's a patch: diff -cr R-2.8.1/src/library/base/man/options.Rd R-2.8.1-patched/src/library/base/man/options.Rd *** R-2.8.1/src/library/base/man/options.Rd 2008-11-12 04:23:36.0 -0500 --- R-2.8.1-patched/src/library/base/man/options.Rd 2009-04-09 10:12:31.0 -0400 *** *** 77,84 initialized (see \code{\link{Startup}}).} \item{\code{deparse.max.lines}:}{controls the number of lines used ! when deparsing in \code{\link{traceback}} and ! \code{\link{browser}}. Initially unset, and only used if set to a positive integer.} \item{\code{digits}:}{controls the number of digits to print when --- 77,85 initialized (see \code{\link{Startup}}).} \item{\code{deparse.max.lines}:}{controls the number of lines used ! when deparsing in \code{\link{traceback}}, \code{\link{browser}}, ! and upon entry to a function whose debugging flag is set. ! Initially unset, and only used if set to a positive integer.} \item{\code{digits}:}{controls the number of digits to print when diff -cr R-2.8.1/src/main/eval.c R-2.8.1-patched/src/main/eval.c *** R-2.8.1/src/main/eval.c 2008-10-05 18:05:02.0 -0400 --- R-2.8.1-patched/src/main/eval.c 2009-04-09 09:45:03.0 -0400 *** *** 526,531 --- 526,532 volatile SEXP newrho; SEXP f, a, tmp; RCNTXT cntxt; + int itmp; /* formals = list of formal parameters */ /* actuals = values to be bound to formals */ *** *** 609,615 --- 610,621 SET_DEBUG(newrho, DEBUG(op)); if (DEBUG(op)) { Rprintf("debugging in: "); + + itmp = asInteger(GetOption(install("deparse.max.lines"), R_BaseEnv)); + if(itmp != NA_INTEGER && tmp > 0) R_BrowseLines = itmp; PrintValueRec(call,rho); + R_BrowseLines = 0; + /* Is the body a bare symbol (PR#6804) */ if (!isSymbol(body) & !isVectorAtomic(body)){ /* Find out if the body is function with only one statement. */ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] corrected patch: debugger should respect deparse.max.lines (PR#13648)
Full_Name: John Brzustowski Version: 2.8.1 OS: linux Submission from: (NULL) (67.71.250.146) My apologies - there was a bug in the patch I submitted on this topic earlier today. diff -cr R-2.8.1/src/library/base/man/options.Rd R-2.8.1-patched/src/library/base/man/options.Rd *** R-2.8.1/src/library/base/man/options.Rd 2008-11-12 04:23:36.0 -0500 --- R-2.8.1-patched/src/library/base/man/options.Rd 2009-04-09 10:12:31.0 -0400 *** *** 77,84 initialized (see \code{\link{Startup}}).} \item{\code{deparse.max.lines}:}{controls the number of lines used ! when deparsing in \code{\link{traceback}} and ! \code{\link{browser}}. Initially unset, and only used if set to a positive integer.} \item{\code{digits}:}{controls the number of digits to print when --- 77,85 initialized (see \code{\link{Startup}}).} \item{\code{deparse.max.lines}:}{controls the number of lines used ! when deparsing in \code{\link{traceback}}, \code{\link{browser}}, ! and upon entry to a function whose debugging flag is set. ! Initially unset, and only used if set to a positive integer.} \item{\code{digits}:}{controls the number of digits to print when diff -cr R-2.8.1/src/main/eval.c R-2.8.1-patched/src/main/eval.c *** R-2.8.1/src/main/eval.c 2008-10-05 18:05:02.0 -0400 --- R-2.8.1-patched/src/main/eval.c 2009-04-09 09:45:03.0 -0400 *** *** 526,531 --- 526,532 volatile SEXP newrho; SEXP f, a, tmp; RCNTXT cntxt; + int itmp; /* formals = list of formal parameters */ /* actuals = values to be bound to formals */ *** *** 609,615 --- 610,621 SET_DEBUG(newrho, DEBUG(op)); if (DEBUG(op)) { Rprintf("debugging in: "); + + itmp = asInteger(GetOption(install("deparse.max.lines"), R_BaseEnv)); + if(itmp != NA_INTEGER && itmp > 0) R_BrowseLines = itmp; PrintValueRec(call,rho); + R_BrowseLines = 0; + /* Is the body a bare symbol (PR#6804) */ if (!isSymbol(body) & !isVectorAtomic(body)){ /* Find out if the body is function with only one statement. */ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] 2.3.1: interacting bugs in load() and gzfile() (PR#9271)
Hello, If repeated calls are made to save() using the same pre-opened gzfile connection to a file, and then the connection is closed, the objects saved by the second and subsequent calls are not correctly restored by repeated calls to load() with a new gzfile connection to the same file. What follows are a session exposing the bugs, analysis (see ANALYSIS), patches (see PATCHES), and a session showing correct behaviour after patching (see FIXED). The problematic code is still present in the trunk of the svn repository. This is really due to two minor bugs, but because they interact, I'm submitting this as one report. Regards, John Brzustowski THE BUGS (with comments inserted) > sessionInfo() Version 2.3.1 (2006-06-01) i686-pc-linux-gnu attached base packages: [1] "methods" "stats" "graphics" "grDevices" "utils" "datasets" [7] "base" ## create two objects and save them to a gzfile: > x <- 1:10 > y <- x^2 > f <- gzfile("tmp.dat", "wb") > save(x, file=f) ## notice where the second object will be written: > seek(f, NA) [1] 88 > save(y, file=f) > close(f) ## open the file and try to load the two objects, one at a time > f <- gzfile("tmp.dat", "rb") > e <- new.env() > load(f,e) > ls(e) [1] "x" ## x was loaded correctly, but the connection has been closed: > f descriptionclass mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can readcan write "closed" "yes" "no" ## (see ANALYSIS) ## Try again: > load(f,e) Warning message: this is already a gzcon connection in: gzcon(con, level, allowNonCompressed) > ls(e) [1] "x" ## why the warning? ## and why was nothing apparently read this time? > f descriptionclass mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can readcan write "opened""yes" "no" > load(f,e) Warning message: this is already a gzcon connection in: gzcon(con, level, allowNonCompressed) > ls(e) [1] "x" "y" ## this time, y was read correctly, but again why the warning? > f descriptionclass mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can readcan write "closed""yes" "no" ## ## try again, but first seek on the connection to the location of the second object > close(f) > f <- gzfile("tmp.dat", "rb") > e <- new.env() > seek(f,88) [1] 0 > load(f,e) > ls(e) [1] "x" ## this should have loaded the object "y", which according to the earlier call to seek() ## begins at file offset 88. ANALYSIS There are two minor bugs here: 1. The internal function loadFromConnection() incorrectly closes a connection that was already open when it was called, instead of closing a connection that was *not* already open. 2. gzfile() sets a class of its return value to "file", rather than "gzfile": > f <- gzfile("whatever.gz", "wb") > class(f) [1] "file" "connection" In contrast, bzfile() exhibits correct behaviour: > f2<-bzfile("more.bz2", "wb") > class(f2) [1] "bzfile" "connection" This prevents load() from noticing that the connection passed to it is already a gzfile, since it tests for inheritance from class "gzfile". Therefore, load() always re-opens the gzfile connection with gzcon(), which causes reading to proceed from the beginning of the file, regardless of any seek() on the original gzfile connection. This reopening of a gzfile with gzcon() causes the warning, and prevents the seek() before the load() from having any effect. Interaction: because (with bug 2) load() always reopens a gzfile connection, bug 1's effect is to not close that reopened connection. Ironically, this allows (with warnings) objects from multiple calls to save() to be restored by multiple calls to load(), as demonstrated by the section in the session above, provided load() is being called with a filename or closed connection rather than an open gzfile connection. PATCHES: for bug 1: diff -u R-2.3.1.original/src/main/saveload.c R-2.3.1/src/main/saveload.c --- R-2.3.1.original/src/main/saveload.c2006-04-09 18:19:51.0 -0400 +++ R-2.3.1/src/main/saveload.c 2006-10-02 13:10:21.0 -0400 @@ -2301,6 +2301,7 @@ PROTECT(res = RestoreToEnv(R_Unserialize(&in), aenv)); if (wasopen) { endcontext(&cntxt); +} else { con->close(con); } UNPROTECT(1); for bug 2: diff -u R-2.3.1.original/src/main/connections.c R-2.3.1/src/main/connections