[Rd] segfault / crash when asking for large memory via strrep()

2016-06-01 Thread Martin Maechler
We've had this more general topic on R-help, and recently on R-devel as well.
Here is one case where I get the feeling that R never gets into
swapping but aborts more directly, possibly because of a bug we can
fix more easily.

Today I've been working (successfully! - not yet committed) on
fixing str() for very large strings.

In this process, I've found that

   pc <- function(.) paste(., collapse=".1.2.3.4.5.")
   p  <- function(.) strrep(pc(.), 64L)
   p(p(p(p(LETTERS))))

produces a (memory related) segmentation fault (aka "crash")
very reproducibly and relatively quickly
both on my Linux (Fedora 22) desktop and on our Windows server.

 *** caught segfault ***
address 0x7fc52dc89000, cause 'memory not mapped'

Traceback:
 1: strrep(pc(.), 64L)
 2: p(p(p(p(LETTERS))))
 3: system.time(L2 <- p(p(p(p(LETTERS)))))

In the debugger, the symptoms point to the possibility of a
bug just in the C parts of strrep() :


Program received signal SIGSEGV, Segmentation fault.
0x754d6223 in __strcpy_sse2_unaligned () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
bzip2-libs-1.0.6-14.fc22.x86_64 libgcc-5.3.1-6.fc22.x86_64 
libgfortran-5.3.1-6.fc22.x86_64 libgomp-5.3.1-6.fc22.x86_64 
libicu-54.1-4.fc22.x86_64 libquadmath-5.3.1-6.fc22.x86_64 
libstdc++-5.3.1-6.fc22.x86_64 ncurses-libs-5.9-18.20150214.fc22.x86_64 
pcre-8.38-4.fc22.x86_64 readline-6.3-5.fc22.x86_64 xz-libs-5.2.0-2.fc22.x86_64 
zlib-1.2.8-7.fc22.x86_64
(gdb) bt
#0  0x754d6223 in __strcpy_sse2_unaligned () from /usr/lib64/libc.so.6
#1  0x00457def in do_strrep (call=<optimized out>, op=<optimized out>,
    args=<optimized out>, env=<optimized out>)
    at ../../../R/src/main/character.c:1658
#2  0x004d6844 in bcEval (body=body@entry=0xd66840, rho=rho@entry=0x45253b8,
    useCache=useCache@entry=TRUE) at ../../../R/src/main/eval.c:5648
#3  0x004dd240 in Rf_eval (e=0xd66840, rho=0x45253b8)
    at ../../../R/src/main/eval.c:616
#4  0x004dedaf in Rf_applyClosure (call=call@entry=0x45250a8,
    op=op@entry=0xd668e8, arglist=0x45251f8, rho=rho@entry=0x4525000,
    suppliedvars=0xa57188) at ../../../R/src/main/eval.c:1134
#5  0x004dd3b1 in Rf_eval (e=0x45250a8, rho=0x4525000)
    at ../../../R/src/main/eval.c:732
#6  0x004dedaf in Rf_applyClosure (call=call@entry=0x4525718,
    op=op@entry=0x4524d28, arglist=0x4524f90, rho=rho@entry=0xa8ea30,
    suppliedvars=0xa57188) at ../../../R/src/main/eval.c:1134
#7  0x004dd3b1 in Rf_eval (e=0x4525718, rho=0xa8ea30)
    at ../../../R/src/main/eval.c:732
#8  0x004e0cde in do_set (call=0x4525670, op=0xa61358, args=<optimized out>,
    rho=0xa8ea30) at ../../../R/src/main/eval.c:2196

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] segfault / crash when asking for large memory via strrep()

2016-06-01 Thread luke-tierney

That would be because the product nc * ni overflows in

    cbuf = buf = CallocCharBuf(nc * ni);

Since we disallow strings with more than 2^31-1 bytes we could test
and reject this. It might be more future-proof to change the
declaration of

    int j, ni, nc;

to

    R_xlen_t j, ni, nc;

and let the character allocation code reject, but that would create a
memory leak since the Free call isn't reached. This is a problem in
any case though, as

    SET_STRING_ELT(s, is, markKnown(cbuf, STRING_ELT(x, ix)));

could throw errors for a number of reasons and then the Free() is not
reached. It would be better to use R_alloc or register a cleanup
function to call Free on a jump.

Best,

luke



--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu



Re: [Rd] segfault / crash when asking for large memory via strrep()

2016-06-01 Thread luke-tierney

I've added a size/overflow check before the buffer allocation in
R-devel and R-patched.

It would be a good idea sometime to review the use of

calloc
...
free

patterns to make sure the ... can't raise an error or otherwise jump
and leave the memory pointer dangling.

Best,

luke



[Rd] [RfC] Family dispersion

2016-06-01 Thread Luis Carvalho
Hi,

I'd like to hear your opinion about the following proposal to make the
computation of dispersion in GLMs more flexible. Dispersion is used in
summary.glm; the relevant code chunk with the dispersion calculation is listed
below (from glm.R):


summary.glm <- function(object, dispersion = NULL,
                        correlation = FALSE, symbolic.cor = FALSE, ...)
{
  est.disp <- FALSE
  df.r <- object$df.residual
  if(is.null(dispersion))   # calculate dispersion if needed
    dispersion <-
      if(object$family$family %in% c("poisson", "binomial"))  1
      else if(df.r > 0) {
        est.disp <- TRUE
        if(any(object$weights==0))
          warning("observations with zero weight not used for calculating dispersion")
        sum((object$weights*object$residuals^2)[object$weights > 0])/ df.r
      } else {
        est.disp <- TRUE
        NaN
      }
  # ...
}


Many exponential families have unit dispersion, or can be cast to have unit
dispersion, e.g. hypergeometric, negative binomial, and so on. However,
summary.glm only assigns unit dispersion to Poisson and binomial families, as
the code above indicates. My suggestion is to make this check more general by
having a 'dispersion' slot in the family class; for instance, we would have
poisson(...)$dispersion = 1 and binomial(...)$dispersion = 1. The updated
summary.glm would be:


default.dispersion <- function (object, ...) {
  df.r <- object$df.residual
  if (df.r > 0) {
    if (any(object$weights == 0))
      warning("observations with zero weight not used for calculating dispersion")
    sum((object$weights * object$residuals ^ 2)[object$weights > 0]) / df.r
  }
  else NaN
}

summary.glm <- function(object, dispersion = default.dispersion,
                        correlation = FALSE, symbolic.cor = FALSE, ...)
{
  if (!is.null(object$family$dispersion)) # use family dispersion?
    dispersion <- object$family$dispersion
  est.disp <- is.function(dispersion)
  dispersion <- if (est.disp) dispersion(object, ...) else dispersion

  df.r <- object$df.residual
  # ... (unchanged code below)
}


Note that 'dispersion' can be either a number (e.g. 1) or a function that
takes a glm object. Here are some examples:

R> library(MASS)
R> gm <- glm(formula, family=Gamma())
R> summary(gm, dispersion = gamma.dispersion) # ML estimate of dispersion

R> set.dispersion <- function (fam, disp) # update family dispersion
R>   structure(within(unclass(fam), dispersion <- disp), class = "family")
R> gm <- glm(formula, family=set.dispersion(Gamma(), gamma.dispersion))
R> summary(gm) # use family dispersion
R> Exp <- function (...) set.dispersion(Gamma(...), 1)

Thanks in advance for the feedback.

Cheers,
Luis


-- 
Computers are useless. They can only give you answers.
-- Pablo Picasso

-- 
Luis Carvalho
Associate Professor
Dept. of Mathematics and Statistics
Boston University
http://math.bu.edu/people/lecarval
