This seems to be simply integer overflow in a calculation.
Changed in R-patched to use doubles.
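
As a rough illustration of why doing the calculation in doubles helps: the actual fix is in R's C sources, which are not shown in this thread, so the snippet below is only an R-level analogy and its variable names are made up for the example. R's 32-bit integer arithmetic gives up just past 2^31 - 1, while the same sum in double precision carries on.

```r
# R-level analogy only; the real change was made in R's C code.
n <- .Machine$integer.max            # largest 32-bit signed integer

# Integer arithmetic: R returns NA (with a warning) on overflow,
# where C code would typically wrap around to a negative value.
int_sum <- suppressWarnings(n + 1L)

# Double arithmetic: exact for whole numbers far beyond 2^31.
dbl_sum <- as.double(n) + 1

print(is.na(int_sum))
print(dbl_sum == 2^31)
```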

The issue I patched for Kenneth Roy Cabrera was for perl = FALSE only.

On Tue, 3 Nov 2009, William Dunlap wrote:

Here is a more self-contained way to reproduce the problem in 2.10.0
using the prebuilt Windows executable.  Putting a trace on gsub in
the call to strapply showed that it died in the first call to gsub
when the replacement included "\\1" and the string was about 900000
characters long (and included 150000 "words").  It looks like it
dies if the string is >= 731248 characters.

d<-substring(paste(collapse=" ", sapply(1:150000,function(i)"abcde")), 1, 
731248)
nchar(d)
[1] 731248
substring(d, nchar(d)-10)
[1] " abcde abcd"
p<-gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
 Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
 Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
 Reached total allocation of 1535Mb: see help(memory.size)
p<-gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
 Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)

Make d one character shorter and it succeeds with either
perl=TRUE or perl=FALSE.
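
One way to read the negative numbers in the Calloc messages, assuming (this is not confirmed anywhere in the thread) that the requested size wrapped around in a 32-bit signed integer: adding 2^32 gives a candidate for the size that was meant. The helper name unwrap32 is hypothetical.

```r
# Hypothetical helper: undo a 32-bit signed wrap-around.
# Numeric literals in R are doubles, so nothing overflows here.
unwrap32 <- function(x) ifelse(x < 0, x + 2^32, x)

# The two values reported in this thread.
reported <- c(windows_repro   = -2146542248,
              original_report = -1398215180)

intended <- unwrap32(reported)
print(format(intended, big.mark = ","))

# Both candidates are larger than the biggest 32-bit signed integer.
exceeds <- intended > .Machine$integer.max
print(exceeds)
```

Both candidates fall between 2^31 and 2^32, which is consistent with an overflow but does not show which calculation overflowed.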

version
              _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          10.0
year           2009
month          10
day            26
svn rev        50208
language       R
version.string R version 2.10.0 (2009-10-26)
sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tcltk_2.10.0

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Richard R. Liu
Sent: Tuesday, November 03, 2009 3:00 PM
To: Kenneth Roy Cabrera Torres
Cc: r-help@r-project.org; Uwe Ligges
Subject: Re: [R] R 2.10.0: Error in gsub/calloc

Kenneth,

Thanks for the hint.  I downloaded and installed the latest patch, but
to no avail.  I can reproduce the error on a single sentence, the
longest in the document.  It contains 743,393 characters.  It isn't a
true sentence, but since it is more than three standard deviations
longer than the mean sentence length, I might be able to use the mean
and the standard deviation as a way of weeding out the really evident
"non-sentences" before I take into account the characteristics of the
tokens.
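
The length filter described above could be sketched as follows. The data are invented for the example (30 short sentences and one very long run of words), and the three-standard-deviation cutoff is simply the rule of thumb mentioned in the message, not a tested recommendation.

```r
# Toy data: 30 ordinary sentences plus one implausibly long "sentence".
short_sentences <- rep("the quick brown fox", 30)
long_sentence   <- paste(rep("abcde", 1000), collapse = " ")
sentences <- c(short_sentences, long_sentence)

# Character length of each candidate sentence.
len <- nchar(sentences)

# Drop anything more than three standard deviations above the mean length.
cutoff <- mean(len) + 3 * sd(len)
kept   <- sentences[len <= cutoff]

print(length(sentences))
print(length(kept))
```

With very few sentences no single value can lie three standard deviations from the mean, so a rule like this only makes sense on a reasonably large collection.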

Regards,
Richard

On Nov 3, 2009, at 20:44 , Kenneth Roy Cabrera Torres wrote:

Try the patched version...
Maybe it is the same problem I had with a large
database when using gsub()

HTH

On Tue, 03-11-2009 at 20:31 +0100, Richard R. Liu wrote:
I apologize for not being clear.  d is a character vector of length
158908.  Each element in the vector has been designated by sentDetect
(package: openNLP) as a sentence.  Some of these are really
sentences.  Others are merely groups of meaningless characters
separated by white space.  strapply is a function in the package
gsubfn.  It applies the regular expression (second argument) to each
element of the first argument.  Every match is then sent to the
designated function (third argument, in my case missing, hence the
identity function).  Thus, with strapply I am simply performing a
white-space tokenization of each sentence.  I am doing this in the
hope of being able to distinguish true sentences from false ones on
the basis of mean length of token, maximum length of token, or
similar.
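
For readers without gsubfn, the tokenization and the per-sentence statistics described above can be approximated in base R. This is a sketch, not gsubfn's implementation: it assumes a version of R that provides regmatches() (later than the 2.10.0 discussed in this thread), and the two sample strings are invented.

```r
# Two invented "sentences": one plausible, one mostly noise.
d <- c("This is a sentence.", "x9 qq zzzzzzzzzzzz 7")

# Extract every run of word characters from each element,
# roughly what strapply(d, "\\w+", perl = TRUE) is described as doing.
tokens <- regmatches(d, gregexpr("\\w+", d, perl = TRUE))

# Per-sentence token statistics that might separate real sentences from noise.
mean_len <- sapply(tokens, function(tk) mean(nchar(tk)))
max_len  <- sapply(tokens, function(tk) max(nchar(tk)))

print(tokens[[1]])
print(data.frame(mean_len, max_len))
```

An element with no matches at all would need special handling, since max() of an empty vector returns -Inf with a warning.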

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard....@pueo-owl.ch


On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:



richard....@pueo-owl.ch wrote:
I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
this is a Mac-specific problem.
I have a very large (158,908 possible sentences, ca. 58 MB) plain
text document d which I am trying to tokenize:
t <- strapply(d, "\\w+", perl = T).  I am
encountering the following error:


What is strapply() and what is d?

Uwe Ligges




Error in base::gsub(pattern, rs, x, ...) :
Calloc could not allocate (-1398215180 of 1) memory
This happens regardless of whether I run in 32- or 64-bit mode.  The
machine has 8 GB of RAM, so I can hardly believe that RAM is a problem.
Thanks,
Richard
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained,
reproducible code.




--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595