Re: [R] Memory management in R

Lorenzo Isella Sat, 09 Oct 2010 06:45:52 -0700

Hi David,

I am replying to you and to the other people who provided some insight into my problems with grepl.

Well, at least we now know that the bug is reproducible.

The sequence I am postprocessing is indeed a strange one, probably pathological to some extent. Nevertheless, it has to be acknowledged that grepl crashes when a long (but not huge) chunk of repeated data is loaded.

Now, my problem is the following: given a potentially long string (or, before that, a sequence in which every element has been generated via the hash function, algo='crc32', of the digest package), how can I, starting from an arbitrary position i along the list, calculate the shortest substring in the future of i (i.e. the interval i:end of the series) that has not occurred in the past of i (i.e. [1:(i-1)])?

Efficiency is not the main point here. I need to run this code only once to get what I need, but it cannot crash on a 2000-entry string.
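For reference, here is a minimal sketch of that search which avoids regular expressions entirely and compares the elements of the sequence directly. The function names occurs_in and shortest_new_run are hypothetical, and the sketch assumes that x is a character vector of hash values and that "the past" means strictly x[1:(i-1)], with no overlap into the future. It is one possible approach, not a tuned implementation.

```r
# Does the run `pat` occur as a contiguous block somewhere inside `past`?
occurs_in <- function(pat, past) {
  n <- length(past)
  m <- length(pat)
  if (m > n) return(FALSE)
  # Candidate start positions: places where the first element already matches.
  starts <- which(past[seq_len(n - m + 1)] == pat[1])
  for (s in starts) {
    if (all(past[s:(s + m - 1)] == pat)) return(TRUE)
  }
  FALSE
}

# Length of the shortest run starting at position i that never
# occurred in x[1:(i-1)]; NA if every run starting at i is already known.
shortest_new_run <- function(x, i) {
  past <- x[seq_len(i - 1)]
  for (len in seq_len(length(x) - i + 1)) {
    if (!occurs_in(x[i:(i + len - 1)], past)) return(len)
  }
  NA_integer_
}
```

For example, with x <- c("a", "b", "a", "b", "c"), shortest_new_run(x, 3) gives 3, because "a" and "a","b" both occur in the past "a","b" while "a","b","c" does not. The cost is quadratic in the worst case, which should be acceptable for a one-off run on fewer than 2000 entries.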

Cheers


Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:

What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with longer
lists.


But maybe your loop terminated earlier in them. Someplace between
11*225 and 11*240 the grepping machine gives up:

 > eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
 > grepl(eprs, eprs)
[1] TRUE

 > eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
 > grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'
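A possible workaround, assuming the pattern is meant as a literal string rather than as a regular expression: the fixed = TRUE argument of grepl matches the pattern as is, so no regular expression has to be compiled. The snippet reuses the eprs example above.

```r
# Same 240-repeat string that triggers the regcomp error above.
eprs <- paste(rep("aaaaaaaaaa", 240), collapse = "#")

# fixed = TRUE treats the pattern as a literal string, not a regex.
res <- grepl(eprs, eprs, fixed = TRUE)
print(res)
```

Note that a literal substring match on pasted hash values can still match across element boundaries (for instance "b#c" is found inside "ab#cd"), so this addresses the error but is not by itself a complete solution to the original problem.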

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution, with the vast majority of entries
taking the same value that appeared in your error message:

 > table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat
improbable under a random null hypothesis):

 > max(rle(x)$lengths)
[1] 1159
 > which(rle(x)$lengths == 1159)
[1] 123
 > rle(x)$values[123]
[1] "12653a6"

HTH (although I think it means you need to construct a different
implementation strategy);

David.

Many thanks

Lorenzo


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
