Thanks to all who replied. perl = TRUE does indeed seem to fix the problem. It 
would be great, however, to keep others from stumbling into this pitfall by 
fixing the issue, if that is possible. But as Prof. Ripley mentioned, fixing it 
might be difficult or impossible, so we may have to live with it. 


By the way, is there an easily accessible and searchable list of such bugs for 
R (just for future reference)?


Thanks a lot
Jannis



----- Original Message -----
From: Sarah Goslee <sarah.gos...@gmail.com>
To: Duncan Murdoch <murdoch.dun...@gmail.com>
Cc: Jannis <bt_jan...@yahoo.de>; "r-help@r-project.org" <r-help@r-project.org>
Sent: 15:37 Friday, 9 December 2011
Subject: Re: [R] unexpected behaviour of sub() / usage of regexp

But I do get the incorrect result on R 2.14.0 on linux:
> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"

And also:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"
> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"

But:
> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"

So it seems to be something about the way the curly braces are
handled, but only with certain groups:

> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"


But, as Prof. Ripley's email suggests, perl=TRUE solves the problem.
(I was trying out various combinations when it appeared in my inbox.)
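
A minimal sketch of the workaround, for anyone who wants to check their own
installation. It reuses the pattern and test strings from the examples above;
the variable names (x, default_result, pcre_result, affected) are made up for
illustration. The default-engine result is deliberately not annotated, since
it differs between platforms, as this thread shows.

```r
# Test strings taken from the examples in this thread
x <- c("9ewww", "ewww9")

# Default regex engine: the call that gave wrong results on some Linux builds
default_result <- sub("[[:digit:]]{1,2}", "", x)

# Same pattern, but matched by the PCRE engine via perl = TRUE
pcre_result <- sub("[[:digit:]]{1,2}", "", x, perl = TRUE)
print(pcre_result)  # [1] "ewww" "ewww"

# TRUE for each string where the default engine disagrees with PCRE
affected <- default_result != pcre_result
print(affected)
```

If any element of affected is TRUE, the installation shows the behaviour
reported here, and passing perl = TRUE to sub() is the workaround.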

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C                 LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base



On Fri, Dec 9, 2011 at 9:25 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> On 09/12/2011 9:20 AM, Jannis wrote:
>>
>> Dear R users,
>>
>>
>> the way I understand the documentation of sub() and regexp the following
>> code:
>>
>>
>>
>> sub('[[:digit:]]{1,2}', '', '9ewww')
>>
>>
>>
>> ... should yield:
>>
>> 'ewww'
>>
>>
>> It returns, however:
>>
>> 'www'
>>
>>
>> Why is this the case? My code should just substitute 1 (minimum) or up to
>> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
>> misinterpret something here?
>
>
> I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on
> Windows.   So it's not a universal problem...
>
> Duncan Murdoch
>
>>
>> Thanks for any ideas
>> Jannis
>>
>>
>> >  sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: i686-pc-linux-gnu (32-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>

-- 
Sarah Goslee
http://www.functionaldiversity.org


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
