Re: [R] Regular Expressions

Prof Brian Ripley Fri, 05 Nov 2010 00:10:08 -0700

On Thu, 4 Nov 2010, Noah Silverman wrote:

Hi,
I'm trying to figure out how to use capturing parentheses in regular expressions in R. (Doing this in Perl, Java, etc. is fairly trivial, but I can't seem to find the functionality in R.)
For example, given the string:    "10 Nov 13.00 (PFE1020K13)"

I want to capture the first two digits and then the month abbreviation.

In perl, this would be

/^(\d\d)\s(\w\w\w)\s/

Then I have the variables $1 and $2 assigned to the capturing parentheses.
I've found the grep and sub commands in R, but the docs don't indicate any way to capture things.
Any suggestions?

Read the link to ?regexp. It *does* 'indicate the way to capture things'.


     The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
     previously matched by the Nth parenthesized subexpression of the
     regular expression.  (This is an extension for extended regular
     expressions: POSIX defines them only for basic ones.)

and there is an example on the help page for grep():

     ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
     gsub("([ab])", "\\1_\\1_", "abc and ABC")
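
The quoted passage describes backreferences inside the pattern itself, while the grep() example uses them in the replacement string. A small sketch of both uses follows; the vector `words` is made up purely for illustration.

```r
# Backreference in the replacement: each captured 'a' or 'b' is written twice.
doubled <- gsub("([ab])", "\\1_\\1_", "abc and ABC")
print(doubled)

# Backreference in the pattern: "(.)\\1" matches any character that is
# immediately followed by the same character, i.e. a doubled letter.
words <- c("hello", "world", "book")   # illustrative input, not from the thread
has_double <- grepl("(.)\\1", words)
print(has_double)
```

In both cases the backslash has to be doubled inside an R string literal, which is why `\1` is written as `"\\1"`.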

In your example

x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "(\\d\\d)\\s(\\w\\w\\w).*"   # trailing .* consumes the rest, so sub() drops it
sub(regex, "\\1", x, perl = TRUE)    # first group:  "10"
sub(regex, "\\2", x, perl = TRUE)    # second group: "Nov"
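
One thing to watch with this approach is that sub() returns its input unchanged when the pattern does not match, so a non-matching string can be mistaken for a captured value. A minimal sketch of one way to guard against that, assuming NA is an acceptable marker for "no match"; the second element of `x` and the leading `^` anchor (mirroring the Perl pattern) are additions for illustration.

```r
x <- c("10 Nov 13.00 (PFE1020K13)", "no date here")  # second element is illustrative
regex <- "^(\\d\\d)\\s(\\w\\w\\w).*"

# sub() leaves non-matching elements untouched, so test for a match first
# and substitute NA where the pattern did not apply.
matched <- grepl(regex, x, perl = TRUE)
day   <- ifelse(matched, sub(regex, "\\1", x, perl = TRUE), NA_character_)
month <- ifelse(matched, sub(regex, "\\2", x, perl = TRUE), NA_character_)

print(day)
print(month)
```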

A better way to do this would be something like

regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

which is also a POSIX extended regexp.
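
As a possible alternative to calling sub() once per group, more recent versions of base R provide regexec() and regmatches(), which return all captured groups in one step. The sketch below is not part of the original advice; it assumes those two functions are available in your R version, and it writes the whitespace as `[[:space:]]` and drops the trailing `.*` since nothing needs to be replaced.

```r
x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "^([[:digit:]]{2})[[:space:]]([[:alpha:]]{3})"

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
m <- regexec(regex, x)
parts <- regmatches(x, m)[[1]]

# parts[1] is the whole match; the groups follow in order.
day   <- parts[2]
month <- parts[3]
print(parts)
```

Here `day` and `month` play the role of Perl's $1 and $2.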

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
