On Sun, Dec 26, 2010 at 7:26 PM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:
> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <han...@depauw.edu> wrote:
>> Hello R Folks...
>>
>> I've been looking around the 'net and I see many complex solutions in
>> various languages to this question, but I have a pretty simple need (and I'm
>> not much good at regex). I want to use a chemical formula as a function
>> argument. The formula would be in "Hill order", which is to list C, then H,
>> then all other elements in alphabetical order. My example will have only a
>> limited number of elements, few enough that one can search directly for each
>> element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that
>> for oxygen and bromine, O or Br, there is no following number, meaning a 1 is
>> implied).
>>
>> Let's say
>>
>>> form <- "C5H11BrO"
>>
>> I'd like to get the count of each element, so in this case I need to extract
>> C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular
>> weight by multiplying). Sounds pretty simple, but my experiments with grep
>> and strsplit don't immediately clue me in to an obvious solution. As I said,
>> I don't need a general solution to the problem of calculating molecular
>> weight from an arbitrary formula, that seems quite challenging, just a way
>> to convert "form" into a list or data frame which I can then do the math on.
>>
>> Here's hoping this is a simple issue for more experienced R users! TIA,
>
> This can be done with strapply in the gsubfn package. It matches the
> regular expression against the target string, passing the back
> references (the parenthesized portions of the regular expression) to a
> specified function as successive arguments.
>
> Thus the first arg is form, your input string. The second arg is the
> regular expression, which matches an upper case letter optionally
> followed by lower case letters, and all that is optionally followed by
> digits.
> The third arg is a function shown in formula notation. strapply
> passes the back references (i.e. the portions within parentheses) to
> the function as its two arguments. The simplify argument is another
> function in formula notation, which turns the result into a matrix and
> then a data frame. Finally we make the second column of the data frame
> numeric.
>
> library(gsubfn)
>
> DF <- strapply(form,
>   "([A-Z][a-z]*)(\\d*)",
>   ~ c(..1, if (nchar(..2)) ..2 else 1),
>   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
> DF[[2]] <- as.numeric(DF[[2]])
>
> DF looks like this:
>
> > DF
>   V1 V2
> 1  C  5
> 2  H 11
> 3 Br  1
> 4  O  1
>
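
For anyone without gsubfn installed, the same extraction can be roughly sketched in base R, followed by the molecular weight step the original question was after. This is only an illustration under stated assumptions: the helper names parse_formula and mol_weight and the small atomic_weight table (rounded standard values) are made up for this sketch and are not part of gsubfn or of the code quoted above.

```r
# Hypothetical helper: split a Hill-order formula into elements and counts
# using base R only (no gsubfn required).
parse_formula <- function(form) {
  # Each token is an upper-case letter, optional lower-case letters,
  # then optional digits, e.g. "C5", "H11", "Br", "O".
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)    # strip the trailing digits
  count <- sub("^[A-Za-z]+", "", tokens)   # strip the leading letters
  count[count == ""] <- "1"                # a missing count implies 1
  data.frame(element = element, count = as.numeric(count),
             stringsAsFactors = FALSE)
}

# Assumed lookup table of rounded atomic weights (g/mol); extend as needed.
atomic_weight <- c(C = 12.011, H = 1.008, O = 15.999, Br = 79.904)

# Hypothetical helper: multiply each count by its atomic weight and sum.
mol_weight <- function(form, weights = atomic_weight) {
  DF <- parse_formula(form)
  unknown <- setdiff(DF$element, names(weights))
  if (length(unknown) > 0)
    stop("no atomic weight for: ", paste(unknown, collapse = ", "))
  sum(weights[DF$element] * DF$count)
}

DF <- parse_formula("C5H11BrO")
print(DF)
print(mol_weight("C5H11BrO"))
```

With these rounded weights the result for C5H11BrO comes out at about 167.05 g/mol. Like the regular expression in the strapply call, this handles only flat formulas; parentheses, hydrates and charges are out of scope.
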
Here is a variation on the strapply call that is slightly simpler. The
function in the third argument has been changed from c to paste so
that it outputs strings like "C 5". With output in this form we can
use read.table to read it directly, creating a data frame.

> strapply(form,
+   "([A-Z][a-z]*)(\\d*)",
+   ~ paste(..1, if (nchar(..2)) ..2 else 1),
+   simplify = ~ read.table(textConnection(..1)))
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com