Assuming the rule is an upper case letter followed by two other non-space characters and then a string of digits, try this:

> library(gsubfn)
> strapply(x, "[A-Z][^ ][^ ][0-9]+")
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

If you prefer the output as one long vector of strings try this:

> strapply(x, "[A-Z][^ ][^ ][0-9]+", simplify = c)
[1] "YP_177963" "CAA15575"  "CAA17111"

If the string that denotes a protein can be part of a word which itself
does not denote a protein then we will need something like this:

> strapply(x, "\\b[A-Z][^ ][^ ][0-9]+\\b", perl = TRUE)
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

However, I would expect this second solution using perl's \b to be much
slower because the first one uses tcl code underneath whereas the second
uses R code.

See http://gsubfn.googlecode.com for more.

On Wed, Sep 16, 2009 at 9:53 AM, Giulio Di Giovanni
<perimessagg...@hotmail.com> wrote:
>
> Hi all,
>
> I have thousands of strings like these ones:
>
> "1159_1; YP_177963; PPE FAMILY PROTEIN"
> "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
> "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"
>
> and various others..
>
> I'm interested to extract the code for the protein (in this example:
> YP_177963, CAA15575, CAA17111).
>
> I found only one common criterion to identify the protein codes in ALL my
> strings:
>
> I need a sequence of characters selected in this way:
>
> start:
> the first alphabetic capital letter followed after three characters by a digit
>
> end:
> the last following digit before a non-digit character, or nothing.
>
> Tricky, isn't it?
>
> Well, I'm not an expert, and I played a lot with regular expressions and
> sub() command with no big results. Also with substring.location in Hmisc
> package (but here I don't know how to use regular expressions).
>
> Maybe there are other more useful functions or maybe is just a matter to use
> regular expression in a better way...
>
> Can anybody help me?
>
> Thanks a lot in advance...
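
For anyone who cannot install gsubfn, here is a minimal base R sketch of the same extraction. It assumes x holds the three example strings from the quoted message and that only the first code in each string is wanted. The names pat and codes are just illustrative, and regmatches() and regexpr() come from base R rather than gsubfn.

```r
# Example strings copied from the quoted message (x is assumed to look like this).
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Upper case letter, two non-space characters, then a run of digits.
pat <- "[A-Z][^ ][^ ][0-9]+"

# regexpr() locates the first match in each string; regmatches() extracts it.
# perl = TRUE asks for the PCRE engine instead of R's default one.
codes <- regmatches(x, regexpr(pat, x, perl = TRUE))
print(codes)
```

Note that regmatches() silently drops strings with no match, so codes can be shorter than x; gregexpr() in place of regexpr() would return every match per string as a list.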