This should do it for you:

> pat <- ".*(\\b[A-Z]..[0-9]+).*"
> grep(pat, x)
[1] 1 3 5
> sub(pat, '\\1', x)
[1] "YP_177963" ""          "CAA15575"  ""          "CAA17111"

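For anyone who wants to run this without the original data, here is a minimal self-contained sketch. It assumes x holds the three sample strings from the question; the fourth string, the `hits` and `codes` names, and the NA handling are my own additions for illustration and are not part of the thread. The point of the extra step is that sub() returns its input unchanged when the pattern does not match, so non-matching strings need to be masked explicitly.

```r
# Sample strings copied from the question, plus one hypothetical
# string with no protein code (an assumption, added for illustration).
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE",
       "NO PROTEIN CODE")

# Same pattern as above: a capital letter at a word boundary, any two
# characters, then one or more digits. The surrounding ".*" consume the
# rest of the string so that sub() keeps only the captured group.
pat <- ".*(\\b[A-Z]..[0-9]+).*"

# grepl() flags which strings contain a code at all.
hits <- grepl(pat, x)

# Keep the captured code where there is a match, NA otherwise.
codes <- ifelse(hits, sub(pat, "\\1", x), NA_character_)
print(codes)
```

With thousands of strings, it is probably worth checking which(!hits) afterwards to see which inputs the pattern missed.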
On Wed, Sep 16, 2009 at 9:53 AM, Giulio Di Giovanni <perimessagg...@hotmail.com> wrote:
>
> Hi all,
>
> I have thousands of strings like these:
>
> "1159_1; YP_177963; PPE FAMILY PROTEIN"
> "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
> "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"
>
> and various others.
>
> I'm interested in extracting the code for the protein (in this example:
> YP_177963, CAA15575, CAA17111).
> I found only one common criterion to identify the protein codes in ALL my
> strings. I need a sequence of characters selected in this way:
>
> start:
> the first alphabetic capital letter followed after three characters by a digit
>
> end:
> the last following digit before a non-digit character, or nothing.
>
> Tricky, isn't it?
> Well, I'm not an expert, and I played a lot with regular expressions and
> the sub() command with no big results. Also with substring.location in the
> Hmisc package (but here I don't know how to use regular expressions).
> Maybe there are other more useful functions, or maybe it is just a matter of
> using regular expressions in a better way...
>
> Can anybody help me?
>
> Thanks a lot in advance...
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?