Thanks guys. I've pulled out my O'Reilly book and will begin reviewing it.

------------------------------------------------------------
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail

"The real problem is not whether machines think but whether men do." -- B. F. Skinner
******************************************************************

On Thu, Aug 20, 2009 at 12:37 PM, Phil Spector <spec...@stat.berkeley.edu> wrote:

> Mark -
>    It looks like you're running into the greediness of regular expressions.
> When R sees ".*" it tries to find the longest match, which also grabs
> some of the text you want to keep. You can either replace .* with something
> like [^\\])]* (i.e. zero or more of any character *except* "]" or ")" ),
> or use perl=TRUE, which allows the question mark ("?") to mean the shortest
> match instead of the longest. Here's what I'd use:
>
> gsub('[\\[(].*?[\\])]','',myCharVec,perl=TRUE)
>
> In English: substitute the shortest string starting with "[" or "(" and
> ending with "]" or ")" with nothing.
>
> Hope this helps.
>                                        - Phil
>
> On Thu, 20 Aug 2009, Mark Kimpel wrote:
>
>> Well, I guess I'm not quite there yet. What I gave earlier was a simplified
>> example, and did not accurately reflect the complexity of the task.
>>
>> This is my real world example. As you can see, what I need to do is delete
>> an arbitrary number of characters, including the brackets and parens
>> enclosing them, multiple times within the same string. Help?
>>
>> myCharVec <- "medicare [link 220.30.05] ssa (1-800-772-1213). 2008 [link 145.30.05] amounts (2d) gross income (magi) here. (2e)"
>> myCharVec
>> myCharVec <- gsub('\\[.*\\]', '', myCharVec)
>> myCharVec
>> myCharVec <- gsub('\\(.*\\)', '', myCharVec)
>> myCharVec
>>
>> #what I want
>> # "medicare ssa . 2008 amounts gross income here."
>>
>> > myCharVec <- "medicare [link 220.30.05] ssa (1-800-772-1213). 2008 [link 145.30.05] amounts (2d) gross income (magi) here. (2e)"
>> > myCharVec
>> [1] "medicare [link 220.30.05] ssa (1-800-772-1213). 2008 [link 145.30.05] amounts (2d) gross income (magi) here. (2e)"
>> > myCharVec <- gsub('\\[.*\\]', '', myCharVec)
>> > myCharVec
>> [1] "medicare amounts (2d) gross income (magi) here. (2e)"
>> > myCharVec <- gsub('\\(.*\\)', '', myCharVec)
>> > myCharVec
>> [1] "medicare amounts "
>> >
>> > #what I want
>> > # "medicare ssa . 2008 amounts gross income here."
>>
>> On Thu, Aug 20, 2009 at 11:39 AM, William Dunlap <wdun...@tibco.com> wrote:
>>
>>>> -----Original Message-----
>>>> From: r-help-boun...@r-project.org
>>>> [mailto:r-help-boun...@r-project.org] On Behalf Of Mark Kimpel
>>>> Sent: Thursday, August 20, 2009 8:31 AM
>>>> To: r-help@r-project.org
>>>> Subject: [R] help with regular expressions in R
>>>> ...
>>>> myCharVec <- c("[the rain in spain]", "(the rain in spain)")
>>>> gsub('\\[*.\\]', '', myCharVec)
>>>
>>> Change the '*.' to '.*'.
>>>
>>> Your expression matches 0 or more left square brackets,
>>> followed by any 1 character, followed by a right square bracket.
>>>
>>> "\\[.*\\]" matches a left square bracket, followed by 0 or more
>>> characters, followed by a right square bracket.
>>>
>>> Bill Dunlap
>>> TIBCO Software Inc - Spotfire Division
>>> wdunlap tibco.com
>>>
>>>> #what I get
>>>> # [1] "[the rain in spai" "(the rain in spain)"
>>>>
>>>> #what I want
>>>> # [1] "" "(the rain in spain)"
>>>>
>>>> > sessionInfo()
>>>> R version 2.10.0 Under development (unstable) (2009-08-12 r49193)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices datasets utils methods base
>>>>
>>>> other attached packages:
>>>> [1] RWeka_0.3-20 tm_0.4
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] grid_2.10.0 rJava_0.6-3 slam_0.1-3
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
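
For anyone reading this thread in the archive, the two fixes discussed above (Bill's '.*' in place of '*.', and Phil's shortest-match pattern) can be tried side by side. This is only a minimal sketch: the variable names (v, x, wrong, fixed, greedy, lazy, negated, tidy) are made up for illustration, the sample string is written as one line rather than wrapped as in the email, and the final whitespace clean-up is an extra step that nobody in the thread proposed. Both bracket-class patterns are run with perl=TRUE, as in Phil's call.

```r
# Bill's point: '*.' versus '.*'
v <- c("[the rain in spain]", "(the rain in spain)")
wrong <- gsub('\\[*.\\]', '', v)  # zero or more "[", one character, then "]"
fixed <- gsub('\\[.*\\]', '', v)  # "[", zero or more characters, then "]"
print(wrong)  # [1] "[the rain in spai"   "(the rain in spain)"
print(fixed)  # [1] ""                    "(the rain in spain)"

# Mark's real-world string, pasted together so it contains no line break.
x <- paste("medicare [link 220.30.05] ssa (1-800-772-1213). 2008",
           "[link 145.30.05] amounts (2d) gross income (magi) here. (2e)")

# Phil's point: greedy '.*' runs from the first "[" to the LAST "]",
# so everything between the two [link ...] groups is lost as well.
greedy <- gsub('\\[.*\\]', '', x)

# Shortest match, exactly as Phil suggested.
lazy <- gsub('[\\[(].*?[\\])]', '', x, perl = TRUE)

# The other option Phil mentions: a negated character class instead of '.*'.
negated <- gsub('[\\[(][^\\])]*[\\])]', '', x, perl = TRUE)

# Extra step (an assumption, not from the thread): squeeze the runs of
# spaces left behind by the deletions and strip leading/trailing spaces.
tidy <- gsub('^ +| +$', '', gsub('[[:space:]]+', ' ', lazy))
print(tidy)   # [1] "medicare ssa . 2008 amounts gross income here."
```

Note that both patterns treat "[" and "(" as interchangeable openers, so a group opened with "[" and closed with ")" would also be removed. That does not arise in this string, but it may matter for other data.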