> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Mark Heckmann
> Sent: Tuesday, June 09, 2009 4:45 AM
> To: r-help@r-project.org
> Cc: waclaw.marcin.kusnierc...@idi.ntnu.no; marc_schwa...@me.com
> Subject: Re: [R] using regular expressions to retrieve a
> digit-digit-dot structure from a string
>
> Hey all,
>
> Thanks for your help. Your answers solved the problem I posted, and that is
> just when I noticed that I misspecified the problem ;)
> My problem is to separate German texts into sentences. Unfortunately I
> haven't found an R package doing this kind of text separation in German, so
> I try it "manually".
>
> Just using the dot as separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>
> Here I want the algorithm to separate the string only at the positions where
> the dot is not preceded by a digit. The R snippets posted pick out "1." and
> "19."
>
> > txt <- "One January 1. I saw Rick. He was born in the 19. century."
> > gregexpr('(?<=[0-9])[.]',txt, perl=T)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
>
> But I just need it the other way round. So I tried:
>
> > strsplit(txt, "[[:alpha:]]\\." , perl=T)
> [[1]]
> [1] "One January 1. I saw Ric" " He was born in the 19. centur"
>
> But this erases the last letter from each sentence. Does someone know a
> solution?
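For R itself, one possible approach to the question quoted above is a fixed-width lookbehind, so that the split consumes only the whitespace and leaves the final letter and the dot in place. This is a minimal sketch, not something proposed earlier in the thread; it assumes that sentences are separated by whitespace and that perl = TRUE is acceptable. The names sentences and dot_pos are just illustrative.

```r
# Sample text from the question.
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Split on a run of whitespace only when the two characters before it are
# "<non-digit>.". The lookbehind is exactly two characters wide, which PCRE
# requires (it does not accept variable-width lookbehinds).
sentences <- strsplit(txt, "(?<=[^[:digit:]]\\.)[[:space:]]+", perl = TRUE)[[1]]
# sentences has two elements:
#   "One January 1. I saw Rick."
#   "He was born in the 19. century."

# The "other way round" of the posted gregexpr call: positions of the dots
# that are NOT immediately preceded by a digit (negative lookbehind).
dot_pos <- as.integer(gregexpr("(?<![0-9])[.]", txt, perl = TRUE)[[1]])
# dot_pos is c(26, 58)
```

Note that abbreviations ending in a letter plus a dot (for example "z.B.") would still be treated as sentence ends by this pattern, so it only addresses the digit-dot case described in the question.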
In S+, strsplit() has an argument called subpattern that lets you specify
which parenthesized part of the regular expression to use as the split
point. It is akin to the \\<digit> used in the replacement argument of
sub and gsub. E.g., to split the string at the sequence of spaces after a
period, but not after a period preceded by a digit, do:

   > txt <- "One January 1. I saw Rick. He was born in the 19. century."
   > strsplit(txt, "[^[:digit:]]\\.([[:space:]]+)", subpattern=1)
   [[1]]:
   [1] "One January 1. I saw Rick."      "He was born in the 19. century."

subpattern=0, the default, means the text matched by the entire regular
expression. regexpr has the same argument.

Would such an argument solve your problem?

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

> TIA
> Mark
>
> -------------------------------
>
> Mark Heckmann
> + 49 (0) 421 - 1614618
> www.markheckmann.de
> R-Blog: http://ryouready.wordpress.com
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
> Sent: Tuesday, 9 June 2009 12:48
> To: Wacek Kusnierczyk
> Cc: Mark Heckmann; r-help@r-project.org
> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot
> structure from a string
>
> On Tue, Jun 9, 2009 at 3:04 AM, Wacek
> Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
> > Gabor Grothendieck wrote:
> >> On Mon, Jun 8, 2009 at 7:18 PM, Wacek
> >> Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
> >>> Gabor Grothendieck wrote:
> >>>> Try this. See ?regex for more.
> >>>>
> >>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
> >>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
> >>>> [1] 24
> >>>> attr(,"match.length")
> >>>> [1] 1
> >>>
> >>> yes, but
> >>>
> >>> gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >>> # 2 5 9
> >>
> >> Yes, it should be:
> >>
> >>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
> >> [[1]]
> >> [1] 5 9
> >> attr(,"match.length")
> >> [1] 1 1
> >>
> >> which displays the position of every dot that is preceded
> >> immediately by a digit. Or just replace gregexpr with regexpr
> >> if it's intended that it match only one.
> >
> > i guess what was needed was something like
> >
> > gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> > # 5
> >
> > which won't work, however, because pcre does not support variable-width
> > lookbehinds.
>
> No, what I wrote was what I intended. I don't think we are discussing the
> answer at this point but just the interpretation of what was intended. You
> are including the word boundary in the question and I am not. I think it's
> also possible that regexpr is what is wanted, not gregexpr, but at this
> point I think the poster has enough answers that he can complete it himself
> by considering what he wants and using one of ours or a suitable
> modification.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
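On the variable-width lookbehind limitation mentioned in the quoted thread: one way around it, sketched here under the assumption that only the position of the dot is wanted, is to match the whole "word-boundary, digits, dot" sequence without any lookbehind and derive the dot position from the match start and match length. The variable names are illustrative only.

```r
x <- "a. 1. a1."

# Match a run of digits that starts at a word boundary, followed by a dot.
# No lookbehind is involved, so the variable-width restriction does not apply.
m <- gregexpr("\\b[0-9]+[.]", x, perl = TRUE)[[1]]

# The dot is the last character of each match; gregexpr returns -1 when
# nothing matched, so guard against that case.
dot_pos <- if (m[1] == -1L) {
  integer(0)
} else {
  as.integer(m + attr(m, "match.length") - 1L)
}
# dot_pos is 5: the dot in "1." qualifies, the one in "a1." does not.
```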