Hey all,

Thanks for your help. Your answers solved the problem I posted, and that is exactly when I noticed that I had misspecified the problem ;) What I actually need is to split German texts into sentences. Unfortunately I haven't found an R package that does this kind of sentence splitting for German, so I am trying it "manually".
Just using the dot as separator fails in cases like:

txt <- "One January 1. I saw Rick. He was born in the 19. century."

Here I want the algorithm to split the string only at the positions where the dot is not preceded by a digit. The R snippets posted pick out "1." and "19.":

txt <- "One January 1. I saw Rick. He was born in the 19. century."
> gregexpr('(?<=[0-9])[.]',txt, perl=T)
[[1]]
[1] 14 49
attr(,"match.length")
[1] 1 1

But I need it the other way round. So I tried:

> strsplit(txt, "[[:alpha:]]\\." , perl=T)
[[1]]
[1] "One January 1. I saw Ric" " He was born in the 19. centur"

But this removes the last letter from each sentence, because the letter is part of the match. Does someone know a solution?

TIA
Mark

-------------------------------
Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Tuesday, June 9, 2009 12:48
To: Wacek Kusnierczyk
Cc: Mark Heckmann; r-help@r-project.org
Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string

On Tue, Jun 9, 2009 at 3:04 AM, Wacek Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> On Mon, Jun 8, 2009 at 7:18 PM, Wacek
>> Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
>>
>>> Gabor Grothendieck wrote:
>>>
>>>> Try this. See ?regex for more.
>>>>
>>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>>>>>
>>>> [1] 24
>>>> attr(,"match.length")
>>>> [1] 1
>>>>
>>> yes, but
>>>
>>> gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>>> # 2 5 9
>>>
>>
>> Yes, it should be:
>>
>>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>>>
>> [[1]]
>> [1] 5 9
>> attr(,"match.length")
>> [1] 1 1
>>
>> which displays the position of every dot that is preceded
>> immediately by a digit.
>> Or just replace gregexpr with regexpr
>> if it's intended that it match only one.
>>
>
> i guess what was needed was something like
>
> gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> # 5
>
> which won't work, however, because pcre does not support variable-width
> lookbehinds.

No, what I wrote was what I intended. I don't think we are discussing the answer at this point but just the interpretation of what was intended. You are including the word boundary in the question and I am not. I think it's also possible that regexpr is what is wanted, not gregexpr, but at this point I think the poster has enough answers that he can complete it himself by considering what he wants and using one of ours or a suitable modification.
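
For the sentence-splitting question at the top of this message, a negative lookbehind may already be enough. The following is only a minimal sketch against the example string from the post, not a general sentence splitter for German; the names drop_dot and keep_dot are made up for illustration. Both lookbehinds have a fixed width, so the PCRE restriction on variable-width lookbehinds mentioned above should not apply.

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Variant 1: split at every dot that is NOT preceded by a digit.
# The dot and any whitespace after it are part of the match,
# so they are removed from the resulting pieces.
drop_dot <- strsplit(txt, "(?<![0-9])\\.\\s*", perl = TRUE)[[1]]

# Variant 2: split at the whitespace that follows "non-digit + dot".
# The lookbehind is exactly two characters wide and consumes nothing,
# so each sentence keeps its final dot.
keep_dot <- strsplit(txt, "(?<=[^0-9]\\.)\\s+", perl = TRUE)[[1]]

print(drop_dot)
print(keep_dot)
```

Neither variant loses the last letter, because the letter is only looked at and never matched. Two limitations worth keeping in mind: a sentence that really ends in a number ("... in 2009.") will not be split, and abbreviations written with dots will be split even though they do not end a sentence.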