Thanks, now it works great. I modified it a bit so that sentences are also split at other end-of-sentence punctuation (.?!:), such as question marks.
strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *")

e.g.

> strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *")
[[1]]
[1] "One January 1. I saw Rick?"      "He was born in the 19. century."

-------------------------------
Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwa...@me.com]
Sent: Tuesday, June 9, 2009 14:17
To: Mark Heckmann
Cc: r-help@r-project.org; 'Gabor Grothendieck'; waclaw.marcin.kusnierc...@idi.ntnu.no
Subject: Re: AW: [R] using regular expressions to retrieve a digit-digit-dot structure from a string

On Jun 9, 2009, at 6:44 AM, Mark Heckmann wrote:

> Hey all,
>
> Thanks for your help. Your answers solved the problem I posted, and that is
> just when I noticed that I had misspecified the problem ;)
> My problem is to separate German text into sentences. Unfortunately, I
> haven't found an R package that does this kind of text separation for
> German, so I am trying it "manually".
>
> Just using the dot as a separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>
> Here I want the algorithm to split the string only at the positions where
> the dot is not preceded by a digit. The R snippets posted pick out "1." and
> "19."
>
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>> gregexpr('(?<=[0-9])[.]',txt, perl=T)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
>
> But I just need it the other way round. So I tried:
>
>> strsplit(txt, "[[:alpha:]]\\." , perl=T)
> [[1]]
> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
>
> But this erases the last letter from each sentence. Does someone know a
> solution?
>
> TIA
> Mark

<snip>

This is one of those rare? times where it might be nice for strsplit() to have an option to retain the split regex at the end of each parsed segment, rather than removing it.
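For comparison, here is a minimal sketch of one way to keep the punctuation without a marker character. It is not one of the solutions posted in this thread. With perl = TRUE, a lookbehind requires the letter and punctuation to precede the match, while strsplit() removes only the spaces that follow. The function name split_keep_punct is hypothetical, and the pattern assumes that sentences are separated by at least one space.

```r
# Hypothetical helper (not from the thread): split after a letter followed by
# sentence punctuation, consuming only the spaces that come after it.
split_keep_punct <- function(txt) {
  # (?<=...) is a PCRE lookbehind: it must match just before the spaces,
  # but it is not part of the text that strsplit() removes.
  strsplit(txt, "(?<=[[:alpha:]][.?!:]) +", perl = TRUE)[[1]]
}

txt <- "One January 1. I saw Rick. He was born in the 19. century."
res_lookbehind <- split_keep_punct(txt)
print(res_lookbehind)
```

Because the dots after "1" and "19" are preceded by digits rather than letters, the spaces after them are not split points.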
There may be a better way, but to avoid a loop over vector indices and to stay with R functions that use .Internal() for speed, you may be able to use something like this:

> strsplit(gsub("([[:alpha:]]\\.)", "\\1*", txt), "\\* *")
[[1]]
[1] "One January 1. I saw Rick."      "He was born in the 19. century."

What I am essentially doing is adding an "*" to the end of each sentence (you can use other characters) so that strsplit() can split on that character without affecting the rest of the sentence. So as an intermediate result, you get:

> gsub("([[:alpha:]]\\.)", "\\1*", txt)
[1] "One January 1. I saw Rick.* He was born in the 19. century.*"

which then makes the strsplit() parsing a bit easier. Since both strsplit() and gsub() use .Internal()s, hopefully this would still be reasonably fast.

Note that I have strsplit() split on the "*" followed by zero or more " ", which is required for mid-line splits.

HTH,

Marc Schwartz
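The marker approach described in this thread can be wrapped in a small function, roughly as sketched below. The function name split_sentences, the marker argument, and the check that the marker does not already occur in the text are all assumptions added for illustration. The sketch also assumes a single input string and a marker without backslashes. It uses the extended punctuation set (.?!:) and splits on the marker with fixed = TRUE, then trims leading spaces, so the marker never has to be escaped as a regex.

```r
# Sketch of the marker approach from this thread, wrapped in a function.
# split_sentences and its marker argument are hypothetical names.
split_sentences <- function(txt, marker = "*") {
  # The marker must not already occur in the text, or extra splits appear.
  stopifnot(!grepl(marker, txt, fixed = TRUE))
  # Step 1: append the marker after a letter followed by . ? ! or :
  marked <- gsub("([[:alpha:]][.?!:])", paste0("\\1", marker), txt)
  # Step 2: split on the literal marker (no regex interpretation)
  pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
  # Step 3: drop the spaces left at the start of mid-line segments
  sub("^ +", "", pieces)
}

txt2 <- "One January 1. I saw Rick? He was born in the 19. century!"
res_marker <- split_sentences(txt2)
print(res_marker)
```

If "*" can occur in the real text, a different marker can be passed in, as long as it does not appear in the input.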