Yes, I already apologized to Wacek for missing that and pointing out what he had already said.
Given everything in this thread (though it is hard to keep track of all of it, since my e-mail client does not keep all the parts of the thread together), this is probably one of those few tasks for which R is not the best tool. There is a Perl module called Lingua::DE::Sentence, described as "Perl extension for tokenizing german texts into their sentences", which seems to be exactly what the original poster was looking for. So the best option may be to use Perl and that module to preprocess his texts, then use R for the later steps.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Gabor Grothendieck
> Sent: Tuesday, June 09, 2009 11:27 AM
> To: Greg Snow
> Cc: Wacek Kusnierczyk; r-help@r-project.org; Mark Heckmann
> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string
>
> Wacek already mentioned that; however, it's still
> arguably more complex to specify delimiters
> than to specify content. Aside from having
> to specify perl = TRUE and ungreedy matching,
> the content-based regexp is entirely straightforward,
> but for lookbehind (including \K) one has the added
> complexity of distinguishing between matching and returned
> values.
>
> On Tue, Jun 9, 2009 at 12:36 PM, Greg Snow<greg.s...@imail.org> wrote:
> > You can sometimes fake variable-width lookbehinds with Perl regexes using '\K':
> >
> >> gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1. 11.', perl=TRUE)
> > [[1]]
> > [1] 5 13
> > attr(,"match.length")
> > [1] 1 1
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Wacek Kusnierczyk
> >> Sent: Tuesday, June 09, 2009 1:05 AM
> >> To: Gabor Grothendieck
> >> Cc: r-help@r-project.org; Mark Heckmann
> >> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string
> >>
> >> Gabor Grothendieck wrote:
> >> > On Mon, Jun 8, 2009 at 7:18 PM, Wacek Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
> >> >
> >> >> Gabor Grothendieck wrote:
> >> >>
> >> >>> Try this. See ?regex for more.
> >> >>>
> >> >>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
> >> >>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
> >> >>>>
> >> >>> [1] 24
> >> >>> attr(,"match.length")
> >> >>> [1] 1
> >> >>>
> >> >> yes, but
> >> >>
> >> >> gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >> >> # 2 5 9
> >> >>
> >> >
> >> > Yes, it should be:
> >> >
> >> >> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
> >> >>
> >> > [[1]]
> >> > [1] 5 9
> >> > attr(,"match.length")
> >> > [1] 1 1
> >> >
> >> > which displays the position of every dot that is preceded
> >> > immediately by a digit. Or just replace gregexpr with regexpr
> >> > if it's intended that it match only one.
> >> >
> >>
> >> i guess what was needed was something like
> >>
> >> gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >> # 5
> >>
> >> which won't work, however, because pcre does not support variable-width
> >> lookbehinds.
> >>
> >> >
> >> >> which, i guess, is not what you want. if what you want is to match all
> >> >> and only dots that follow at least one digit preceded by a word
> >> >> boundary, then the following should do, as far as i can see:
> >> >>
> >> >> gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1.', perl=TRUE)
> >> >> # 5
> >> >>
> >> >> vQ
> >> >>
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
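
For anyone reading this thread later, the difference between the two patterns discussed above can be checked side by side. This is only a minimal sketch: the helper name dot_positions is hypothetical and introduced purely for illustration, while the patterns and the test string are the ones quoted in the thread.

```r
# Hypothetical helper (not from the thread): start positions of all matches
# of a Perl-style pattern in a single string. gregexpr() returns -1 when
# there is no match, and that value is passed through unchanged.
dot_positions <- function(pattern, text) {
  # as.integer() drops the match.length attribute, leaving bare positions
  as.integer(gregexpr(pattern, text, perl = TRUE)[[1]])
}

x <- 'a. 1. a1. 11.'

# Fixed-width lookbehind: every dot immediately preceded by a digit,
# which also picks up the dot in "a1."
after_digit <- dot_positions('(?<=[0-9])[.]', x)

# \K drops everything matched so far from the reported match, so only the
# dot is returned, but \b[0-9]+ must still match first. This keeps only
# dots that follow a whole number.
after_number <- dot_positions('\\b[0-9]+\\K[.]', x)

print(after_digit)   # 5 9 13
print(after_number)  # 5 13, matching the output quoted in the thread
```

The \K form reports positions of the dots only, so the result can be used directly wherever a fixed-width lookbehind result would have been used.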