Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string

Gabor Grothendieck Tue, 09 Jun 2009 11:27:52 -0700

This is just to you.

You might want to read r-help through gmane.org or one of the other
online readers or get a gmail or other online email account just
for reading newsgroups to circumvent your thread-challenged
email client.


Regards.


On Tue, Jun 9, 2009 at 1:42 PM, Greg Snow<greg.s...@imail.org> wrote:
> Yes, I already apologized to Wacek for missing that and pointing out what he 
> had already said.
>
> Given everything in this thread (though it is hard to keep track of all of 
> it, my e-mail client does not keep all the parts of the thread together), 
> this is probably one of those few tasks that R is not the best tool for.  
> There is a Perl module called Lingua::DE::Sentence with the description: 
> "Perl extension for tokenizing german texts into their sentences" which seems 
> to be exactly what the original poster was looking for.  So the best option 
> may be to use Perl and the above module to preprocess his texts, then use R 
> for later steps.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>> On Behalf Of Gabor Grothendieck
>> Sent: Tuesday, June 09, 2009 11:27 AM
>> To: Greg Snow
>> Cc: Wacek Kusnierczyk; r-help@r-project.org; Mark Heckmann
>> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot
>> structure from a string
>>
>> Wacek already mentioned that; however, it's still
>> arguably more complex to specify delimiters
>> than to specify content.  Aside from having
>> to specify perl = TRUE and ungreedy matching,
>> the content-based regexp is entirely straightforward,
>> but for lookbehind (including \K) one has the added
>> complexity of distinguishing between matched and returned
>> values.
>>
>> On Tue, Jun 9, 2009 at 12:36 PM, Greg Snow<greg.s...@imail.org> wrote:
>> > You can sometimes fake variable-width lookbehinds with Perl regexes using '\K':
>> >
>> >> gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1. 11.', perl=TRUE)
>> > [[1]]
>> > [1]  5 13
>> > attr(,"match.length")
>> > [1] 1 1
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>> >> On Behalf Of Wacek Kusnierczyk
>> >> Sent: Tuesday, June 09, 2009 1:05 AM
>> >> To: Gabor Grothendieck
>> >> Cc: r-help@r-project.org; Mark Heckmann
>> >> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot
>> >> structure from a string
>> >>
>> >> Gabor Grothendieck wrote:
>> >> > On Mon, Jun 8, 2009 at 7:18 PM, Wacek
>> >> > Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
>> >> >
>> >> >> Gabor Grothendieck wrote:
>> >> >>
>> >> >>> Try this.  See ?regex for more.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>> >> >>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>> >> >>>>
>> >> >>>>
>> >> >>> [1] 24
>> >> >>> attr(,"match.length")
>> >> >>> [1] 1
>> >> >>>
>> >> >>>
>> >> >> yes, but
>> >> >>
>> >> >>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>    # 2 5 9
>> >> >>
>> >> >
>> >> > Yes, it should be:
>> >> >
>> >> >
>> >> >> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>
>> >> > [[1]]
>> >> > [1] 5 9
>> >> > attr(,"match.length")
>> >> > [1] 1 1
>> >> >
>> >> > which displays the position of every dot that is immediately
>> >> > preceded by a digit.  Or just replace gregexpr with regexpr
>> >> > if it's intended to match only the first one.
>> >> >
>> >>
>> >> i guess what was needed was something like
>> >>
>> >>     gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>> >>     # 5
>> >>
>> >> which won't work, however, because pcre does not support
>> >> variable-width lookbehinds.
>> >>
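A minimal sketch of the restriction described above, under stated assumptions. PCRE requires each lookbehind to have a fixed width, but as far as the PCRE documentation goes, separate top-level alternatives inside one lookbehind may each have a different fixed width. The bounded pattern below is an illustration that is not from the thread: it only handles numbers of one or two digits, and the test string is the one used elsewhere in this thread.

```r
x <- 'a. 1. a1. 11.'

# Unbounded lookbehind: expected to be rejected at pattern compilation,
# so the failure is trapped instead of stopping the script.
unbounded <- tryCatch(
  suppressWarnings(gregexpr('(?<=\\b[0-9]+)[.]', x, perl = TRUE)),
  error = function(e) NULL
)
is.null(unbounded)  # TRUE if the pattern failed to compile

# Bounded rewrite (assumption: at most two digits before the dot).
# Each alternative inside the lookbehind has its own fixed width.
bounded <- gregexpr('(?<=\\b[0-9]|\\b[0-9]{2})[.]', x, perl = TRUE)[[1]]
pos_bounded <- as.integer(bounded)  # drop the match.length attribute
pos_bounded
# 5 13
```

This only scales to a small, known maximum number of digits; for an arbitrary number of digits the '\K' form from this thread is the simpler choice.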
>> >> >
>> >> >> which, i guess, is not what you want.  if what you want is to match
>> >> >> all and only dots that follow at least one digit preceded by a word
>> >> >> boundary, then the following should do, as far as i can see:
>> >> >>
>> >> >>    gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>    # 5
>> >> >>
>> >> >> vQ
>> >> >>
>> >>
>> >> ______________________________________________
>> >> R-help@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>
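The distinction between what a pattern matches and what it returns, raised earlier in the thread for '\K', can be made visible with regmatches(). The following is only a sketch: the variable names are illustrative, and the test string is the one Greg used above.

```r
x <- 'a. 1. a1. 11.'

# Without \K the leading digits are part of the returned match.
with_digits <- regmatches(x, gregexpr('\\b[0-9]+[.]', x, perl = TRUE))[[1]]

# With \K the digits must still be matched, but the reported match
# starts after them, so only the dot is returned.
m <- gregexpr('\\b[0-9]+\\K[.]', x, perl = TRUE)
dots_only <- regmatches(x, m)[[1]]
dot_pos <- as.integer(m[[1]])  # positions of the returned dots

with_digits
dots_only
dot_pos
```

Both patterns select the same two places in the string ('1.' and '11.'); they differ only in how much of each match is reported back.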

