Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string

Mark Heckmann Tue, 09 Jun 2009 08:06:08 -0700

Thanks,

Now it works great. I modified it a bit so that sentences are also split
at question marks, exclamation marks and colons (.?!:).

strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]

e.g.

> strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]
[1] "One January 1. I saw Rick?"      "He was born in the 19. century."
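
For reference, here is a minimal, self-contained sketch of the same idea. The sample text is an assumption (the first sentence is made to end in a question mark so that it matches the output above). Inside a bracket expression the characters . ? ! : are literal, so the backslashes can be dropped:

```r
# Hypothetical sample text; the first sentence ends in a question mark.
txt <- "One January 1. I saw Rick? He was born in the 19. century."

# Append a "*" after every letter that is followed by one of . ? ! :
# A digit before the dot (as in "1." or "19.") does not match [[:alpha:]].
marked <- gsub("([[:alpha:]][.?!:])", "\\1*", txt)

# Split on the "*" marker plus any spaces that follow it.
sentences <- strsplit(marked, "\\* *")[[1]]
print(sentences)
```

The trailing marker after the last sentence does not produce an empty element, because strsplit() drops an empty string at the end of the input.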

-------------------------------

Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com

-----Original Message-----
From: Marc Schwartz [mailto:[email protected]] 
Sent: Tuesday, 9 June 2009 14:17
To: Mark Heckmann
Cc: [email protected]; 'Gabor Grothendieck';
[email protected]
Subject: Re: AW: [R] using regular expressions to retrieve a digit-digit-dot
structure from a string

On Jun 9, 2009, at 6:44 AM, Mark Heckmann wrote:

> Hey all,
>
> Thanks for your help. Your answers solved the problem I posted, and that
> is just when I noticed that I had misspecified the problem ;)
> My problem is to split German texts into sentences. Unfortunately, I
> haven't found an R package that does this kind of text splitting for
> German, so I am trying it "manually".
>
> Just using the dot as separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>
> Here I want the algorithm to split the string only at the positions
> where the dot is not preceded by a digit. The R snippets posted pick out
> "1." and "19."
>
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>> gregexpr('(?<=[0-9])[.]',txt, perl=T)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
>
> But I just need it the other way round. So I tried:
>
>> strsplit(txt, "[[:alpha:]]\\." , perl=T)
> [[1]]
> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
>
> But this erases the last letter from each sentence. Does someone know a
> solution?
>
> TIA
> Mark

<snip>

This is one of those rare? times where it might be nice for strsplit()  
to have an option to retain the split regex at the end of each parsed  
segment, rather than removing it.
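
One way to approximate that behaviour, sketched here under the assumption that Perl-compatible regexes (perl = TRUE) are acceptable and that sentences are separated by at least one space, is a lookbehind. The lookbehind only asserts that a letter and a dot precede the split point and does not consume them, so they stay in the sentence:

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Split on the run of spaces that follows "letter + dot".
# (?<=...) is a lookbehind: it checks the preceding context without
# making it part of the separator, so "Rick." keeps its "k.".
sentences <- strsplit(txt, "(?<=[[:alpha:]][.]) +", perl = TRUE)[[1]]
print(sentences)
```

Unlike a marker-based approach, this sketch will not split where a sentence ends and the next one starts without a space in between.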

There may be a better way, but to avoid a loop over vector indices and
to stay with R functions that use .Internal() for speed, you may be
able to use something like this:

 > strsplit(gsub("([[:alpha:]]\\.)", "\\1*", txt), "\\* *")
[[1]]
[1] "One January 1. I saw Rick."      "He was born in the 19. century."

What I am essentially doing is to add an "*" to the ending of each  
sentence (you can use other characters) such that strsplit() can split  
on that character without affecting the rest of the sentence.  So as  
an intermediate result, you get:

 > gsub("([[:alpha:]]\\.)", "\\1*", txt)
[1] "One January 1. I saw Rick.* He was born in the 19. century.*"

which then makes the strsplit() parsing a bit easier. Since both
strsplit() and gsub() use .Internal()s, hopefully this would still be
reasonably fast. Note that I have strsplit() split on the "*" followed
by zero or more " ", which is required for mid-line splits.
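
One caveat: if the text itself contains a "*", it will be split there too. A rough sketch of a wrapper that sidesteps this follows. The function name split_sentences() and the marker string "@@SPLIT@@" are made up for illustration and are not from any package. It uses a longer marker that is unlikely to occur in real text and refuses input that already contains it:

```r
# Hypothetical helper: mark sentence ends with an unlikely string, then split.
split_sentences <- function(x, marker = "@@SPLIT@@") {
  # Stop early if the marker already occurs in the input.
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  # Append the marker after "letter + one of . ? ! :".
  marked <- gsub("([[:alpha:]][.?!:])", paste0("\\1", marker), x)
  # Split on the marker plus any spaces that follow it.
  strsplit(marked, paste0(marker, " *"))
}

# Made-up example text containing a literal "*".
txt2 <- "Costs 2 * 3 euros. Paid on the 19. of May? Yes!"
res <- split_sentences(txt2)[[1]]
print(res)
```

The marker must not contain regex metacharacters, since it is pasted directly into the split pattern.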

HTH,

Marc Schwartz

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
