Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string

Mark Heckmann Tue, 09 Jun 2009 08:08:02 -0700

Hey all,

Thanks for your help. Your answers solved the problem I posted, and that is
just when I noticed that I had misspecified the problem ;)
My actual problem is to split German texts into sentences. Unfortunately I
haven't found an R package that does this kind of text separation for German,
so I am trying to do it "manually".

Just using the dot as a separator fails in cases like:
txt <- "One January 1. I saw Rick. He was born in the 19. century."

Here I want the algorithm to split the string only at the positions where
the dot is not preceded by a digit. The R snippets posted pick out the dots
in "1." and "19.":

> txt <- "One January 1. I saw Rick. He was born in the 19. century."
> gregexpr('(?<=[0-9])[.]', txt, perl=TRUE)
[[1]]
[1] 14 49
attr(,"match.length")
[1] 1 1

But I just need it the other way round. So I tried:

> strsplit(txt, "[[:alpha:]]\\.", perl=TRUE)
[[1]]
[1] "One January 1. I saw Ric"       " He was born in the 19. centur"

But this removes the last letter of each sentence, since the letter is part
of the matched separator. Does anyone know a solution?

TIA
Mark
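
One possible approach to the splitting question above, offered as a minimal
sketch rather than a definitive answer: move the condition on the preceding
character into a lookbehind, so that strsplit() consumes only the separator
and not the letter before it. The variable names s1 and s2 and both patterns
are illustrative assumptions; real German text contains other dotted
abbreviations (for example "z.B." or "Dr.") that this does not handle.

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Variant 1: split at a dot that is NOT preceded by a digit (negative
# lookbehind), also consuming any whitespace after the dot.
# The sentence-final dot itself is dropped.
s1 <- strsplit(txt, "(?<![0-9])\\.\\s*", perl = TRUE)[[1]]
# "One January 1. I saw Rick"  "He was born in the 19. century"

# Variant 2: split at the whitespace that follows "<non-digit>." instead,
# so the sentence-final dot is kept with its sentence.
s2 <- strsplit(txt, "(?<=[^0-9]\\.)\\s+", perl = TRUE)[[1]]
# "One January 1. I saw Rick."  "He was born in the 19. century."
```

Both lookbehinds have a fixed width (one and two characters), which is what
PCRE requires. Variant 2 is probably closer to what is wanted if the
punctuation should survive the split.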

-------------------------------

Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Tuesday, June 9, 2009 12:48
To: Wacek Kusnierczyk
Cc: Mark Heckmann; r-help@r-project.org
Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot
structure from a string

On Tue, Jun 9, 2009 at 3:04 AM, Wacek
Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> On Mon, Jun 8, 2009 at 7:18 PM, Wacek
>> Kusnierczyk<waclaw.marcin.kusnierc...@idi.ntnu.no> wrote:
>>
>>> Gabor Grothendieck wrote:
>>>
>>>> Try this.  See ?regex for more.
>>>>
>>>>
>>>>
>>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>>>>>
>>>>>
>>>> [1] 24
>>>> attr(,"match.length")
>>>> [1] 1
>>>>
>>>>
>>> yes, but
>>>
>>>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>>>    # 2 5 9
>>>
>>
>> Yes, it should be:
>>
>>
>>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>>>
>> [[1]]
>> [1] 5 9
>> attr(,"match.length")
>> [1] 1 1
>>
>> which displays the position of every dot that is preceded
>> immediately by a digit.  Or just replace gregexpr with regexpr
>> if it's intended that it match only one.
>>
>
> i guess what was needed was something like
>
>    gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>    # 5
>
> which won't work, however, because pcre does not support variable-width
> lookbehinds.

No, what I wrote was what I intended. I don't think we are discussing the
answer at this point, just the interpretation of what was intended. You are
including the word boundary in the question and I am not. I think it's also
possible that regexpr is what is wanted, not gregexpr, but at this point I
think the poster has enough answers that he can complete it himself by
considering what he wants and using one of ours or a suitable modification.
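
To make the interpretations discussed in this thread concrete, here is a
small sketch run on the same test string 'a. 1. a1.'. The names la, lb, m
and dot_pos are hypothetical. Because PCRE rejects the variable-width
lookbehind, the word-boundary variant below is only one possible workaround:
it matches the whole number plus dot and derives the dot position from the
match start and length.

```r
x <- 'a. 1. a1.'

# Negative lookahead: asserts that the NEXT character is not a digit.
# That character is the dot itself, so every dot matches.
la <- as.integer(gregexpr('(?![0-9]+)[.]', x, perl = TRUE)[[1]])   # 2 5 9

# Lookbehind: dots immediately preceded by a digit.
lb <- as.integer(gregexpr('(?<=[0-9])[.]', x, perl = TRUE)[[1]])   # 5 9

# Word-boundary interpretation without a lookbehind: match "<number>."
# as a whole, then take the last position of each match as the dot.
m <- gregexpr('\\b[0-9]+[.]', x, perl = TRUE)[[1]]
dot_pos <- as.integer(m + attr(m, "match.length") - 1L)            # 5
```

The "1." in "a1." is not picked up by the last pattern because there is no
word boundary between "a" and "1".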

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
