Re: Using Regex fragmenter to extract paragraphs

Erick Erickson Sun, 14 Dec 2008 07:35:21 -0800

Shouldn't you escape the question mark at the end too?
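
A quick way to check whether the escape matters is to compare the two character classes directly. This is only a sketch using Python's re module as a stand-in; Solr uses Java regular expressions, and I am assuming the two agree here, since '.', '!' and '?' are ordinarily literal inside a character class. The names unescaped, escaped and samples are made up for the illustration.

```python
import re

# Inside a character class, '.', '!' and '?' are treated as literal
# characters, so escaping them should not change what the class matches.
unescaped = re.compile(r"[.!?]")
escaped = re.compile(r"[\.!\?]")

# A few punctuation and non-punctuation characters to compare against.
samples = [".", "!", "?", "a", " ", ","]

for ch in samples:
    # Print whether each class accepts the character; the two columns
    # are expected to agree for every sample.
    print(repr(ch), bool(unescaped.fullmatch(ch)), bool(escaped.fullmatch(ch)))
```

If the two columns agree for every sample, the missing escape is probably not the cause of the bad fragments.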

On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson <[email protected]> wrote:


> Someone helped me with the regex and pointed out a couple mistakes, most
> notably the extra quantifier in .*{400,600}. My new regex is this:
>
> \w.{400,600}[\.!?]
>
> Unfortunately, my results still aren't any better. Some results start with a
> word character, some don't, and none seem to end with punctuation. Any ideas
> what else could be wrong?
>
> Mark
>
>
>
> On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <[email protected]> wrote:
>
> > Hello,
> >
> > I am trying to use the regex fragmenter and am having a hard time getting
> > the results I want. I am trying to get fragments that start on a word
> > character and end on punctuation, but for some reason the fragments being
> > returned to me seem to be very inflexible, even though I've provided a
> > large slop. Here are the relevant parameters I'm using; maybe someone can
> > help point out where I've gone wrong:
> >
> > <str name="hl.fragsize">500</str>
> > <str name="hl.fragmenter">regex</str>
> > <str name="hl.regex.slop">0.8</str>
> > <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> > <str name="hl">true</str>
> > <str name="q">chinese</str>
> >
> > This should be matching between 400-600 characters, beginning with a word
> > character and ending with one of .!?. Here is an example of a typical
> > result:
> >
> > . Check these pictures out. Nine panda cubs on display for the first time
> > Thursday in southwest China. They're less than a year old. They just
> > recently stopped nursing. There are only 1,600 of these guys left in the
> > mountain forests of central China, another 120 in <span
> > class='hl'>Chinese</span> breeding facilities and zoos. And they're about 20
> > that live outside China in zoos. They exist almost entirely on bamboo. They
> > can live to be 30 years old. And these little guys will eventually get much
> > bigger. They'll grow
> >
> > As you can see, it is starting with a period and ending on a word
> > character! It's almost as if the fragments are just coming out as they will
> > and the regex isn't doing anything at all, but the results are different
> > when I use the gap fragmenter. In the above result I don't see any reason
> > why it shouldn't have stripped out the preceding period and the last two
> > words; there is plenty of room in the slop and in the regex pattern. Please
> > help me figure out what I'm doing wrong...
> >
> > Thanks a lot,
> >
> > Mark Ferguson
> >
>
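
To separate problems in the pattern from problems in how the fragmenter applies it, it may help to run the corrected pattern against the sample text on its own. The sketch below does that with Python's re module, which I am assuming behaves like Java's regex engine for these constructs. The text is the fragment quoted above with the highlight markup removed. This checks the pattern in isolation only and says nothing about how Solr combines it with hl.fragsize and hl.regex.slop.

```python
import re

# Corrected pattern from the thread: a word character, then 400-600 of
# any character, then sentence-ending punctuation.
pattern = re.compile(r"\w.{400,600}[.!?]")

# Sample fragment from the original post, highlight markup removed.
text = (
    ". Check these pictures out. Nine panda cubs on display for the first "
    "time Thursday in southwest China. They're less than a year old. They "
    "just recently stopped nursing. There are only 1,600 of these guys left "
    "in the mountain forests of central China, another 120 in Chinese "
    "breeding facilities and zoos. And they're about 20 that live outside "
    "China in zoos. They exist almost entirely on bamboo. They can live to "
    "be 30 years old. And these little guys will eventually get much "
    "bigger. They'll grow"
)

# search() finds the first position where the whole pattern can match.
match = pattern.search(text)
if match:
    fragment = match.group(0)
    print("length:", len(fragment))
    print("starts:", fragment[:30])
    print("ends:  ", fragment[-30:])
else:
    print("no match")
```

If the pattern matches here with a word character at the start and punctuation at the end, the pattern itself is likely fine and the remaining suspects are the fragmenter settings.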
