Re: Using Regex fragmenter to extract paragraphs

Mark Ferguson Mon, 15 Dec 2008 15:20:57 -0800

You actually don't need to escape most characters inside a character class,
so escaping the period was unnecessary.
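
A minimal sketch of that point, using Python's re module purely for
illustration (Solr uses Java's regex engine, which treats an unescaped period
inside a character class the same way). The sample strings are made up for
the check:

```python
import re

# Inside a character class, '.' is a literal period, so the escaped and
# unescaped forms should accept exactly the same characters.
escaped = re.compile(r"[\.!?]")
unescaped = re.compile(r"[.!?]")

samples = [".", "!", "?", "a", " "]
results = [
    (s, bool(escaped.fullmatch(s)), bool(unescaped.fullmatch(s)))
    for s in samples
]
for s, e, u in results:
    print(repr(s), e, u)
```

Both columns agree for every sample, so the escape changes nothing here.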


I've tried using the example regex ([-\w ,/\n\"']{20,200}), and I'm _still_
getting lots of highlighted snippets that don't match the regex (starting
with a period, etc.). Has anyone else had any trouble with the default regex
fragmenter? If someone has used it and gotten the expected results, can you
let me know, so I know that the problem is on my end?
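
For reference, here is a rough check of the pattern on its own, outside Solr.
This is only a sketch in Python (not the Java engine Solr uses), and it tests
the regex itself, not how the fragmenter applies it. The sample text is taken
from the fragment quoted further down:

```python
import re

# The example pattern from above; the period is not in the character class.
pattern = re.compile(r"[-\w ,/\n\"']{20,200}")

# Sample text taken from the fragment quoted later in this thread.
text = (". Check these pictures out. Nine panda cubs on display for the first "
        "time Thursday in southwest China. They're less than a year old.")

matches = [m.group(0) for m in pattern.finditer(text)]
for m in matches:
    print(len(m), repr(m))
```

None of the matches can contain a period, so a snippet that starts with one is
not a match of the pattern by itself.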

Thanks for your help,

Mark


On Sun, Dec 14, 2008 at 8:34 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Shouldn't you escape the question mark at the end too?
>
> On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:
>
> > Someone helped me with the regex and pointed out a couple mistakes, most
> > notably the extra quantifier in .*{400,600}. My new regex is this:
> >
> > \w.{400,600}[\.!?]
> >
> > Unfortunately, my results still aren't any better. Some results start with a
> > word character, some don't, and none seem to end with punctuation. Any ideas
> > what else could be wrong?
> >
> > Mark
> >
> >
> >
> > On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am trying to use the regex fragmenter and am having a hard time getting
> > > the results I want. I am trying to get fragments that start on a word
> > > character and end on punctuation, but for some reason the fragments being
> > > returned to me seem to be very inflexible, even though I've provided a
> > > large slop. Here are the relevant parameters I'm using; maybe someone can
> > > help point out where I've gone wrong:
> > >
> > > <str name="hl.fragsize">500</str>
> > > <str name="hl.fragmenter">regex</str>
> > > <str name="hl.regex.slop">0.8</str>
> > > <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> > > <str name="hl">true</str>
> > > <str name="q">chinese</str>
> > >
> > > This should be matching between 400 and 600 characters, beginning with a
> > > word character and ending with one of .!?. Here is an example of a typical
> > > result:
> > >
> > > . Check these pictures out. Nine panda cubs on display for the first time
> > > Thursday in southwest China. They're less than a year old. They just
> > > recently stopped nursing. There are only 1,600 of these guys left in the
> > > mountain forests of central China, another 120 in <span
> > > class='hl'>Chinese</span> breeding facilities and zoos. And they're about 20
> > > that live outside China in zoos. They exist almost entirely on bamboo. They
> > > can live to be 30 years old. And these little guys will eventually get much
> > > bigger. They'll grow
> > >
> > > As you can see, it is starting with a period and ending on a word
> > > character! It's almost as if the fragments are just coming out as they will
> > > and the regex isn't doing anything at all, but the results are different
> > > when I use the gap fragmenter. In the above result I don't see any reason
> > > why it shouldn't have stripped out the preceding period and the last two
> > > words; there is plenty of room in the slop and in the regex pattern. Please
> > > help me figure out what I'm doing wrong...
> > >
> > > Thanks a lot,
> > >
> > > Mark Ferguson
> > >
> >
>
