Re: Using Regex fragmenter to extract paragraphs

Mark Ferguson Fri, 12 Dec 2008 15:23:25 -0800

Someone helped me with the regex and pointed out a couple of mistakes, most
notably the extra quantifier in .*{400,600}. My new regex is this:


\w.{400,600}[\.!?]

Unfortunately, my results still aren't any better. Some results start with a
word character, some don't, and none seem to end with punctuation. Any ideas
what else could be wrong?
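
For reference, here is a minimal sketch for checking the new pattern on its own, outside Solr. It uses Python's re module rather than the Java regex engine Solr runs on, on the assumption that this simple pattern behaves the same in both. The sample text is made up for illustration, and the sketch does not reproduce how the regex fragmenter applies the pattern or the slop.

```python
import re

# The pattern from this message: one word character, then 400-600
# arbitrary characters, then sentence-ending punctuation.
PATTERN = re.compile(r"\w.{400,600}[\.!?]")


def find_fragments(text):
    """Return all non-overlapping matches of PATTERN in text."""
    return [m.group(0) for m in PATTERN.finditer(text)]


# Hypothetical sample text (not real index data): one short sentence repeated.
sample = "Pandas eat bamboo. " * 60

fragments = find_fragments(sample)
for frag in fragments:
    # Show the length plus the first and last few characters of each match.
    print(len(frag), repr(frag[:20]), repr(frag[-20:]))
```

On plain text like this, every match begins with a word character and ends with punctuation, so the pattern itself seems to do what I intended. Note that "." does not match newlines by default, which could matter for text that spans several lines.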

Mark



On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:

> Hello,
>
> I am trying to use the regex fragmenter and am having a hard time getting
> the results I want. I am trying to get fragments that start on a word
> character and end on punctuation, but for some reason the fragments being
> returned to me seem to be very inflexible, even though I've provided a
> large slop. Here are the relevant parameters I'm using, maybe someone can
> help point out where I've gone wrong:
>
> <str name="hl.fragsize">500</str>
> <str name="hl.fragmenter">regex</str>
> <str name="hl.regex.slop">0.8</str>
> <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> <str name="hl">true</str>
> <str name="q">chinese</str>
>
> This should be matching between 400-600 characters, beginning with a word
> character and ending with one of .!?. Here is an example of a typical
> result:
>
> . Check these pictures out. Nine panda cubs on display for the first time
> Thursday in southwest China. They're less than a year old. They just
> recently stopped nursing. There are only 1,600 of these guys left in the
> mountain forests of central China, another 120 in <span
> class='hl'>Chinese</span> breeding facilities and zoos. And they're about 20
> that live outside China in zoos. They exist almost entirely on bamboo. They
> can live to be 30 years old. And these little guys will eventually get much
> bigger. They'll grow
>
> As you can see, it is starting with a period and ending on a word
> character! It's almost as if the fragments are just coming out as they will
> and the regex isn't doing anything at all, but the results are different
> when I use the gap fragmenter. In the above result I don't see any reason
> why it shouldn't have stripped out the preceding period and the last two
> words; there is plenty of room in the slop and in the regex pattern. Please
> help me figure out what I'm doing wrong...
>
> Thanks a lot,
>
> Mark Ferguson
>
