Shouldn't you escape the question mark at the end too?

On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:
> Someone helped me with the regex and pointed out a couple of mistakes, most
> notably the extra quantifier in .*{400,600}. My new regex is this:
>
> \w.{400,600}[\.!?]
>
> Unfortunately, my results still aren't any better. Some results start with
> a word character, some don't, and none seem to end with punctuation. Any
> ideas what else could be wrong?
>
> Mark
>
> On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <mark.a.fergu...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I am trying to use the regex fragmenter and am having a hard time getting
> > the results I want. I am trying to get fragments that start on a word
> > character and end on punctuation, but for some reason the fragments being
> > returned to me seem to be very inflexible, even though I've provided a
> > large slop. Here are the relevant parameters I'm using; maybe someone can
> > help point out where I've gone wrong:
> >
> > <str name="hl.fragsize">500</str>
> > <str name="hl.fragmenter">regex</str>
> > <str name="hl.regex.slop">0.8</str>
> > <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> > <str name="hl">true</str>
> > <str name="q">chinese</str>
> >
> > This should be matching between 400-600 characters, beginning with a word
> > character and ending with one of .!?. Here is an example of a typical
> > result:
> >
> > . Check these pictures out. Nine panda cubs on display for the first time
> > Thursday in southwest China. They're less than a year old. They just
> > recently stopped nursing. There are only 1,600 of these guys left in the
> > mountain forests of central China, another 120 in <span
> > class='hl'>Chinese</span> breeding facilities and zoos. And they're about
> > 20 that live outside China in zoos. They exist almost entirely on bamboo.
> > They can live to be 30 years old. And these little guys will eventually
> > get much bigger. They'll grow
> >
> > As you can see, it is starting with a period and ending on a word
> > character! It's almost as if the fragments are just coming out as they
> > will and the regex isn't doing anything at all, but the results are
> > different when I use the gap fragmenter. In the above result I don't see
> > any reason why it shouldn't have stripped out the preceding period and
> > the last two words; there is plenty of room in the slop and in the regex
> > pattern. Please help me figure out what I'm doing wrong...
> >
> > Thanks a lot,
> >
> > Mark Ferguson
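
On the escaping question: inside a character class, `?` and `.` are ordinary characters in most regex flavors, so `[.!?]` and `[\.!?]` should behave the same. The minimal sketch below checks that and runs a scaled-down version of the pattern as a plain regex. The 10-30 range (instead of 400-600) and the shortened sample text are assumptions made to keep the example small. Python's `re` module is also an assumption chosen for convenience; Solr uses Java regular expressions, and this check says nothing about how the regex fragmenter combines the pattern with hl.fragsize and hl.regex.slop.

```python
import re

# Inside a character class, '.' and '?' are literal characters,
# so the escaped and unescaped classes accept the same input.
plain = re.compile(r"[.!?]")
escaped = re.compile(r"[\.!?]")

for ch in ".!?":
    assert plain.fullmatch(ch) is not None
    assert escaped.fullmatch(ch) is not None
assert plain.fullmatch("a") is None

# Scaled-down stand-in for \w.{400,600}[.!?] (range shortened for illustration).
pattern = re.compile(r"\w.{10,30}[.!?]")

# Shortened sample text, beginning with a stray period like the reported result.
text = ". Check these pictures out. Nine panda cubs on display."

match = pattern.search(text)
fragment = match.group(0) if match else None
print(fragment)
```

Used as a plain regex, the pattern skips the leading period and ends on punctuation, which suggests the character class is not the culprit and that the fragmenter's handling of the pattern is the place to look.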