You don't actually need to escape most characters inside a character class, so escaping the period was unnecessary.
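To illustrate the point about character classes, here is a minimal sketch using Python's `re` module. Python is an assumption made only for illustration: Solr's regex fragmenter runs on Java's regex engine, which, as far as I know, treats a period inside a class the same way. The sketch compares `[.!?]` with `[\.!?]` on a few sample characters chosen for the test.

```python
import re

# The two spellings of the "ending punctuation" class from this thread.
unescaped = re.compile(r"[.!?]")
escaped = re.compile(r"[\.!?]")

# Sample characters picked for illustration only.
samples = [".", "!", "?", "a", " ", "\\"]


def matches(pattern, ch):
    """Return True if the single character ch is matched by the class."""
    return pattern.fullmatch(ch) is not None


# Map each sample to (matched by unescaped class, matched by escaped class).
results = {ch: (matches(unescaped, ch), matches(escaped, ch)) for ch in samples}

for ch, (plain, esc) in results.items():
    print(repr(ch), plain, esc)
```

Both classes accept exactly the same characters, and the period is a literal inside the class rather than a wildcard (note that `a` is rejected). The characters that usually do need care inside a class are `]`, `\`, a leading `^`, and a `-` between two other characters.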
I've tried using the example regex ([-\w ,/\n\"']{20,200}), and I'm _still_ getting lots of highlighted snippets that don't match the regex (starting with a period, etc.). Has anyone else had any trouble with the default regex fragmenter? If someone has used it and gotten the expected results, can you let me know, so I know that the problem is on my end?

Thanks for your help,

Mark

On Sun, Dec 14, 2008 at 8:34 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Shouldn't you escape the question mark at the end too?
>
> On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:
>
> > Someone helped me with the regex and pointed out a couple of mistakes, most
> > notably the extra quantifier in .*{400,600}. My new regex is this:
> >
> > \w.{400,600}[\.!?]
> >
> > Unfortunately, my results still aren't any better. Some results start with
> > a word character, some don't, and none seem to end with punctuation. Any
> > ideas what else could be wrong?
> >
> > Mark
> >
> > On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <mark.a.fergu...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am trying to use the regex fragmenter and am having a hard time getting
> > > the results I want. I am trying to get fragments that start on a word
> > > character and end on punctuation, but for some reason the fragments being
> > > returned to me seem to be very inflexible, even though I've provided a
> > > large slop. Here are the relevant parameters I'm using; maybe someone can
> > > help point out where I've gone wrong:
> > >
> > > <str name="hl.fragsize">500</str>
> > > <str name="hl.fragmenter">regex</str>
> > > <str name="hl.regex.slop">0.8</str>
> > > <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> > > <str name="hl">true</str>
> > > <str name="q">chinese</str>
> > >
> > > This should be matching between 400 and 600 characters, beginning with a
> > > word character and ending with one of .!?. Here is an example of a typical
> > > result:
> > >
> > > . Check these pictures out. Nine panda cubs on display for the first time
> > > Thursday in southwest China. They're less than a year old. They just
> > > recently stopped nursing. There are only 1,600 of these guys left in the
> > > mountain forests of central China, another 120 in
> > > <span class='hl'>Chinese</span> breeding facilities and zoos. And they're
> > > about 20 that live outside China in zoos. They exist almost entirely on
> > > bamboo. They can live to be 30 years old. And these little guys will
> > > eventually get much bigger. They'll grow
> > >
> > > As you can see, it is starting with a period and ending on a word
> > > character! It's almost as if the fragments are just coming out as they
> > > will and the regex isn't doing anything at all, but the results are
> > > different when I use the gap fragmenter. In the above result I don't see
> > > any reason why it shouldn't have stripped out the preceding period and the
> > > last two words; there is plenty of room in the slop and in the regex
> > > pattern. Please help me figure out what I'm doing wrong...
> > >
> > > Thanks a lot,
> > >
> > > Mark Ferguson