Regular expressions won't work well for sentence boundary detection.
If you want something free, you could plug in OpenNLP or GATE. Or LingPipe,
but that's not free.
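
To illustrate why a regex falls short here, below is a minimal sketch. The pattern, the helper name naive_split, and the sample text are all made up for illustration; nothing in it comes from Solr, OpenNLP, GATE, or LingPipe. It splits on whitespace that follows '.', '!' or '?', which is roughly what a first-attempt regex would do.

```python
import re

# Naive rule (an assumption for this sketch): a sentence ends at
# '.', '!' or '?' when that character is followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_split(text):
    """Split text into 'sentences' using the naive regex rule above."""
    return [s for s in NAIVE_BOUNDARY.split(text.strip()) if s]

# Two real sentences, but abbreviations also end in a period.
sample = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. on Friday."

sentences = naive_split(sample)
for s in sentences:
    print(repr(s))
```

The sample holds two sentences, yet the naive rule returns four pieces because "Dr." and "p.m." look like sentence ends. Patching the pattern for each abbreviation tends to become an ever-growing exception list, which is the gap trained sentence detectors are meant to close.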
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
----- Original Message -----
> From: Caleb Land
> To: solr-u
On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson wrote:
> Hmmm, I'll have to defer to the highlighter experts here
>
>
I've looked at the source code for the highlighter, and I think I know
what's going on. I haven't had time to play with this yet, so I could be
wrong, but this is my impression.
Hmmm, I'll have to defer to the highlighter experts here
Erick
On Wed, Jan 6, 2010 at 3:23 PM, Caleb Land wrote:
> I've looked at the docs/source for WordDelimiterFilter, and I understand
> what it does now.
>
> Here is my configuration:
>
> http://gist.github.com/270590
>
> I've tried the
I've looked at the docs/source for WordDelimiterFilter, and I understand
what it does now.
Here is my configuration:
http://gist.github.com/270590
I've tried the StandardTokenizerFactory instead of the
WhitespaceTokenizerFactory, but I get the same problem as before, the
period from the previo
Hmmm, the name WordDelimiterFilterFactory might be leading
you astray. Its purpose isn't to break things up into "words"
that have anything to do with grammatical rules. Rather, its
purpose is to break up strings of funky characters into
searchable stuff. See:
http://wiki.apache.org/solr/Analyzers
I've tracked this problem down to the fact that I'm using the
WordDelimiterFilter. I don't quite understand what's happening, but if I
add preserveOriginal="1" as an option, everything looks fine. I think it has
to do with the period being stripped in the token stream.
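
A toy model of that effect, in case it helps picture it. This is NOT Solr's WordDelimiterFilter code: the function toy_word_delimiter, its splitting rule, and the order in which tokens are emitted are simplifying assumptions made only for this sketch. It just shows how a trailing period can vanish from the token stream, and how keeping the original token alongside the split parts retains it.

```python
import re

def toy_word_delimiter(tokens, preserve_original=False):
    """Toy stand-in for a word-delimiter filter (not Solr's implementation).

    Each token is split on runs of non-alphanumeric characters and the
    delimiters themselves are dropped. With preserve_original=True the
    untouched token is emitted as well, whenever splitting changed it.
    """
    out = []
    for token in tokens:
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
        if preserve_original and parts != [token]:
            out.append(token)  # keep the token exactly as it came in
        out.extend(parts)
    return out

# Whitespace tokenization keeps the period attached to the last word.
tokens = "The end. Wi-Fi works.".split()

print(toy_word_delimiter(tokens))
print(toy_word_delimiter(tokens, preserve_original=True))
```

In the first call "end." comes out only as "end", so the period is gone from the stream. In the second call both "end." and "end" are present. Whether this is what actually confuses the highlighter is still a guess on my part.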
On Tue, Jan 5, 2010 at 2:05