Alternatively, you can use "+" instead of "*" in your regular expressions so
that you don't match blank terms at all...

I think the PatternTokenizer is doing the right thing, if your expression
says that a blank term is acceptable.
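
To make the behaviour concrete, here is a minimal sketch in plain Java (not
Solr code) that applies the filter pattern quoted further down in this thread
to a few sample terms and then drops empty results, which is roughly what a
LengthFilter with a minimum length of 1 would do. The class name TrimPunctDemo
and the helpers trim() and analyze() are made up for illustration, and the
sketch assumes the filter replaces every match, as String.replaceAll does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TrimPunctDemo {
    // Same expression as in the quoted schema: every group is optional,
    // so a term made only of punctuation still matches with an empty $2.
    static final Pattern TRIM =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Hypothetical helper: strip leading and trailing punctuation.
    static String trim(String term) {
        return TRIM.matcher(term).replaceAll("$2");
    }

    // Hypothetical helper: trim each term, then discard empty results,
    // standing in for a length filter with a minimum length of 1.
    static List<String> analyze(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            String trimmed = trim(t);
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A punctuation-only term collapses to the empty string.
        System.out.println("[" + trim("---") + "]");
        // With the empty-term check, only real terms survive.
        System.out.println(analyze(Arrays.asList("(hello)", "---", "usa,")));
    }
}
```

Running main should show that "---" is reduced to an empty string by the
replacement alone, and that only "hello" and "usa" remain once empty results
are discarded.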

On Tue, Oct 5, 2010 at 6:39 PM, Markus Jelsma <markus.jel...@buyways.nl> wrote:

> Actually, it might be a good idea to add an optional setting to the
> PatternTokenizer so that it doesn't emit blank terms. Perhaps an
> allowBlanks="false" would be a pleasant addition to the PatternTokenizer,
> so an additional LengthFilter can be left out, sparing CPU cycles and
> some memory.
>
> -----Original message-----
> From: Markus Jelsma <markus.jel...@buyways.nl>
> Sent: Wed 06-10-2010 00:29
> To: solr-user@lucene.apache.org;
> Subject: RE: PatternReplaceFilterFactory creating empty string as a term
>
> I'm not sure if this is the best approach but a LengthFilter will stop
> blank terms.
>
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
>
> -----Original message-----
> From: Shawn Heisey <s...@elyograg.org>
> Sent: Wed 06-10-2010 00:25
> To: solr-user@lucene.apache.org;
> Subject: PatternReplaceFilterFactory creating empty string as a term
>
>  I am developing a new schema. It has a pattern filter that trims
> leading and trailing punctuation from terms.
>
> <filter class="solr.PatternReplaceFilterFactory"
> pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
> replacement="$2"
> />
>
> It is resulting in empty terms, because there are situations in the
> analyzer stream where a term happens to be composed of nothing but
> punctuation. This problem is not happening in production. I want those
> terms removed.
>
> This blank term makes the top of the list as far as term frequency. Out
> of 7.6 million documents, 4.8 million of them have it. From TermsComponent:
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">19106</int>
> </lst>
> <lst name="terms">
> <lst name="catchall">
> <int name="">4830648</int>
> <int name="usa">1863264</int>
> <int name="photo">1743551</int>
> <int name="new">1544314</int>
> <int name="de">1455691</int>
> <int name="during">1412551</int>
> <int name="los">1408855</int>
> <int name="united">1368594</int>
> <int name="2009">1271103</int>
> <int name="la">1204441</int>
> </lst>
> </lst>
> </response>
>
> Is there any existing way to remove empty terms during analysis? I tried
> TrimFilterFactory but that made no difference. Is this a bug in
> PatternReplaceFilterFactory?
>
> Shawn
>
>


-- 
Robert Muir
rcm...@gmail.com
