Actually, it might be a good idea to add an optional setting to the 
PatternTokenizer that doesn't emit blank terms. Perhaps an allowBlanks="false" 
option would be a pleasant addition to the PatternTokenizer, so that an 
additional LengthFilter can be left out, sparing CPU cycles and some memory.
 
-----Original message-----
From: Markus Jelsma <markus.jel...@buyways.nl>
Sent: Wed 06-10-2010 00:29
To: solr-user@lucene.apache.org; 
Subject: RE: PatternReplaceFilterFactory creating empty string as a term

I'm not sure if this is the best approach but a LengthFilter will stop blank 
terms.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
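
A minimal sketch of how that could look in the analyzer chain, assuming the 
LengthFilter is placed directly after the pattern filter from the original 
message. The max value of 255 is an arbitrary assumption and not something 
from this thread; min="1" is the part that drops the zero-length terms.

```xml
<!-- Strips leading and trailing punctuation; can leave an empty term behind -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
        replacement="$2"
/>
<!-- Drops terms shorter than min or longer than max.
     max="255" is an assumed upper bound; adjust it to your data. -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

The order matters here: the LengthFilter has to come after the filter that 
produces the empty terms, otherwise it never sees them.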
 
-----Original message-----
From: Shawn Heisey <s...@elyograg.org>
Sent: Wed 06-10-2010 00:25
To: solr-user@lucene.apache.org; 
Subject: PatternReplaceFilterFactory creating empty string as a term

I am developing a new schema. It has a pattern filter that trims 
leading and trailing punctuation from terms.

<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>

It is resulting in empty terms, because there are situations in the 
analyzer stream where a term happens to be composed of nothing but 
punctuation. This problem is not happening in production. I want those 
terms removed.

This blank term tops the list by term frequency. Out of 7.6 million 
documents, 4.8 million of them contain it. From TermsComponent:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19106</int>
</lst>
<lst name="terms">
<lst name="catchall">
<int name="">4830648</int>
<int name="usa">1863264</int>
<int name="photo">1743551</int>
<int name="new">1544314</int>
<int name="de">1455691</int>
<int name="during">1412551</int>
<int name="los">1408855</int>
<int name="united">1368594</int>
<int name="2009">1271103</int>
<int name="la">1204441</int>
</lst>
</lst>
</response>

Is there any existing way to remove empty terms during analysis? I tried 
TrimFilterFactory but that made no difference. Is this a bug in 
PatternReplaceFilterFactory?

Shawn
