Re: PatternReplaceFilterFactory creating empty string as a term

Ken Krugler Tue, 05 Oct 2010 15:35:01 -0700


On Oct 5, 2010, at 6:24pm, Shawn Heisey wrote:

I am developing a new schema. It has a pattern filter that trims leading and trailing punctuation from terms.
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>
It is resulting in empty terms, because there are situations in the analyzer stream where a term happens to be composed of nothing but punctuation. This problem is not happening in production. I want those terms removed.
This blank term makes the top of the list as far as term frequency. Out of 7.6 million documents, 4.8 million of them have it. From TermsComponent:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19106</int>
</lst>
<lst name="terms">
<lst name="catchall">
<int name="">4830648</int>


[snip]

Is there any existing way to remove empty terms during analysis? I tried TrimFilterFactory but that made no difference.

You could use LengthFilterFactory to restrict terms to being at least one character long.
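
A minimal sketch of what that might look like, assuming the length filter is placed after the PatternReplaceFilterFactory entry in the same analyzer chain, so that it sees the already-trimmed terms. The max value is an arbitrary placeholder chosen for illustration, not something taken from the schema under discussion:

```xml
<!-- Existing filter from the question: strips leading/trailing punctuation -->
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>
<!-- Assumed addition: drops any term shorter than 1 character,
     which removes the empty strings left behind by the filter above.
     max="255" is a placeholder upper bound; adjust it to suit the data. -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

Order matters here: if the length filter ran before the pattern filter, the punctuation-only terms would still have a nonzero length when checked and would pass through.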

Is this a bug in PatternReplaceFilterFactory?

No, I don't believe so. PatternReplaceFilterFactory creates a PatternReplaceFilter, and the JavaDoc for that says:

Note: Depending on the input and the pattern used and the input TokenStream, this TokenFilter may produce Tokens whose text is the empty string.


-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225




--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
