I am developing a new schema. It has a pattern filter that trims
leading and trailing punctuation from terms.
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>
It is producing empty terms, because in some cases a term in the
analyzer stream consists of nothing but punctuation, so nothing is left
once the punctuation is stripped. This problem is not happening in
production. I want those terms removed.
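To illustrate how the empty terms arise, here is a minimal sketch that applies the same pattern and replacement with plain java.util.regex. It does not use any Solr classes. The class name TrimPunctDemo and the sample terms are made up for the illustration, and it assumes the filter replaces every match the way Matcher.replaceAll does.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; it only mimics the filter's regex step.
class TrimPunctDemo {
    // Same pattern as in the schema, with backslashes escaped for Java.
    private static final Pattern TRIM =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Keep only group 2, i.e. the text between leading and trailing punctuation.
    static String trim(String term) {
        return TRIM.matcher(term).replaceAll("$2");
    }

    public static void main(String[] args) {
        // Sample terms invented for this sketch.
        String[] terms = { "--hello!!", "a.b", "---" };
        for (String term : terms) {
            String out = trim(term);
            // A punctuation-only term comes out with length 0.
            System.out.println("[" + term + "] -> [" + out + "] length=" + out.length());
        }
    }
}
```

For a term such as "---", the first group consumes everything, so the replacement is an empty string. The pattern replacement only rewrites the term text, so the token itself stays in the stream.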
This blank term tops the list by term frequency. Out of 7.6 million
documents, 4.8 million of them contain it. From TermsComponent:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19106</int>
</lst>
<lst name="terms">
<lst name="catchall">
<int name="">4830648</int>
<int name="usa">1863264</int>
<int name="photo">1743551</int>
<int name="new">1544314</int>
<int name="de">1455691</int>
<int name="during">1412551</int>
<int name="los">1408855</int>
<int name="united">1368594</int>
<int name="2009">1271103</int>
<int name="la">1204441</int>
</lst>
</lst>
</response>
Is there any existing way to remove empty terms during analysis? I tried
TrimFilterFactory but that made no difference. Is this a bug in
PatternReplaceFilterFactory?
Shawn