Alternatively, you can use "+" instead of "*" in your regular expressions so that you don't match blank terms at all.
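To make the "*" versus "+" point concrete, here is a rough sketch in plain java.util.regex (not Solr itself). The class and helper names (BlankTermDemo, trimPunct, acceptsBlank, dropBlanks) are made up for illustration, and the isEmpty() check only stands in for the LengthFilter idea from the quoted thread; it is not how Solr implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; uses only java.util.regex, no Solr classes.
final class BlankTermDemo {
    // The trimming pattern quoted from the original question.
    static final Pattern TRIM =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Mimics the replacement="$2" step: keep only the middle group.
    static String trimPunct(String term) {
        return TRIM.matcher(term).replaceAll("$2");
    }

    // True if the given regex accepts the empty string as a full match.
    static boolean acceptsBlank(String regex) {
        return Pattern.matches(regex, "");
    }

    // Stand-in for a length filter: trim each term, then drop blank results.
    static List<String> dropBlanks(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            String trimmed = trimPunct(term);
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A punctuation-only term collapses to an empty string.
        System.out.println("[" + trimPunct("--") + "]");
        // "*" allows a blank match, "+" does not.
        System.out.println(acceptsBlank("\\p{Punct}*"));
        System.out.println(acceptsBlank("\\p{Punct}+"));
        // Blank results are removed after trimming.
        System.out.println(dropBlanks(List.of("(usa)", "--", "photo!")));
    }
}
```

Note that simply swapping "*" for "+" in the trimming pattern above would change what it matches, so this only illustrates the general behaviour of the two quantifiers.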
I think the PatternTokenizer is doing the right thing, if your expression says that a blank term is acceptable.

On Tue, Oct 5, 2010 at 6:39 PM, Markus Jelsma <markus.jel...@buyways.nl> wrote:

> Actually, it might be a good idea to add an optional setting to the
> PatternTokenizer that doesn't emit blank terms. Perhaps an
> allowBlanks="false" would be a pleasant addition to the PatternTokenizer
> so an additional LengthFilter can be left out and thus spare CPU cycles and
> some memory.
>
> -----Original message-----
> From: Markus Jelsma <markus.jel...@buyways.nl>
> Sent: Wed 06-10-2010 00:29
> To: solr-user@lucene.apache.org;
> Subject: RE: PatternReplaceFilterFactory creating empty string as a term
>
> I'm not sure if this is the best approach but a LengthFilter will stop
> blank terms.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
>
> -----Original message-----
> From: Shawn Heisey <s...@elyograg.org>
> Sent: Wed 06-10-2010 00:25
> To: solr-user@lucene.apache.org;
> Subject: PatternReplaceFilterFactory creating empty string as a term
>
> I am developing a new schema. It has a pattern filter that trims
> leading and trailing punctuation from terms.
>
> <filter class="solr.PatternReplaceFilterFactory"
>   pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>   replacement="$2"
> />
>
> It is resulting in empty terms, because there are situations in the
> analyzer stream where a term happens to be composed of nothing but
> punctuation. This problem is not happening in production. I want those
> terms removed.
>
> This blank term makes the top of the list as far as term frequency. Out
> of 7.6 million documents, 4.8 million of them have it. From TermsComponent:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">19106</int>
>   </lst>
>   <lst name="terms">
>     <lst name="catchall">
>       <int name="">4830648</int>
>       <int name="usa">1863264</int>
>       <int name="photo">1743551</int>
>       <int name="new">1544314</int>
>       <int name="de">1455691</int>
>       <int name="during">1412551</int>
>       <int name="los">1408855</int>
>       <int name="united">1368594</int>
>       <int name="2009">1271103</int>
>       <int name="la">1204441</int>
>     </lst>
>   </lst>
> </response>
>
> Is there any existing way to remove empty terms during analysis? I tried
> TrimFilterFactory but that made no difference. Is this a bug in
> PatternReplaceFilterFactory?
>
> Shawn
>

--
Robert Muir
rcm...@gmail.com