Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Jack Krupansky Sat, 23 Jun 2012 19:17:55 -0700

The char filter's attribute name is "replacement", not "replaceWith". I tried it and it seems to work fine (with Solr 3.6).


<charFilter class="solr.PatternReplaceCharFilterFactory"
  pattern="(\w)\1{2,}+"
  replacement="$1$1"/>
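
For what it's worth, the "fba" result is what you would get if the replacement were an empty string. That fits the assumption (not verified against the 3.5 source here) that the unrecognized "replaceWith" attribute is ignored and the replacement falls back to "". A minimal sketch with plain java.util.regex, where the class and method names are made up for illustration, shows both outcomes:

```java
import java.util.regex.Pattern;

public class CollapseRepeats {
    // Same pattern as in the config: a word character followed by
    // two or more further copies of itself (possessive quantifier).
    static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");

    // Replace every run matched by REPEATS with the given replacement string.
    static String apply(String input, String replacement) {
        return REPEATS.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "fooobarrrr";
        // Intended behaviour: collapse each run down to two characters.
        System.out.println(apply(input, "$1$1"));
        // Assumed fallback when no "replacement" attribute is seen:
        // each run is replaced by the empty string.
        System.out.println(apply(input, ""));
    }
}
```

With the input from the original message, the first call yields foobarr and the second yields fba, which matches what the analysis form showed.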


See:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

-- Jack Krupansky

-----Original Message-----
From: Timothy Potter

Sent: Saturday, June 23, 2012 7:11 PM
To: solr-user@lucene.apache.org

Subject: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly


Using 3.5 (also tried trunk), I have the following charFilter defined
on my fieldType (just extended text_general to keep things simple):

<charFilter class="solr.PatternReplaceCharFilterFactory"
           pattern="(\w)\1{2,}+"
           replaceWith="$1$1"/>

The intent of this charFilter is to match any characters that are
repeated in a string more than twice and collapse down to a max of
two, i.e.

fooobarrrr  =>  foobarr

Using the analysis form, I end up with: fba

Here is the full <fieldType> definition (just the one addition of the
leading <charFilter>):

   <fieldType name="text_general" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer type="index">
       <charFilter class="solr.PatternReplaceCharFilterFactory"
                   pattern="(\w)\1{2,}+"
                   replaceWith="$1$1"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.SynonymFilterFactory"
               synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

It seems like my regex and replacement strategy should work ... to
prove it, I wrote a little Regex.java class in which I borrowed some
from the PatternReplaceCharFilter class ... when I execute the
following with my little hack, I get the expected results:

[~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
result: foobarr

Is this a known issue, or does anyone know how to work around this? If
not, I'll open a JIRA but wanted to check here first.

Cheers,
Tim

Regex.java  <<<<


import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
  public static void main(String[] args) throws Exception {
    String toCompile = args[0];
    Pattern p = Pattern.compile(toCompile);
    System.out.println("result: " + processPattern(p, args[1], args[2]));
  }

  // borrowed from PatternReplaceCharFilter.java
  private static CharSequence processPattern(Pattern pattern,
      CharSequence input, String replacement) {
    final Matcher m = pattern.matcher(input);

    final StringBuffer cumulativeOutput = new StringBuffer();
    int cumulative = 0;
    int lastMatchEnd = 0;
    while (m.find()) {
      final int groupSize = m.end() - m.start();
      final int skippedSize = m.start() - lastMatchEnd;
      lastMatchEnd = m.end();
      final int lengthBeforeReplacement =
          cumulativeOutput.length() + skippedSize;
      m.appendReplacement(cumulativeOutput, replacement);
      final int replacementSize =
          cumulativeOutput.length() - lengthBeforeReplacement;
      if (groupSize != replacementSize) {
        if (replacementSize < groupSize) {
          cumulative += groupSize - replacementSize;
          int atIndex = lengthBeforeReplacement + replacementSize;
          //System.err.println(atIndex + "!" + cumulative);
          //addOffCorrectMap(atIndex, cumulative);
        }
      }
    }
    m.appendTail(cumulativeOutput);
    return cumulativeOutput;
  }

}
