Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Timothy Potter Sun, 24 Jun 2012 09:18:35 -0700

Awesome find, Jack - thanks! I had copied the "replaceWith" attribute from
http://lucidworks.lucidimagination.com/display/solr/CharFilterFactories
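
For the archive, a minimal sketch of why the analysis form showed "fba".
It assumes the factory ignores the unrecognized "replaceWith" attribute
and falls back to an empty replacement string; that is my reading of the
symptom, not something confirmed against the source. Under that
assumption each matched run is deleted outright instead of being
collapsed. CollapseRepeats and collapse are hypothetical names used only
for illustration.

```java
import java.util.regex.Pattern;

public class CollapseRepeats {
    // Same pattern as in the charFilter: one word character followed by
    // two or more copies of itself (possessive quantifier).
    private static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");

    // Replaces every run of three or more identical word characters.
    public static String collapse(String input, String replacement) {
        return REPEATS.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Assumption: a missing "replacement" attribute behaves like "".
        System.out.println(collapse("fooobarrrr", ""));     // fba
        // With the correctly named attribute value from Jack's reply.
        System.out.println(collapse("fooobarrrr", "$1$1")); // foobarr
    }
}
```

The first call reproduces the output I saw in the analysis form, and the
second gives the result I was after.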


Cheers,
Tim

On Sat, Jun 23, 2012 at 8:16 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> The char filter's attribute name is "replacement", not "replaceWith". I
> tried it and it seems to work fine (with Solr 3.6).
>
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>  pattern="(\w)\1{2,}+"
>  replacement="$1$1"/>
>
> See:
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Timothy Potter
> Sent: Saturday, June 23, 2012 7:11 PM
> To: solr-user@lucene.apache.org
> Subject: Having an issue with the solr.PatternReplaceCharFilterFactory not
> replacing characters correctly
>
>
> Using 3.5 (also tried trunk), I have the following charFilter defined
> on my fieldType (just extended text_general to keep things simple):
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>           pattern="(\w)\1{2,}+"
>           replaceWith="$1$1"/>
>
> The intent of this charFilter is to match any character that is
> repeated more than twice in a row and collapse the run down to a
> max of two, e.g.
>
> fooobarrrr  =>  foobarr
>
> Using the analysis form, I end up with: fba
>
> Here is the full <fieldType> definition (just the one addition of the
> leading <charFilter>):
>
>   <fieldType name="text_general" class="solr.TextField"
>              positionIncrementGap="100">
>     <analyzer type="index">
>       <charFilter class="solr.PatternReplaceCharFilterFactory"
>          pattern="(\w)\1{2,}+"
>          replaceWith="$1$1"/>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>               words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>               words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.SynonymFilterFactory"
>               synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> It seems like my regex and replacement strategy should work. To
> prove it, I wrote a little Regex.java class that borrows some code
> from the PatternReplaceCharFilter class. When I execute the
> following with my little hack, I get the expected result:
>
> [~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
> result: foobarr
>
> Is this a known issue, or does anyone know how to work around it? If
> not, I'll open a JIRA, but I wanted to check here first.
>
> Cheers,
> Tim
>
> >>>> Regex.java <<<<
>
>
> import java.util.regex.Pattern;
> import java.util.regex.Matcher;
>
> public class Regex {
>   public static void main(String[] args) throws Exception {
>       String toCompile = args[0];
>       Pattern p = Pattern.compile(toCompile);
>       System.out.println("result: "+processPattern(p, args[1], args[2]));
>   }
>
>  // borrowed from PatternReplaceCharFilter.java
>  private static CharSequence processPattern(Pattern pattern,
>      CharSequence input, String replacement) {
>   final Matcher m = pattern.matcher(input);
>
>   final StringBuffer cumulativeOutput = new StringBuffer();
>   int cumulative = 0;
>   int lastMatchEnd = 0;
>   while (m.find()) {
>     final int groupSize = m.end() - m.start();
>     final int skippedSize = m.start() - lastMatchEnd;
>     lastMatchEnd = m.end();
>     final int lengthBeforeReplacement =
>         cumulativeOutput.length() + skippedSize;
>     m.appendReplacement(cumulativeOutput, replacement);
>     final int replacementSize =
>         cumulativeOutput.length() - lengthBeforeReplacement;
>     if (groupSize != replacementSize) {
>       if (replacementSize < groupSize) {
>         cumulative += groupSize - replacementSize;
>         int atIndex = lengthBeforeReplacement + replacementSize;
>         //System.err.println(atIndex + "!" + cumulative);
>         //addOffCorrectMap(atIndex, cumulative);
>       }
>     }
>   }
>   m.appendTail(cumulativeOutput);
>   return cumulativeOutput;
>  }
> }
