Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Lance Norskog Sat, 23 Jun 2012 17:18:02 -0700

Please 1) make sure with a separate program that these are the right
Java regex patterns, and 2) write a unit test with all of the cases
you expect this to handle. Then file a JIRA with the unit test code.
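
A minimal sketch of step 1, assuming plain java.util.regex outside Solr. The
class name CollapseRepeatsCheck and every case other than fooobarrrr are
illustrative and not from the original report. It only confirms that the
pattern and replacement behave as intended on their own; it does not
exercise PatternReplaceCharFilter, so a test attached to a JIRA would still
need to run the same cases through the char filter itself.

```java
import java.util.regex.Pattern;

class CollapseRepeatsCheck {
  // Same pattern and replacement as the charFilter in the quoted schema.
  static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");
  static final String REPLACEMENT = "$1$1";

  // Collapse any run of three or more identical word characters to two.
  static String collapse(String input) {
    return REPEATS.matcher(input).replaceAll(REPLACEMENT);
  }

  // Fail loudly if the actual result differs from the expected one.
  static void check(String input, String expected) {
    String actual = collapse(input);
    if (!actual.equals(expected)) {
      throw new AssertionError(
          input + ": expected " + expected + " but got " + actual);
    }
  }

  public static void main(String[] args) {
    check("fooobarrrr", "foobarr"); // the case from the report
    check("foobar", "foobar");      // runs of two are left alone
    check("aaa", "aa");             // the whole input is one run
    check("", "");                  // empty input stays empty
    System.out.println("all cases passed");
  }
}
```

If these pass while the analysis form still shows a different result, that
points at the filter rather than at the regex.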


On Sat, Jun 23, 2012 at 5:11 PM, Timothy Potter <[email protected]> wrote:
> Using 3.5 (also tried trunk), I have the following charFilter defined
> on my fieldType (just extended text_general to keep things simple):
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>             pattern="(\w)\1{2,}+"
>             replaceWith="$1$1"/>
>
> The intent of this charFilter is to match any character that is
> repeated three or more times in a row and collapse the run down to a
> max of two, i.e.
>
> fooobarrrr  =>  foobarr
>
> Using the analysis form, I end up with: fba
>
> Here is the full <fieldType> definition (just the one addition of the
> leading <charFilter>):
>
>     <fieldType name="text_general" class="solr.TextField"
>                positionIncrementGap="100">
>       <analyzer type="index">
>         <charFilter class="solr.PatternReplaceCharFilterFactory"
>            pattern="(\w)\1{2,}+"
>            replaceWith="$1$1"/>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> It seems like my regex and replacement strategy should work ... to
> prove it, I wrote a little Regex.java class in which I borrowed some
> code from the PatternReplaceCharFilter class ... when I execute the
> following with my little hack, I get the expected results:
>
> [~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
> result: foobarr
>
> Is this a known issue, or does anyone know how to work around it? If
> not, I'll open a JIRA but wanted to check here first.
>
> Cheers,
> Tim
>
>>>>> Regex.java  <<<<
>
> import java.util.regex.Pattern;
> import java.util.regex.Matcher;
>
> public class Regex {
>   public static void main(String[] args) throws Exception {
>     String toCompile = args[0];
>     Pattern p = Pattern.compile(toCompile);
>     System.out.println("result: "+processPattern(p, args[1], args[2]));
>   }
>
>   // borrowed from PatternReplaceCharFilter.java
>   private static CharSequence processPattern(Pattern pattern,
>       CharSequence input, String replacement) {
>     final Matcher m = pattern.matcher(input);
>
>     final StringBuffer cumulativeOutput = new StringBuffer();
>     int cumulative = 0;
>     int lastMatchEnd = 0;
>     while (m.find()) {
>       final int groupSize = m.end() - m.start();
>       final int skippedSize = m.start() - lastMatchEnd;
>       lastMatchEnd = m.end();
>       final int lengthBeforeReplacement =
>           cumulativeOutput.length() + skippedSize;
>       m.appendReplacement(cumulativeOutput, replacement);
>       final int replacementSize =
>           cumulativeOutput.length() - lengthBeforeReplacement;
>       if (groupSize != replacementSize) {
>         if (replacementSize < groupSize) {
>           cumulative += groupSize - replacementSize;
>           int atIndex = lengthBeforeReplacement + replacementSize;
>           //System.err.println(atIndex + "!" + cumulative);
>           //addOffCorrectMap(atIndex, cumulative);
>         }
>       }
>     }
>     m.appendTail(cumulativeOutput);
>     return cumulativeOutput;
>   }
> }



-- 
Lance Norskog
[email protected]
