Please 1) make sure with a separate program that these are the right Java regex patterns, and 2) write a unit test with all of the cases you expect this to handle. Then file a JIRA with the unit test code.
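
A minimal sketch of what step 1 could look like, using only plain java.util.regex. It runs the pattern and replacement from the charFilter quoted below through Matcher.replaceAll and checks a handful of inputs. The class name RepeatCollapseCheck and every case other than fooobarrrr are illustrative assumptions, not taken from the original message. It only shows whether the regex itself behaves as intended and says nothing about what PatternReplaceCharFilter does inside Solr.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for illustration.
class RepeatCollapseCheck {
    // Pattern and replacement copied from the quoted charFilter definition.
    static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");
    static final String REPLACEMENT = "$1$1";

    // Collapse any run of three or more identical word characters to two.
    static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // {input, expected}; only the first pair comes from the original post.
        String[][] cases = {
            {"fooobarrrr", "foobarr"},
            {"foobar", "foobar"},
            {"aaa", "aa"},
            {"1111", "11"},
            {"", ""},
        };
        for (String[] c : cases) {
            String actual = collapse(c[0]);
            if (!actual.equals(c[1])) {
                throw new AssertionError(
                    "'" + c[0] + "': expected '" + c[1] + "' but got '" + actual + "'");
            }
        }
        System.out.println("all cases passed");
    }
}
```

If these cases pass but the analysis form still shows fba, the difference is in the char filter rather than in the regex, and the same cases can be carried over into the unit test attached to the JIRA.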

On Sat, Jun 23, 2012 at 5:11 PM, Timothy Potter <thelabd...@gmail.com> wrote:
> Using 3.5 (also tried trunk), I have the following charFilter defined
> on my fieldType (just extended text_general to keep things simple):
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>             pattern="(\w)\1{2,}+"
>             replaceWith="$1$1"/>
>
> The intent of this charFilter is to match any characters that are
> repeated in a string more than twice and collapse down to a max of
> two, i.e.
>
> fooobarrrr => foobarr
>
> Using the analysis form, I end up with: fba
>
> Here is the full <fieldType> definition (just the one addition of the
> leading <charFilter>):
>
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(\w)\1{2,}+"
>                 replaceWith="$1$1"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> It seems like my regex and replacement strategy should work ... to
> prove it, I wrote a little Regex.java class in which I borrowed some
> from the PatternReplaceCharFilter class ... when I execute the
> following with my little hack, I get the expected results:
>
> [~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
> result: foobarr
>
> Is this a known issue or does anyone know how to work-around this? If
> not, I'll open a JIRA but wanted to check here first.
>
> Cheers,
> Tim
>
>>>>> Regex.java <<<<
>
> import java.util.regex.Pattern;
> import java.util.regex.Matcher;
>
> public class Regex {
>   public static void main(String[] args) throws Exception {
>     String toCompile = args[0];
>     Pattern p = Pattern.compile(toCompile);
>     System.out.println("result: "+processPattern(p, args[1], args[2]));
>   }
>
>   // borrowed from PatternReplaceCharFilter.java
>   private static CharSequence processPattern(Pattern pattern,
>       CharSequence input, String replacement) {
>     final Matcher m = pattern.matcher(input);
>
>     final StringBuffer cumulativeOutput = new StringBuffer();
>     int cumulative = 0;
>     int lastMatchEnd = 0;
>     while (m.find()) {
>       final int groupSize = m.end() - m.start();
>       final int skippedSize = m.start() - lastMatchEnd;
>       lastMatchEnd = m.end();
>       final int lengthBeforeReplacement = cumulativeOutput.length() +
>           skippedSize;
>       m.appendReplacement(cumulativeOutput, replacement);
>       final int replacementSize = cumulativeOutput.length() -
>           lengthBeforeReplacement;
>       if (groupSize != replacementSize) {
>         if (replacementSize < groupSize) {
>           cumulative += groupSize - replacementSize;
>           int atIndex = lengthBeforeReplacement + replacementSize;
>           //System.err.println(atIndex + "!" + cumulative);
>           //addOffCorrectMap(atIndex, cumulative);
>         }
>       }
>     }
>     m.appendTail(cumulativeOutput);
>     return cumulativeOutput;
>   }
> }

--
Lance Norskog
goks...@gmail.com