The char filter's attribute name is "replacement", not "replaceWith". I
tried your pattern with "replacement" and it seems to work fine (with
Solr 3.6).
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replacement="$1$1"/>
See:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
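For a quick check outside Solr, here is a minimal Java sketch that runs
the same pattern with two different replacement strings. It rests on an
assumption: that the unrecognized "replaceWith" attribute is ignored and
the replacement falls back to an empty string. That would be consistent
with the "fba" output from the analysis form, but I have not traced it
through the factory code. The class name ReplacementDemo and the method
collapse are made up for this illustration.

```java
import java.util.regex.Pattern;

public class ReplacementDemo {
    // Same pattern as in the charFilter: a word character followed by
    // two or more repeats of itself (possessive quantifier).
    static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");

    // Hypothetical helper: replaces every run matched by REPEATS.
    static String collapse(String input, String replacement) {
        return REPEATS.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Intended behavior: keep two copies of the repeated character.
        System.out.println(collapse("fooobarrrr", "$1$1")); // foobarr

        // Assumed fallback when no "replacement" attribute is read:
        // an empty string, which deletes each run entirely.
        System.out.println(collapse("fooobarrrr", ""));     // fba
    }
}
```

If the second line matches what the analysis form shows, the regex
itself is fine and only the attribute name needs to change.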
-- Jack Krupansky
-----Original Message-----
From: Timothy Potter
Sent: Saturday, June 23, 2012 7:11 PM
To: solr-user@lucene.apache.org
Subject: Having an issue with the solr.PatternReplaceCharFilterFactory not
replacing characters correctly
Using 3.5 (also tried trunk), I have the following charFilter defined
on my fieldType (just extended text_general to keep things simple):
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replaceWith="$1$1"/>
The intent of this charFilter is to match any character that is
repeated more than twice in a row and collapse the run down to a
maximum of two, e.g.
fooobarrrr => foobarr
Using the analysis form, I end up with: fba
Here is the full <fieldType> definition (just the one addition of the
leading <charFilter>):
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)\1{2,}+"
                replaceWith="$1$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
It seems like my regex and replacement strategy should work. To
prove it, I wrote a little Regex.java class that borrows some code
from the PatternReplaceCharFilter class. When I execute the
following with my little hack, I get the expected results:
[~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
result: foobarr
Is this a known issue, or does anyone know how to work around it? If
not, I'll open a JIRA, but I wanted to check here first.
Cheers,
Tim
Regex.java <<<<
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
  public static void main(String[] args) throws Exception {
    String toCompile = args[0];
    Pattern p = Pattern.compile(toCompile);
    System.out.println("result: "+processPattern(p, args[1], args[2]));
  }

  // borrowed from PatternReplaceCharFilter.java
  private static CharSequence processPattern(Pattern pattern,
      CharSequence input, String replacement) {
    final Matcher m = pattern.matcher(input);
    final StringBuffer cumulativeOutput = new StringBuffer();
    int cumulative = 0;
    int lastMatchEnd = 0;
    while (m.find()) {
      final int groupSize = m.end() - m.start();
      final int skippedSize = m.start() - lastMatchEnd;
      lastMatchEnd = m.end();
      final int lengthBeforeReplacement =
          cumulativeOutput.length() + skippedSize;
      m.appendReplacement(cumulativeOutput, replacement);
      final int replacementSize =
          cumulativeOutput.length() - lengthBeforeReplacement;
      if (groupSize != replacementSize) {
        if (replacementSize < groupSize) {
          cumulative += groupSize - replacementSize;
          int atIndex = lengthBeforeReplacement + replacementSize;
          //System.err.println(atIndex + "!" + cumulative);
          //addOffCorrectMap(atIndex, cumulative);
        }
      }
    }
    m.appendTail(cumulativeOutput);
    return cumulativeOutput;
  }
}