Yeah, it was kind of unfortunate that the posted example in SOLR-1653 used
"replaceWith" but the committed code used "replacement". The detailed
commentary on the issue notes the change, but the change occurred between
the last posted patch and the commit. The source code and javadoc "rule",
but we tend to assume that the Jira is more accurate than it necessarily is.
-- Jack Krupansky
-----Original Message-----
From: Timothy Potter
Sent: Sunday, June 24, 2012 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Having an issue with the solr.PatternReplaceCharFilterFactory
not replacing characters correctly
Awesome find, Jack - thanks! I copied the "replaceWith" bit from
http://lucidworks.lucidimagination.com/display/solr/CharFilterFactories
Cheers,
Tim
On Sat, Jun 23, 2012 at 8:16 PM, Jack Krupansky <j...@basetechnology.com>
wrote:
The char filter's attribute name is "replacement", not "replaceWith". I
tried it and it seems to work fine (with Solr 3.6).
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replacement="$1$1"/>
See:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
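For illustration, here is a minimal Java sketch of why the analysis form would show "fba". It assumes that the unrecognized "replaceWith" attribute is simply ignored and that the replacement then falls back to an empty string; that default is inferred from the reported output and is not confirmed in this thread. The class and method names (ReplacementDemo, collapse) are hypothetical, and plain java.util.regex is used instead of the Solr char filter.

```java
import java.util.regex.Pattern;

public class ReplacementDemo {
    // Same pattern as in the charFilter definition.
    static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");

    // Replace every run of three or more identical word characters.
    static String collapse(String input, String replacement) {
        return REPEATS.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Assumed effect of the misspelled attribute: empty replacement.
        System.out.println(collapse("fooobarrrr", ""));      // fba
        // Intended behavior with replacement="$1$1".
        System.out.println(collapse("fooobarrrr", "$1$1"));  // foobarr
    }
}
```

With an empty replacement each repeated run is deleted outright, which matches the "fba" result; with "$1$1" each run is collapsed to two characters.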
-- Jack Krupansky
-----Original Message-----
From: Timothy Potter
Sent: Saturday, June 23, 2012 7:11 PM
To: solr-user@lucene.apache.org
Subject: Having an issue with the solr.PatternReplaceCharFilterFactory not
replacing characters correctly
Using 3.5 (also tried trunk), I have the following charFilter defined
on my fieldType (just extended text_general to keep things simple):
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replaceWith="$1$1"/>
The intent of this charFilter is to match any characters that are
repeated in a string more than twice and collapse down to a max of
two, i.e.
fooobarrrr => foobarr
Using the analysis form, I end up with: fba
Here is the full <fieldType> definition (just the one addition of the
leading <charFilter>):
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)\1{2,}+"
                replaceWith="$1$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
It seems like my regex and replacement strategy should work ... to
prove it, I wrote a little Regex.java class in which I borrowed some
code from the PatternReplaceCharFilter class ... when I execute the
following with my little hack, I get the expected results:
[~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
result: foobarr
Is this a known issue, or does anyone know how to work around this? If
not, I'll open a JIRA but wanted to check here first.
Cheers,
Tim
Regex.java <<<<
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

  // Usage: java Regex <pattern> <input> <replacement>
  public static void main(String[] args) throws Exception {
    if (args.length != 3) {
      System.err.println("usage: java Regex <pattern> <input> <replacement>");
      System.exit(1);
    }
    Pattern p = Pattern.compile(args[0]);
    System.out.println("result: " + processPattern(p, args[1], args[2]));
  }

  // borrowed from PatternReplaceCharFilter.java
  private static CharSequence processPattern(Pattern pattern,
      CharSequence input, String replacement) {
    final Matcher m = pattern.matcher(input);
    final StringBuffer cumulativeOutput = new StringBuffer();
    int cumulative = 0;    // total characters removed so far
    int lastMatchEnd = 0;  // end of the previous match in the input
    while (m.find()) {
      final int groupSize = m.end() - m.start();
      final int skippedSize = m.start() - lastMatchEnd;
      lastMatchEnd = m.end();
      // output length once the unmatched text before this match is copied
      final int lengthBeforeReplacement = cumulativeOutput.length() +
          skippedSize;
      m.appendReplacement(cumulativeOutput, replacement);
      final int replacementSize = cumulativeOutput.length() -
          lengthBeforeReplacement;
      if (groupSize != replacementSize) {
        if (replacementSize < groupSize) {
          // the match shrank; the original filter records an offset
          // correction here (disabled in this standalone hack)
          cumulative += groupSize - replacementSize;
          int atIndex = lengthBeforeReplacement + replacementSize;
          //System.err.println(atIndex + "!" + cumulative);
          //addOffCorrectMap(atIndex, cumulative);
        }
      }
    }
    m.appendTail(cumulativeOutput);
    return cumulativeOutput;
  }
}
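The commented-out lines in processPattern hint at the offset bookkeeping that the borrowed code performs when a replacement is shorter than the match. As a rough sketch of that idea only, the variant below collects those (atIndex, cumulative) pairs in a list instead of calling addOffCorrectMap. The class OffsetSketch, the method corrections, and the int[] pair representation are all hypothetical and are not the actual Solr implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetSketch {
    // Each entry is {outputIndex, cumulativeShift}: from outputIndex onward,
    // output positions lag the input by cumulativeShift characters.
    static List<int[]> corrections(Pattern pattern, CharSequence input,
                                   String replacement) {
        List<int[]> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        int cumulative = 0;
        int lastMatchEnd = 0;
        while (m.find()) {
            int groupSize = m.end() - m.start();
            int lengthBefore = sb.length() + (m.start() - lastMatchEnd);
            lastMatchEnd = m.end();
            m.appendReplacement(sb, replacement);
            int replacementSize = sb.length() - lengthBefore;
            if (replacementSize < groupSize) {
                cumulative += groupSize - replacementSize;
                out.add(new int[] {lengthBefore + replacementSize, cumulative});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\w)\\1{2,}+");
        for (int[] c : corrections(p, "fooobarrrr", "$1$1")) {
            System.out.println(c[0] + "!" + c[1]);  // prints 3!1 then 7!3
        }
    }
}
```

For "fooobarrrr" => "foobarr", the first pair says that output index 3 corresponds to input index 4, and the second that output index 7 (the end of the output) corresponds to input index 10.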