climbingrose wrote:
Here is how I did it (the code is from memory, so it might not be 100% correct):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
    // Emit the buffered unaccented clone left over from the previous call.
    if (hasAccents) {
        hasAccents = false;
        return filteredToken;
    }
    Token t = input.next();
    if (t == null) {
        return null; // end of stream
    }
    String filteredText = removeAccents(t.termText());
    if (filteredText.equals(t.termText())) { // no accents
        return t;
    } else {
        // Buffer a clone with the accents stripped; it is emitted on the
        // next call at the same position as the original (increment 0).
        filteredToken = (Token) t.clone();
        filteredToken.setTermText(filteredText);
        filteredToken.setPositionIncrement(0);
        hasAccents = true;
    }
    return t;
}
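Stripped of the Lucene API, the buffering behavior can be sketched as a standalone program. Token here is a hypothetical minimal stand-in for Lucene's Token, and removeAccents is one plausible implementation using java.text.Normalizer (not necessarily what the original code used):

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferingDemo {
    // Hypothetical minimal stand-in for Lucene's Token: text plus a
    // position increment relative to the previous token.
    static class Token {
        final String text;
        final int posIncr;
        Token(String text, int posIncr) { this.text = text; this.posIncr = posIncr; }
    }

    // One way to strip accents: decompose to NFD, then drop combining marks.
    static String removeAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    // For each input word, emit the original token; if stripping accents
    // changes it, also emit the unaccented variant at the same position
    // (position increment 0), just as the buffered TokenFilter above does.
    static List<Token> filter(List<String> words) {
        List<Token> out = new ArrayList<>();
        for (String w : words) {
            out.add(new Token(w, 1));
            String stripped = removeAccents(w);
            if (!stripped.equals(w)) {
                out.add(new Token(stripped, 0)); // overlaps the original
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : filter(Arrays.asList("Pérez", "said", "hola"))) {
            System.out.println(t.text + " +" + t.posIncr);
        }
        // Prints:
        //   Pérez +1
        //   Perez +0
        //   said +1
        //   hola +1
    }
}
```

The point of the zero increment is that "Pérez" and "Perez" occupy the same position in the index, so a phrase query matches with either spelling.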
On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:
Regarding indexing words with accented and unaccented characters with
positionIncrement zero:
Chris Hostetter wrote:
you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.
Could someone expand on how to implement this technique of buffering and
cloning?
Thanks,
Phil
I was just facing the same issue and came up with the following solution.
I changed the schema.xml file so that the analyzers and filters for the text field are as follows:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
These two lines are the new ones:
<filter class="schema.UnicodeNormalizationFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
The first line invokes a custom filter that I borrowed and modified, which turns decomposed Unicode (where é is stored as an e followed by a combining accent, and can display as Pe'rez) into the composed form (Pérez). The second line replaces accented characters with their unaccented equivalents (Perez).
For the custom filter to work, you must create a lib directory as a
sibling to the conf directory and place the jar files containing the
custom filter there.
The Jars can be downloaded from the blacklight subversion repository at:
http://blacklight.rubyforge.org/svn/trunk/solr/lib/
The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and UnicodeNormalizationFilterFactory, which merely invoke the Normalizer.normalize function in the normalizer jar (which is taken from the marc4j distribution and is a subset of the icu4j library).
-Robert Haschart