climbingrose wrote:

Here is how I did it (the code is from memory, so it may not be 100%
correct):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
 if (hasAccents) {
   hasAccents = false;
   return filteredToken;
 }
 Token t = input.next();
 if (t == null) { // end of the token stream
   return null;
 }
 String filteredText = removeAccents(t.termText());
 if (filteredText.equals(t.termText())) { // no accents
   return t;
 } else {
   filteredToken = (Token) t.clone();
   filteredToken.setTermText(filteredText);
   filteredToken.setPositionIncrement(0);
   hasAccents = true;
 }
 return t;
}
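The filter above relies on a removeAccents helper that isn't shown. A minimal sketch of such a helper using the JDK's java.text.Normalizer (available since Java 6): decompose to NFD so accents become separate combining characters, then strip the combining marks. The class name AccentUtil is made up for illustration:

```java
import java.text.Normalizer;

public class AccentUtil {
    // Decompose to NFD ("é" -> 'e' + combining acute), then drop the
    // combining diacritical marks, leaving the plain base letters.
    public static String removeAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("Pérez")); // prints Perez
    }
}
```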

On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:

Regarding indexing words with accented and unaccented characters with
positionIncrement zero:

Chris Hostetter wrote:

you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.


Could someone expand on how to implement this technique of buffering and
cloning?

Thanks,

Phil


I was just facing the same issue and came up with the following solution.

I changed the schema.xml file so that the analyzers and filters for the text field are as follows:


  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="schema.UnicodeNormalizationFilterFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="schema.UnicodeNormalizationFilterFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>

These two lines are the new ones:
       <filter class="schema.UnicodeNormalizationFilterFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>

The first line invokes a custom filter that I borrowed and modified, which turns decomposed Unicode (like Pe'rez, where the accent is a separate combining character) into the composed form (Pérez). The second line then replaces accented characters with their unaccented equivalents (Perez).
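As a sketch of what that first normalization step does (the custom filter here wraps the marc4j normalizer, but the JDK's java.text.Normalizer performs the same NFC composition):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // "é" written as 'e' + U+0301 combining acute (decomposed form)
        String decomposed = "Pe\u0301rez"; // 6 chars
        // NFC composes the pair into the single code point U+00E9
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.length()); // prints 5
    }
}
```

Only after this composition step can a simple accent-stripping filter like ISOLatin1AccentFilter reliably map the character to its unaccented equivalent.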

For the custom filter to work, you must create a lib directory as a sibling to the conf directory and place the jar files containing the custom filter there.
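The expected layout looks like this (the Solr home path is an assumption; only the lib/-next-to-conf/ relationship matters):

```shell
# Illustrative Solr home layout: lib/ must be a sibling of conf/
SOLR_HOME=$(mktemp -d)
mkdir -p "$SOLR_HOME/conf" "$SOLR_HOME/lib"
# copy the downloaded jars into lib/, e.g.:
# cp SolrPlugin.jar normalizer.jar "$SOLR_HOME/lib/"
ls "$SOLR_HOME"
```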

The Jars can be downloaded from the blacklight subversion repository at:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and UnicodeNormalizationFilterFactory, which merely invoke the Normalizer.normalize function in the normalizer jar (taken from the marc4j distribution, and itself a subset of the icu4j library).
-Robert Haschart
