climbingrose wrote:
Here is how I did it (the code is from memory, so it might not be 100% correct):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
    // Emit the buffered unaccented clone left over from the previous call.
    if (hasAccents) {
        hasAccents = false;
        return filteredToken;
    }
    Token t = input.next();
    if (t == null) {
        return null; // end of stream
    }
    String filteredText = removeAccents(t.termText());
    if (filteredText.equals(t.termText())) { // no accents
        return t;
    } else {
        // Buffer a clone with the accents stripped; it is emitted on the
        // next call at the same position as the original (increment 0).
        filteredToken = (Token) t.clone();
        filteredToken.setTermText(filteredText);
        filteredToken.setPositionIncrement(0);
        hasAccents = true;
    }
    return t;
}
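Stripped of the Lucene API, the buffering behavior can be sketched as a standalone program. Token here is a hypothetical minimal stand-in for Lucene's Token, and removeAccents is one plausible implementation using java.text.Normalizer (not necessarily what the original code used):

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferingDemo {
    // Hypothetical minimal stand-in for Lucene's Token: text plus a
    // position increment relative to the previous token.
    static class Token {
        final String text;
        final int posIncr;
        Token(String text, int posIncr) { this.text = text; this.posIncr = posIncr; }
    }

    // One way to strip accents: decompose to NFD, then drop combining marks.
    static String removeAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    // For each input word, emit the original token; if stripping accents
    // changes it, also emit the unaccented variant at the same position
    // (position increment 0), just as the buffered TokenFilter above does.
    static List<Token> filter(List<String> words) {
        List<Token> out = new ArrayList<>();
        for (String w : words) {
            out.add(new Token(w, 1));
            String stripped = removeAccents(w);
            if (!stripped.equals(w)) {
                out.add(new Token(stripped, 0)); // overlaps the original
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : filter(Arrays.asList("Pérez", "said", "hola"))) {
            System.out.println(t.text + " +" + t.posIncr);
        }
        // Prints:
        //   Pérez +1
        //   Perez +0
        //   said +1
        //   hola +1
    }
}
```

The point of the zero increment is that "Pérez" and "Perez" occupy the same position in the index, so a phrase query matches with either spelling.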
On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:
Regarding indexing words with accented and unaccented characters with
positionIncrement zero:
Chris Hostetter wrote:
you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.
Could someone expand on how to implement this technique of buffering and
cloning?
Thanks,
Phil
I was just facing the same issue and came up with the following solution.
I changed the schema.xml file so that the analyzers and filters for the text field are as follows:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
These two lines are the new ones:
<filter class="schema.UnicodeNormalizationFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
The first line invokes a custom filter that I borrowed and modified, which turns decomposed Unicode (where é is stored as an e followed by a combining accent, and can display as Pe'rez) into the composed form (Pérez). The second line replaces accented characters with their unaccented equivalents (Perez).
For the custom filter to work, you must create a lib directory as a
sibling to the conf directory and place the jar files containing the
custom filter there.
The Jars can be downloaded from the blacklight subversion repository at:
http://blacklight.rubyforge.org/svn/trunk/solr/lib/
The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and UnicodeNormalizationFilterFactory, which merely invoke the Normalizer.normalize function in the normalizer jar (which is taken from the marc4j distribution and is a subset of the icu4j library).
-Robert Haschart