Robert, does your code do something that ICU doesn't do? See http://www.icu-project.org/

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Robert Haschart <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, June 26, 2008 4:41:02 PM
> Subject: Re: UnicodeNormalizationFilterFactory
>
> Lance Norskog wrote:
>
> >ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
> >euro-text keyboard searching problem, where "protege" should find protégé.
> >("protege" with two accents.)
> >
> >-----Original Message-----
> >From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> >Sent: Tuesday, June 24, 2008 4:05 PM
> >To: solr-user@lucene.apache.org
> >Subject: Re: UnicodeNormalizationFilterFactory
> >
> >
> >: I've seen mention of these filters:
> >:
> >:
> >:
> >
> >Are you asking because you saw these in Robert Haschart's reply to your
> >previous question? I think those are custom Filters that he has in his
> >project ... not open source (but i may be wrong)
> >
> >they are certainly not something that comes out of the box w/ Solr.
> >
> >
> >-Hoss
> >
> >
> The ISOLatin1AccentFilter works well in the case described above by
> Lance Norskog, i.e. for words containing accented characters where
> each accented character is a single Unicode character for the letter
> with the accent mark, as in protégé. However, in the data that we work
> with, accented characters are often represented by a plain unaccented
> character followed by the Unicode combining character for the accent
> mark, roughly like this: prote'ge', which emerges from the
> ISOLatin1AccentFilter unchanged.
>
> After some research I found the UnicodeNormalizationFilter mentioned
> above, which did not work on my development system (because it relies
> on features only available in Java 6), and which, when combined with
> the DiacriticsFilter also mentioned above, would remove diacritics from
> characters, but would also discard any Chinese characters or Russian
> characters, or anything else outside the 0x00-0x7f range. Which is bad.
>
> I first modified the filter to normalize the characters to the composed
> normalized form (changing prote'ge' to protégé) and then pass the
> results through the ISOLatin1AccentFilter. However, for accented
> characters for which there is no composed normalized form (such as the
> n and s in Zarin̦š) the accents are not removed.
>
> So I took the approach of decomposing the accented characters, and then
> removing only the valid diacritics and zero-width composing characters
> from the result, and the resulting filter works quite well. And since it
> was developed as a part of the blacklight project at the University of
> Virginia, it is Open Source under the Apache License.
>
> If anyone is interested in evaluating or using the
> UnicodeNormalizationFilter in conjunction with their Solr installation,
> get the UnicodeNormalizeFilter.jar from:
>
> http://blacklight.rubyforge.org/svn/trunk/solr/lib/
>
> and place it in a lib directory next to the conf directory in your Solr
> home directory.
>
> Robert Haschart
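
For readers who want to see the "decompose, then strip" approach described above in concrete terms, here is a minimal sketch in plain Java. It is not the code from UnicodeNormalizeFilter.jar. The class name StripDiacriticsSketch and the method stripDiacritics are made up for illustration, and it assumes Java 6 or later, since it uses java.text.Normalizer. It also drops every nonspacing mark (Unicode category Mn), which is probably broader than the "valid diacritics" the message refers to. For example, it would also turn the Russian й into и.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the blacklight filter itself.
class StripDiacriticsSketch {

    // Decompose to NFD, then drop combining marks of category Mn.
    // Characters outside ASCII (Chinese, Cyrillic, ...) are kept.
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            // Assumption: treat every nonspacing mark as a removable diacritic.
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Precomposed accents (single code point per accented letter).
        System.out.println(stripDiacritics("prot\u00e9g\u00e9"));
        // Plain letters followed by combining acute accents.
        System.out.println(stripDiacritics("prote\u0301ge\u0301"));
        // n + combining comma below has no precomposed form.
        System.out.println(stripDiacritics("Zarin\u0326\u0161"));
    }
}
```

Both spellings of protégé come out as "protege", and the Zarin̦š example comes out as "Zarins". A real Solr filter would apply the same transformation to each token's text inside a Lucene TokenFilter. That wiring is left out here.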