Hi Alejandro,

Solr is Unicode aware.  The ISOLatin1AccentFilterFactory handles diacritics for
the ISO Latin-1 section of the Unicode character set.  The UTF encodings (do you
mean UTF-8?) are serializations of Unicode, and once Solr has decoded the input,
it is just Unicode characters (held in Java's in-memory UTF-16 representation).
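
To illustrate the point about serializations, here is a minimal sketch in
Python (not Solr code; the variable names are made up for this example). It
encodes the phrase from your message in two different UTFs and shows that
both decode to the same sequence of Unicode characters:

```python
# The sample phrase, written with escapes so the source file's own
# encoding cannot affect the result (U+00F1 is n-tilde, U+00E9 is e-acute).
text = "Ni\u00f1o en M\u00e9xico"

# Two different serializations of the same characters.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding either byte sequence yields the same Unicode string.
decoded_from_utf8 = utf8_bytes.decode("utf-8")
decoded_from_utf16 = utf16_bytes.decode("utf-16-le")

# The code points are what an accent filter sees, not the original bytes.
code_points = [f"U+{ord(ch):04X}" for ch in decoded_from_utf8]

print(len(utf8_bytes), len(utf16_bytes))
print(decoded_from_utf8 == decoded_from_utf16)
print(code_points)
```

The byte lengths differ, but the decoded strings compare equal, which is why
the encoding of the incoming document does not matter to the filter.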

So as long as you're only concerned about removing diacritics from the set of 
Unicode characters that overlaps ISO Latin-1, and not about other Unicode 
characters, then ISOLatin1AccentFilterFactory should work for you.
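
As a rough illustration of what "only the Latin-1 range" means in practice,
here is a small Python sketch. It is not the filter's actual implementation
or mapping table (the real filter also handles some cases, such as ligatures,
that this sketch leaves alone); the function name and the range constant are
assumptions made for the example. It strips combining marks only from
characters in U+00C0 through U+00FF and leaves everything else untouched:

```python
import unicodedata

# Assumption for this sketch: the accented letters of ISO Latin-1 live in
# the range U+00C0..U+00FF (the upper half of the Latin-1 Supplement block).
LATIN1_ACCENTED = range(0x00C0, 0x0100)


def fold_latin1_accents(text: str) -> str:
    """Strip diacritics from Latin-1 characters only; pass others through."""
    out = []
    for ch in text:
        if ord(ch) in LATIN1_ACCENTED:
            # Decompose into base letter + combining marks, then drop the marks.
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ch)
    return "".join(out)


print(fold_latin1_accents("Ni\u00f1o en M\u00e9xico"))  # prints "Nino en Mexico"

# U+0151 (o with double acute) is outside Latin-1, so it is left as-is.
print(fold_latin1_accents("\u0151") == "\u0151")  # prints "True"
```

With lowercasing applied as well, both "Niño en México" and "nino en mexico"
reduce to the same terms, while a character such as U+0151 would still need a
different folding step.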

Steve

On 08/11/2008 at 7:22 PM, Alejandro Garza Gonzalez wrote:
> I have utf-8 content that I want to index, however I want searches
> without diacritics to return results.
> 
> For example, a search for the words "nino en mexico" should return
> results like a document with the phrase "Niño en México".
> 
> Ideally, exact diacritic matches should score higher (searching for
> "niño" exactly should make a document with "niño" score higher than a
> document with "nino")
> 
> Any pointers on how to do this? I found out about the
> solr.ISOLatin1AccentFilterFactory but it seems to only strip
> diacritics from iso-latin characters. How about UTF diacritics?
> 
> -- 
> Ing. Alejandro Garza González
> Director, Tecnología e Innovación, Biblioteca Tecnológico de Monterrey,
> Campus Monterrey
> 
> Tel.: 52(81) 8358-1400 ext. 4037 Fax: 52(81) 8328-4067
> Enlace Intercampus: 80 689 4037
> http://biblioteca.mty.itesm.mx

 
