Re: search ignoring accents

Erick Erickson Fri, 17 Apr 2015 10:03:04 -0700

Pedro:

For your example, don't use EdgeNGrams; use plain NGrams. That'll index
tokens like (in the 2-gram case) pe ed dr ro, and a search for edr would
look for "ed dr", which would match.
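
As a rough illustration of the 2-gram case, here is a minimal Python sketch.
The `ngrams` helper is hypothetical: it only mimics the tokens an n-gram
filter would emit for a single lowercased word, and it is not Solr code.

```python
def ngrams(text, n=2):
    """Return every contiguous n-character slice of text, in order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


indexed = ngrams("Pedro")  # ['pe', 'ed', 'dr', 'ro']
query = ngrams("edr")      # ['ed', 'dr']

# Both query grams appear among the indexed grams, next to each other.
print(indexed, query)
```

Unlike edge n-grams, which only start at the beginning of the token, these
grams start at every position, so a fragment from the middle of a name can
still be found.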


However, this isn't in line with your first example, where you got
results you didn't expect. You'll have to be careful to search for these
pairwise tokens as _phrases_ to prevent false matches.
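
To see why the phrase requirement matters, consider this small Python sketch.
The helper names and the sample word "dredge" are assumptions chosen for
illustration; the functions only approximate the difference between matching
the grams anywhere and matching them at consecutive positions.

```python
def bigrams(text):
    """Return the 2-character grams of text, in order."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches_any_order(query, value):
    """True if every query gram occurs somewhere in value (non-phrase match)."""
    indexed = set(bigrams(value))
    return all(gram in indexed for gram in bigrams(query))


def matches_as_phrase(query, value):
    """True if the query grams occur at consecutive positions in value."""
    q, v = bigrams(query), bigrams(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


print(matches_any_order("edr", "dredge"))  # True: a false match
print(matches_as_phrase("edr", "dredge"))  # False: "ed" and "dr" are not adjacent
print(matches_as_phrase("edr", "Pedro"))   # True: "ed dr" appears in order
```

"dredge" contains both "ed" and "dr" but never the substring "edr", so only
the phrase-style check rejects it.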

Best,
Erick

On Fri, Apr 17, 2015 at 4:50 AM, Pedro Figueiredo
<pjlfigueir...@criticalsoftware.com> wrote:
> And for this example what filter should I use?
>
> Filtering by "edr" should return "Pedro".
> Does the NGram filter create tokens starting at the beginning, at the end,
> and in the middle of the word?
>
> Thanks!
>
> Pedro Figueiredo
> Senior Engineer
>
> pjlfigueir...@criticalsoftware.com
> M. 934058150
>
>
> Rua Engº Frederico Ulrich, nº 2650 4470-605 Moreira da Maia, Portugal
> T. +351 229 446 927 | F. +351 229 446 929
> www.criticalsoftware.com
>
> PORTUGAL | UK | GERMANY | USA | BRAZIL | MOZAMBIQUE | ANGOLA
> A CMMI® LEVEL 5 RATED COMPANY CMMI® is registered in the USPTO by CMU"
>
>
>
> -----Original Message-----
> From: Pedro Figueiredo [mailto:pjlfigueir...@criticalsoftware.com]
> Sent: 17 April 2015 12:22
> To: solr-user@lucene.apache.org; 'Ahmet Arslan'
> Subject: RE: search ignoring accents
>
> Hi Ahmet,
>
> Yes, the EdgeNGram filter is what produces those results.
> I need it to improve name search for the application's users.
>
> Thanks.
>
> Pedro Figueiredo
> Senior Engineer
>
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> Sent: 17 April 2015 12:01
> To: solr-user@lucene.apache.org
> Subject: Re: search ignoring accents
>
> Hi Pedro,
>
> solr.ASCIIFoldingFilterFactory is one way to remove diacritics.
> The confusion comes from EdgeNGram; why do you need it?
>
> Ahmet
>
>
>
> On Friday, April 17, 2015 1:38 PM, Pedro Figueiredo 
> <pjlfigueir...@criticalsoftware.com> wrote:
>
>
>
> Hello,
>
> What is the best way to search in a field ignoring accents?
>
> The field has the type:
> <fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
>   </analyzer>
> </fieldType>
>
> I’ve tried adding the filter <filter class="solr.ASCIIFoldingFilterFactory"/>,
> but I got some strange results. For example:
>
> A search for “Mourao” returned:
> Mourão -> OK
> Monteiro -> NOTOK
> Morais -> NOTOK
>
> Thanks in advance,
>
> Pedro Figueiredo
> Senior Engineer
>
