Iso accents and wildcards

2009-10-30 Thread Nicolas Leconte

Hi all,

I have a field that contains accented characters, and I want to be able 
to search it ignoring accents.

I have set up that field with:



generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />


words="stopwords.txt" />







In the index the word "économie" is translated to "econom": the accent 
is removed by the ISOLatin1AccentFilterFactory and the ending of the 
word is removed by the SnowballPorterFilterFactory.


When I query with title:econ* I get the correct results, but if I 
query with title:écon* I get no results.
If I query with title:économ (the exact word of the index) it works, 
so there might be something wrong with the wildcard.
As far as I understand, the analyzer should be applied in exactly the 
same way at both index and query time.


I have tried changing the order of the filters (putting the 
ISOLatin1AccentFilterFactory first), without any effect.


Could anybody help me with that and point out what may be wrong with my 
schema?


Re: Iso accents and wildcards

2009-11-01 Thread Nicolas Leconte
Thanks for the explanation, now I clearly understand why it doesn't work 
as I was expecting :)


jfmel...@free.fr wrote:

If the query contains any wildcard, then the filters are not called:
no ISOLatin1AccentFilterFactory and no SnowballPorterFilterFactory!

"économie" is indexed as "econom".

Solr does not find:
 - terms starting with "éco" (éco*)
 - terms starting with "economi" (economi*)

If you index manger, mangé and mangue, the indexed terms will be mang and mangu.

query    ->   results

manger   ->   manger, mangé
mangé    ->   manger, mangé
mang     ->   manger, mangé
mangu    ->   mangue
mang*    ->   manger, mangé, mangue
mang?    ->   mangue  (and not mangé)
mangé*   ->   nothing

Jean-François
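
The behaviour described above can be reproduced with a small Python model. This is only a sketch under stated assumptions: fold_accents approximates accent removal with Unicode decomposition, and toy_stem is a hypothetical stand-in for the Snowball stemmer that strips just enough endings to reproduce the terms quoted in this thread. None of these function names come from Solr.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks (rough approximation of an accent-folding filter)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Hypothetical toy stemmer, NOT Snowball: drops a few endings only."""
    for suffix in ("ie", "er", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(token):
    """Index-time analysis chain: lowercase, fold accents, then stem."""
    return toy_stem(fold_accents(token.lower()))


def build_index(words):
    """Map each analyzed term to the set of original words that produced it."""
    index = {}
    for word in words:
        index.setdefault(analyze(word), set()).add(word)
    return index


def term_query(index, text):
    """Plain term query: the query text goes through the same analyzer."""
    return set(index.get(analyze(text), set()))


def prefix_query(index, prefix):
    """Wildcard (prefix) query: the prefix is compared raw, with no analysis."""
    hits = set()
    for term, words in index.items():
        if term.startswith(prefix):
            hits |= words
    return hits


if __name__ == "__main__":
    idx = build_index(["manger", "mangé", "mangue", "économie"])
    print(sorted(idx))                 # the indexed terms
    print(term_query(idx, "mangé"))    # analyzed, so it matches
    print(prefix_query(idx, "mangé"))  # raw prefix, so it matches nothing
    print(prefix_query(idx, "écon"))   # raw prefix with accent, no match
    print(prefix_query(idx, "econ"))   # unaccented prefix matches
```

In this model, folding the accent on the client before sending the wildcard helps for écon* (it becomes econ*), but not for mangé*: the folded prefix mange still does not match the stemmed term mang.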



Re: Iso accents and wildcards

2009-11-01 Thread Nicolas Leconte

Thanks for the tips, I will try to do exactly what you suggest.

Avlesh Singh wrote:

| When I query with title:econ* I get the correct results, but if I
| query with title:écon* I get no results.
| If I query with title:économ (the exact word of the index) it works, so
| there might be something wrong with the wildcard.
| As far as I understand, the analyzer should be applied in exactly the
| same way at both index and query time.



Wildcard queries are not analyzed and hence the "inconsistent" behaviour.
The easiest way out is to define one more field "title_orginal" as an
untokenized field. While querying, you can use both the fields at the same
time. e.g. q=(title:écon* title_orginal:écon*). In any case, you would get
desired matches.

Cheers
Avlesh
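
The two-field query suggested above can be assembled on the client side. The following Python sketch only builds the q parameter; the function names are hypothetical, the field names are the ones used in this thread, and it assumes the prefix contains no characters that need escaping in the query syntax.

```python
from urllib.parse import parse_qs, urlencode


def two_field_wildcard_query(prefix, analyzed_field="title",
                             raw_field="title_orginal"):
    """Build a query that runs the same wildcard against both fields.

    Assumption: `prefix` is plain text with no query-syntax special
    characters; no escaping is done here.
    """
    clause = f"{prefix}*"
    return f"({analyzed_field}:{clause} {raw_field}:{clause})"


def select_params(prefix):
    """URL-encode the query as the q parameter (UTF-8 percent-encoding)."""
    return urlencode({"q": two_field_wildcard_query(prefix)})


if __name__ == "__main__":
    encoded = select_params("écon")
    print(encoded)                   # encoded form to append to the request URL
    print(parse_qs(encoded)["q"][0])  # decodes back to the readable query
```

The untokenized field keeps the accent, so the raw prefix écon can match there, while an unaccented prefix such as econ still matches the analyzed title field.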
