RE: Using asterik(*) with unicode characters.

Preeti Bhat Thu, 29 Jun 2017 02:06:01 -0700

Thanks Erick, it's working now as expected.

Thanks and Regards,
Preeti Bhat

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, June 28, 2017 9:20 PM
To: solr-user
Subject: Re: Using asterik(*) with unicode characters.

There's a long blog post on wildcards here:
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

The gist is that when you are analyzing a token, if the analysis chain splits 
the token into more than one part then wildcards are impossible to get right. 
So any "MultiTermAware" filter will barf if it ends up emitting more than one 
token for a wildcard term. Filters that are _not_ "MultiTermAware" are simply 
skipped when a wildcard term goes through the query analysis chain.

That leaves the question of why your query chain seems to emit two tokens for  
MöllerGruppen but not MollerGruppen. I think it's because you have 
preserveOriginal set to true in the query analysis chain
here:
 <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>

So this entry emits both
MöllerGruppen and MollerGruppen
for the input
MöllerGruppen
but not for
MollerGruppen
since MollerGruppen doesn't need any folding. Emitting two tokens violates the 
constraint that comes with ASCIIFoldingFilterFactory being "MultiTermAware": 
if it emits more than one token for a wildcard term, it barfs.

You do not need to set "preserveOriginal='true' " in your _query_ chain since 
your indexing chain puts both the folded and un-folded versions in the index at 
the same position.

So I think if you set preserveOriginal to false (again, in the _query_ analysis 
chain; leave it true in the index analysis chain) you'll be OK. Your queries 
will also be somewhat faster.
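
As a rough sketch of that change, assuming everything else in the string_ci 
fieldType quoted below stays exactly as posted, the query analyzer would look 
something like this. Only the preserveOriginal value on the folding filter 
differs from the original; the index analyzer is not shown because it is left 
untouched.

```xml
<!-- Sketch only: query analyzer of the string_ci fieldType from this thread.
     The single change is preserveOriginal="false" on the ASCII folding filter.
     The index analyzer keeps preserveOriginal="true". -->
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Emit only the folded token, so a wildcard term produces a single term -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" splitOnCaseChange="0" catenateWords="1"
          splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
</analyzer>
```

With this in place, MöllerGruppen* should be folded to a single term at query 
time and still match, because the index holds both the folded and the original 
form.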

Best,
Erick

On Wed, Jun 28, 2017 at 6:25 AM, Preeti Bhat <preeti.b...@shoregrp.com> wrote:
> Hi All,
>
> I have a requirement where the user can give a Unicode or ASCII character as 
> input but expects the same result.
>
> For example: MöllerGruppen AS vs MollerGruppen AS should give the same result.
>
> I am able to get this done using <filter 
> class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>, but for 
> some reason when I try to search for MöllerGruppen* I get the message below.
>
> ""metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"analyzer returned too many terms for multiTerm term: 
> MöllerGruppen",
>     "code":400}}
> "
>
> It works for MollerGruppen* though.
>
> Could someone please advise on this.
>
> Below is the fieldtype of this field.
>
> <fieldType name="string_ci" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" splitOnCaseChange="0" catenateWords="1"
>             splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" splitOnCaseChange="0" catenateWords="1"
>             splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
>   </analyzer>
> </fieldType>
>
>
>
> Thanks and Regards,
> Preeti
>
>
>
> NOTICE TO RECIPIENTS: This communication may contain confidential and/or 
> privileged information. If you are not the intended recipient (or have 
> received this communication in error) please notify the sender and 
> it-supp...@shoregrp.com immediately, and destroy this communication. Any 
> unauthorized copying, disclosure or distribution of the material in this 
> communication is strictly forbidden. Any views or opinions presented in this 
> email are solely those of the author and do not necessarily represent those 
> of the company. Finally, the recipient should check this email and any 
> attachments for the presence of viruses. The company accepts no liability for 
> any damage caused by any virus transmitted by this email.
>
>
