Re: Problem with solr suggester in case of non-ASCII characters

Furkan KAMACI Tue, 30 Jul 2019 07:17:49 -0700

Hi Roland,

Could you check the Analysis screen (
https://lucene.apache.org/solr/guide/8_1/analysis-screen.html) and tell us how
the term is analyzed at both query and index time?
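
If the Admin UI is not handy, the same comparison should be possible through
Solr's field analysis request handler. Below is a minimal sketch that only
builds the request URL. It assumes a core named "books" on localhost:8983
(both are placeholders, not taken from your setup) and that the implicit
/analysis/field handler is available on your installation:

```python
from urllib.parse import urlencode


def analysis_url(base_url, core, field_type, index_value, query_value):
    """Build a field-analysis URL for one field type.

    base_url and core are placeholders for your own installation.
    """
    params = {
        "analysis.fieldtype": field_type,    # e.g. short_text_hu
        "analysis.fieldvalue": index_value,  # text run through the index analyzer
        "analysis.query": query_value,       # text run through the query analyzer
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # "books" is a hypothetical core name; replace it with your own.
    print(analysis_url("http://localhost:8983/solr", "books",
                       "short_text_hu", "Eötvös József", "Jó"))
```

Fetching the printed URL with a browser or curl should return the token stream
for both analyzers, so you can compare the output for "Jó" and "Józ".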


Kind Regards,
Furkan KAMACI

On Tue, Jul 30, 2019 at 4:50 PM Szűcs Roland <szucs.rol...@bookandwalk.hu>
wrote:

> Hi All,
>
> I have an author suggester (searchcomponent and the related request
> handler) defined in solrconfig:
> <searchComponent name="suggest" class="solr.SuggestComponent">
>     <!-- All suggester components must have different file paths to avoid
>     write lock issues -->
>     <lst name="suggester">
>       <str name="name">author</str>
>       <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>       <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>       <str name="field">BOOK_productAuthor</str>
>       <str name="suggestAnalyzerFieldType">short_text_hu</str>
>       <str name="indexPath">suggester_infix_author</str>
>       <str name="buildOnStartup">false</str>
>       <str name="buildOnCommit">false</str>
>       <str name="minPrefixChars">2</str>
>     </lst>
> </searchComponent>
>
> <requestHandler name="/suggesthandler" class="solr.SearchHandler"
> startup="lazy" >
> <lst name="defaults">
>   <str name="suggest">true</str>
>   <str name="suggest.count">10</str>
>   <str name="suggest.dictionary">author</str>
> </lst>
> <arr name="components">
>   <str>suggest</str>
> </arr>
> </requestHandler>
>
> The author field gets only minimal text processing at query and index time,
> based on the following definition:
> <fieldType name="short_text_hu" class="solr.TextField"
> positionIncrementGap="100" multiValued="true">
>     <analyzer type="index">
>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>       <tokenizer class="solr.ClassicTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.ClassicTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> docValues="true"/>
>   <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
> docValues="true" multiValued="true"/>
>   <fieldType name="text_ar" class="solr.TextField"
> positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"
> ignoreCase="true"/>
>       <filter class="solr.ArabicNormalizationFilterFactory"/>
>       <filter class="solr.ArabicStemFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> When I use queries with only ASCII characters, the results are correct:
> "Al":{
> "term":"<b>Al</b>exandre Dumas", "weight":0, "payload":""}
>
> When I try it with a Hungarian author name containing a special character:
> "Jó":"author":{
> "Jó":{ "numFound":0, "suggestions":[]}}
>
> When I try it with three letters, it works again:
> "Józ":"author":{
> "Józ":{ "numFound":10, "suggestions":[
>   { "term":"Bajza <b>Józ</b>sef", "weight":0, "payload":""},
>   { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>   { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>   { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>   { "term":"<b>Józ</b>sef Attila", "weight":0, "payload":""}..
>
> Any idea how it can happen that a longer string has more matches than a
> shorter one? It is inconsistent. What can I do to fix it, as it would
> result in a poor customer experience?
> Users would feel that sometimes they need 2 and sometimes 3 characters to get
> suggestions.
>
> Thanks in advance,
> Roland
>
