Hi,

I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties around the world (mostly in Europe), and the search box is meant to be filled with the name of a country, region or city. To do this I've created a separate, simple core with one document per geographic location. For example, the document for the country "France" contains the number of properties (so we can show the approximate number of results in the autosuggest box), the name of the country in several languages, and some other bookkeeping information. The name of the location is stored in two fields: "name", which simply contains the canonical name of the country, region or city, and "names", which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis, so the query "Fr" can match "France".
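
To make the matching idea concrete, here is a minimal Python sketch of what I understand edge n-gramming to do. It is only an illustration, not Solr's code; the function name edge_ngrams is made up, and the defaults mirror the minGramSize and maxGramSize values in my field definition.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of a token, shortest first.

    Illustrative stand-in for an edge n-gram filter, not Solr's implementation.
    """
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Grams that would be indexed for the lowercased token "france".
indexed = edge_ngrams("france")
print(indexed)          # ['f', 'fr', 'fra', 'fran', 'franc', 'france']

# The lowercased query "fr" is one of the indexed grams, so it matches.
print("fr" in indexed)  # True
```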

This all seems to work: the autosuggest box gives appropriate suggestions. But when I turn on highlighting, the results are less than desirable. For example, the query "rho" using dismax (and hl.snippets=5) returns the following:

<lst name="5119">
  <arr name="names">
    <str><em>Rég</em>ion Rhône-Alpes</str>
    <str><em>Rhô</em>ne-Alpes</str>
    <str><em>Rhô</em>ne-Alpes</str>
    <str><em>Rhô</em>ne-Alpes</str>
    <str><em>Rhô</em>ne-Alpes</str>
  </arr>
  <arr name="name">
    <str><em>Rég</em>ion Rhône-Alpes</str>
  </arr>
</lst>
<lst name="5440">
  <arr name="names">
    <str><em>Dép</em>artement du Rhône</str>
    <str><em>Dép</em>artement du Rhône</str>
    <str><em>Rhô</em>ne</str>
    <str><em>Dép</em>artement du Rhône</str>
    <str><em>Rhô</em>ne</str>
  </arr>
  <arr name="name">
    <str><em>Dép</em>artement du Rhône</str>
  </arr>
</lst>

As you can see, no matter where the match is, the first 3 characters of the field value are highlighted, which is obviously wrong for many of the values. Is this because of the EdgeNGramFilterFactory, or am I doing something wrong?
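
To show what I suspect is going on (a guess that I have not verified against the Solr or Lucene source), here is a small Python sketch. It assumes each gram carries offsets relative to the start of its own token instead of the start of the field value. The helpers fold, gram_offsets and highlight are made up for the illustration, and fold is only a rough stand-in for the ASCII folding and lowercasing in my analyzer.

```python
import unicodedata


def fold(text):
    """Rough stand-in for ASCII folding plus lowercasing (not Solr's code)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def gram_offsets(value, query, token_relative):
    """Offsets of the first whitespace token whose folded form starts with query.

    token_relative=True models the suspected behaviour: offsets are counted
    from the start of the token. Assumes single spaces between tokens and
    that folding does not change the length of the matched prefix.
    """
    pos = 0
    for token in value.split(" "):
        if fold(token).startswith(query):
            base = 0 if token_relative else pos
            return base, base + len(query)
        pos += len(token) + 1  # skip the token and the following space
    return None


def highlight(value, start, end):
    """Wrap value[start:end] in <em> tags."""
    return value[:start] + "<em>" + value[start:end] + "</em>" + value[end:]


value = "Département du Rhône"

# Suspected behaviour: reproduces what I see in the response.
print(highlight(value, *gram_offsets(value, "rho", token_relative=True)))
# <em>Dép</em>artement du Rhône

# What I would expect instead.
print(highlight(value, *gram_offsets(value, "rho", token_relative=False)))
# Département du <em>Rhô</em>ne
```

If this guess is right, the match itself is found correctly and only the offsets passed to the highlighter are off.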

The field definition for 'name' and 'names' is:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
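
For completeness, this is roughly how the request is put together, as a simplified Python sketch. Only the dismax parser and hl.snippets=5 are the settings mentioned above; the host, the core name "locations", and the qf and hl.fl values are placeholders I filled in for the example, and build_suggest_url is a made-up helper.

```python
from urllib.parse import urlencode


def build_suggest_url(base_url, user_input):
    """Build a select URL for the autosuggest query (illustrative only)."""
    params = [
        ("q", user_input),
        ("defType", "dismax"),     # dismax query parser
        ("qf", "name names"),      # placeholder: fields searched
        ("hl", "true"),            # turn highlighting on
        ("hl.fl", "name,names"),   # placeholder: fields highlighted
        ("hl.snippets", "5"),      # up to 5 snippets per field
    ]
    return base_url + "?" + urlencode(params)


# Placeholder host and core name.
print(build_suggest_url("http://localhost:8983/solr/locations/select", "rho"))
```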


Regards,

gwk
