Autosuggest and highlighting

gwk Tue, 09 Feb 2010 05:40:10 -0800

Hi,

I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties around the world (mostly Europe), and the search box is intended to be filled with a country, region or city name. To do this I've created a separate, simple core with one document per geographic location. For example, the document for the country "France" contains several fields, including the number of properties (so we can show the approximate number of results in the autosuggest box), the name of the country in several languages, and some other bookkeeping information. The name of the location is stored in two fields: "name", which simply contains the canonical name of the country, region or city, and "names", which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis so the query "Fr" can match "France".
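To make the matching idea concrete, here is a minimal Python sketch of what edge n-gramming does at index time. It is not Solr code: `edge_ngrams` is a hypothetical helper, and the rest of the analysis chain is reduced to a plain lowercase step for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of a token, shortest first."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Simplified stand-in for the index-time chain: lowercase, then edge n-grams.
indexed_terms = set(edge_ngrams("France".lower()))

# The query side is not n-grammed, so "Fr" is only lowercased to "fr".
query_term = "Fr".lower()

print(sorted(indexed_terms, key=len))  # ['f', 'fr', 'fra', 'fran', 'franc', 'france']
print(query_term in indexed_terms)     # True
```

Because every prefix of the indexed word becomes its own term, a short query term can match it exactly without any wildcard.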

This all seems to work: the autosuggest box gives appropriate suggestions. But when I turn on highlighting the results are less than desirable. For example, the query "rho" using dismax (and hl.snippets=5) returns the following:


<lst name="5119">
<arr name="names">
<str><em>Rég</em>ion Rhône-Alpes</str>
<str><em>Rhô</em>ne-Alpes</str>
<str><em>Rhô</em>ne-Alpes</str>
<str><em>Rhô</em>ne-Alpes</str>
<str><em>Rhô</em>ne-Alpes</str>
</arr>
<arr name="name">
<str><em>Rég</em>ion Rhône-Alpes</str>
</arr>
</lst>
<lst name="5440">
<arr name="names">
<str><em>Dép</em>artement du Rhône</str>
<str><em>Dép</em>artement du Rhône</str>
<str><em>Rhô</em>ne</str>
<str><em>Dép</em>artement du Rhône</str>
<str><em>Rhô</em>ne</str>
</arr>
<arr name="name">
<str><em>Dép</em>artement du Rhône</str>
</arr>
</lst>

As you can see, no matter where the match is, the first 3 characters are highlighted, which is obviously not correct for many of the values. Is this because of the EdgeNGramFilterFactory, or am I doing something wrong?
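To show the output I was expecting, here is a small Python sketch that highlights the matching prefix at the position of the word that actually matched. It is only an illustration of the desired result, not how Solr's highlighter works; `fold_char` and `highlight_prefix` are made-up names, and the accent folding is a rough approximation of LowerCaseFilter plus ASCIIFoldingFilter.

```python
import re
import unicodedata


def fold_char(ch):
    """Lowercase one character and strip its accents, keeping length 1."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower() if len(base) == 1 else ch


def highlight_prefix(value, query, pre="<em>", post="</em>"):
    """Wrap the first word-prefix match of query in value with pre/post."""
    # Folding is one character in, one character out, so offsets found in
    # the folded text can be applied directly to the original value.
    folded = "".join(fold_char(ch) for ch in value)
    needle = "".join(fold_char(ch) for ch in query)
    for match in re.finditer(r"\w+", folded):
        if match.group().startswith(needle):
            start = match.start()
            end = start + len(needle)
            return value[:start] + pre + value[start:end] + post + value[end:]
    return value  # no word starts with the query: leave the value untouched


print(highlight_prefix("Région Rhône-Alpes", "rho"))
# Région <em>Rhô</em>ne-Alpes
print(highlight_prefix("Département du Rhône", "rho"))
# Département du <em>Rhô</em>ne
```

In other words, the highlight should start at the offset of the word that matched, not at the start of the field value.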


The field definition for 'name' and 'names' is:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>

<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>

</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>


Regards,

gwk
