Re: camel-casing and dismax troubles

Yonik Seeley Wed, 13 May 2009 06:24:11 -0700

On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
<ge...@modperlcookbook.org> wrote:
> hi all :)
>
> I'm having trouble with camel-cased query strings and the dismax handler.
>
> a user query
>
>  LeAnn Rimes
>
> isn't matching the indexed term
>
>  Leann Rimes


This is the camel-case case that can't currently be handled by a
single WordDelimiterFilter.

If the indexed doc had LeAnn, then it would be indexed as
"le","ann"/"leann" and hence queries of both forms "le ann" and
"leann" would match.

However, since the indexed term is simply "leann", a
WordDelimiterFilter configured to split won't match (a search for
"LeAnn" will be translated into a search for "le" "ann").

One way to work around this now is to do a copyField into another
field that catenates split terms in the query analyzer instead of
generating/splitting, and then search across both fields.
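
A rough sketch of that workaround, not a tested config: the field type and
field names ending in "-cat" are made up for illustration, and the source
field is assumed to be named search-en as in the parsed query. The single
analyzer applies at both index and query time and only catenates, so "LeAnn"
should come out as the one term "leann". Check it on the analysis page before
relying on it.

```xml
<!-- Hypothetical field type: catenate subwords instead of splitting them. -->
<fieldType name="search-en-cat" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <!-- No generated parts and no preserved original, so each input
         token yields at most one term per position. -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="0"
                                                    generateWordParts="0"
                                                    generateNumberParts="0"
                                                    catenateWords="1"
                                                    catenateNumbers="1"
                                                    catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical destination field, filled from the existing field. -->
<field name="search-en-cat" type="search-en-cat" indexed="true" stored="false"/>
<copyField source="search-en" dest="search-en-cat"/>
```

With dismax, both fields would then be listed in qf (for example
qf=search-en search-en-cat), so a match in either field counts.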

BTW, your parsed query below shows you turned on both catenation and
generation (or perhaps preserveOriginal) for split subwords in your
query analyzer.  Unfortunately this configuration doesn't work due to
the ambiguity of what it means to have multiple terms at the same
position (this is the same problem for multi-word synonyms at query
time).  The query shown below looks for "leann" or "le" followed by
"ann" and hence an indexed term of "leann" won't match.

-Yonik
http://www.lucidimagination.com

> even though both are lower-cased in the end.  furthermore, the
> analysis tool shows a match.
>
> the debug query looks like
>
>  "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le)
> ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",
>
> I have a feeling it's due to how the broken up tokens are added back
> into the token stream with PreserveOriginal, and some strange
> interaction between that order and dismax, but I'm not entirely sure.
>
> configs follow.  thoughts appreciated.
>
> --Geoff
>
>  <fieldType name="search-en" class="solr.TextField"
> positionIncrementGap="100">
>    <analyzer type="index">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.ISOLatin1AccentFilterFactory" />
>      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>                                                      generateWordParts="1"
>                                                      generateNumberParts="1"
>                                                      catenateWords="1"
>                                                      catenateNumbers="1"
>                                                      catenateAll="1"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="false"
> words="stopwords-en.txt"/>
>    </analyzer>
>
>    <analyzer type="query">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.ISOLatin1AccentFilterFactory" />
>      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>                                                      generateWordParts="1"
>                                                      generateNumberParts="1"
>                                                      catenateWords="0"
>                                                      catenateNumbers="0"
>                                                      catenateAll="0"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="false"
> words="stopwords-en.txt"/>
>    </analyzer>
>  </fieldType>
>
