Re: DisMax and WordDelimiterFilterFactory

Erick Erickson Thu, 27 Oct 2011 05:21:03 -0700

What happens if you change the WordDelimiterFilterFactory (WDF) definition
in the query part of your analysis chain to NOT split on case change? Your
index would still contain the right fragments (and combined words), and your
queries would match.
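
Something like this for the query-side filter, as an untested sketch (all
other attributes are assumed to stay as in your posted config):

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>

With that, "exaMple" should stay a single token and be case-folded by the
ICUFoldingFilterFactory further down the chain, but a query for "WiFi"
would no longer be split into "wi fi" at query time.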


I admit I haven't thought this through entirely, but I think this would
work for your example. Unfortunately, I suspect it would break other
cases... I suspect you're in a "lesser of two evils" situation.

But I can't imagine a 100% solution here. You're effectively asking to
compensate for any fat-fingered thing a user does. Impossible, I think...

Best
Erick

On Tue, Oct 25, 2011 at 1:13 PM, Demian Katz <demian.k...@villanova.edu> wrote:
> I've seen a couple of threads related to this subject (for example, 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I 
> haven't found an answer that addresses the aspect of the problem that 
> concerns me...
>
> I have a field type set up like this:
>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.ICUTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.ICUFoldingFilterFactory"/>
>        <filter class="solr.KeywordMarkerFilterFactory"
>                protected="protwords.txt"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.ICUTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                ignoreCase="true" expand="true"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.ICUFoldingFilterFactory"/>
>        <filter class="solr.KeywordMarkerFilterFactory"
>                protected="protwords.txt"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> The important feature here is the use of WordDelimiterFilterFactory, which 
> allows a search for "WiFi" to match an indexed term of "wi fi" (for example).
>
> The problem, of course, is that if a user accidentally introduces a case 
> change in their query, the query analyzer chain breaks it into multiple words 
> and no hits are found...  so a search for "exaMple" will look for "exa mple" 
> and fail.
>
> I've found two solutions that resolve this problem in the admin panel field 
> analysis tool:
>
>
> 1.)    Turn on catenateWords and catenateNumbers in the query analyzer - this 
> reassembles the user's broken word and allows a match.
>
> 2.)    Turn on preserveOriginal in the query analyzer - this passes through 
> the user's original query, which then gets cleaned up by the 
> ICUFoldingFilterFactory and allows a match.
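>
> As a sketch, the two variants differ from the query analyzer above only in 
> the WordDelimiterFilterFactory line (the other attributes are assumed to 
> stay exactly as posted):
>
>    <!-- Variant 1: catenate the split parts back together at query time -->
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>            catenateAll="0" splitOnCaseChange="1"/>
>
>    <!-- Variant 2: also emit the unsplit original token -->
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>            catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>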
>
> The problem is that in my real-world application, which uses DisMax, neither 
> of these solutions works.  It appears that even though (if I understand 
> correctly) the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, 
> the DisMax handler is combining them in a way that inappropriately requires 
> all of them to match.  For example, here's partial debugQuery output for 
> the "exaMple" search using DisMax and solution #2 above:
>
>    "parsedquery":"+DisjunctionMaxQuery((genre:\"(exampl exa) mple\"^300.0 | 
> title_new:\"(exampl exa) mple\"^100.0 | topic:\"(exampl exa) mple\"^500.0 | 
> series:\"(exampl exa) mple\"^50.0 | title_full_unstemmed:\"(example exa) 
> mple\"^600.0 | geographic:\"(exampl exa) mple\"^300.0 | contents:\"(exampl 
> exa) mple\"^10.0 | fulltext_unstemmed:\"(example exa) mple\"^10.0 | 
> allfields_unstemmed:\"(example exa) mple\"^10.0 | title_alt:\"(exampl exa) 
> mple\"^200.0 | series2:\"(exampl exa) mple\"^30.0 | title_short:\"(exampl 
> exa) mple\"^750.0 | author:\"(example exa) mple\"^300.0 | title:\"(exampl 
> exa) mple\"^500.0 | topic_unstemmed:\"(example exa) mple\"^550.0 | 
> allfields:\"(exampl exa) mple\" | author_fuller:\"(example exa) mple\"^150.0 
> | title_full:\"(exampl exa) mple\"^400.0 | fulltext:\"(exampl exa) mple\")) 
> ()",
>
> Obviously, that is not what I want - ideally it would be something like 
> 'exampl OR "exa mple"'.
>
> I also read about the autoGeneratePhraseQueries setting, but that seems to 
> take things way too far in the opposite direction - if I set that to false, 
> then I get matches for any individual token; i.e. example OR exa OR mple - 
> not good at all!
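>
> For reference, that setting is an attribute on the field type itself; a 
> sketch, assuming the analyzers stay as posted above:
>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>               autoGeneratePhraseQueries="false">
>      <!-- index and query analyzers unchanged -->
>    </fieldType>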
>
> I have a sinking suspicion that there is not an easy solution to my problem, 
> but this seems to be a fairly basic need; splitOnCaseChange is a useful 
> feature to have, but it's more valuable if it serves as an ALTERNATIVE search 
> rather than a necessary query munge.  Any thoughts?
>
> thanks,
> Demian
>
